Stat Learning Notes IV2
S.B. Vardeman
Iowa State University
Abstract
This set of notes is the most recent reorganization and update-in-progress of Modern Multivariate Statistical Learning course material developed 2009-2020 over 7 offerings of PhD-level courses and 4 offerings of an MS-level course in the Iowa State University Statistics Department, a short course given in the Statistics Group at Los Alamos National Lab, and two offered through Statistical Horizons LLC. Early versions of the courses were based mostly on the topics and organization of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, though very substantial parts benefited from Izenman's Modern Multivariate Statistical Techniques, and from Principles and Theory for Data Mining and Machine Learning by Clarke, Fokoué, and Zhang.
Contents
2.3 The Singular Value Decomposition of X . . . . . . . . . . . . . . 56
2.3.1 The Singular Value Decomposition and General Inner Product Spaces . . . . . 57
2.4 Matrices of Centered Columns and Principal Components . . . . 59
2.4.1 "Ordinary" Principal Components . . . . . . . . . . . . . 59
2.4.2 "Kernel" Principal Components . . . . . . . . . . . . . . . 63
2.4.3 Graphical (Spectral) Features . . . . . . . . . . . . . . . . 64
7.2 Projection Pursuit Regression . . . . . . . . . . . . . . . . . . . . 108
12 Basic Linear (and a Bit on Quadratic) Methods of Classification 156
12.1 Linear (and a bit on Quadratic) Discriminant Analysis . . . . . . 157
12.1.1 Dimension Reduction in LDA . . . . . . . . . . . . . . . . 159
12.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.3 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . 165
17 Some Methods of Unsupervised Learning 202
17.1 Association Rules/Market Basket Analysis . . . . . . . . . . . . . 202
17.1.1 The "Apriori Algorithm" and Use of its Output . . . . . . 204
17.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
17.2.1 Partitioning Methods ("Centroid"-Based Methods) . . . . 207
17.2.2 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . 208
17.2.3 (Mixture) Model-Based Methods . . . . . . . . . . . . . . 210
17.2.4 Biclustering . . . . . . . . . . . . . . . . . . . . . . . . . . 211
17.2.5 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 214
17.3 Multi-Dimensional Scaling . . . . . . . . . . . . . . . . . . . . . . 218
17.4 More on Principal Components and Related Ideas . . . . . . . . 220
17.4.1 "Sparse" Principal Components . . . . . . . . . . . . . . . 220
17.4.2 Non-negative Matrix Factorization . . . . . . . . . . . . . 221
17.4.3 Archetypal Analysis . . . . . . . . . . . . . . . . . . . . . 222
17.4.4 Independent Component Analysis . . . . . . . . . . . . . 222
17.4.5 Principal Curves and Surfaces . . . . . . . . . . . . . . . . 225
17.5 (Original) Google PageRanks . . . . . . . . . . . . . . . . . . . . 228
VI Miscellanea 230
18 Graphs as Representing Independence Relationships in Multivariate Distributions 230
18.1 Some Considerations for Directed Graphical Models . . . . . . . 231
18.2 Some Considerations for Undirected Graphical Models . . . . . . 233
18.2.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . 235
A.10 Section 3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 281
A.11 Section 3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 285
A.12 Section 4.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.13 Section 4.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 289
A.14 Section 4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 290
A.15 Section 5.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 291
A.16 Section 5.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 292
A.17 Section 5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.18 Section 6.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.19 Section 6.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 298
A.20 Section 7.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.21 Section 8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.22 Section 8.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 302
A.23 Section 9.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 303
A.24 Section 10.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 306
A.25 Section 10.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 307
A.26 Section 11.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 308
A.27 Section 11.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 310
A.28 Section 11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 311
A.29 Section 12.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 315
A.30 Section 12.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 317
A.31 Section 13.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 319
A.32 Section 13.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 319
A.33 Section 13.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 320
A.34 Section 14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 323
A.35 Section 15.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 323
A.36 Section 15.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 324
A.37 Section 15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 325
A.38 Section 17.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 326
A.39 Section 17.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 326
A.40 Section 17.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 328
A.41 Section 18.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . 329
A.42 "General/Comprehensive" Exercises . . . . . . . . . . . . . . . . 329
Part I
Introduction, Generalities, and
Some Background Material
1 Overview/Context
1.1 Notation and Terminology
These notes are about "statistics for 'big data'" (AKA "machine learning" and "data analytics"). We begin with the standard statistical notation and set-up where one has data from $N$ cases on $p$ or $p + 1$ variables, $x_1, x_2, \ldots, x_p$ and possibly $y$, portrayed below:
$$\begin{array}{cc}
 & \text{Variables} \\
\text{Cases} & \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2p} & y_2 \\
\vdots & \vdots & & \vdots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Np} & y_N
\end{pmatrix}
\end{array}$$
We write $x_i$ for the case/row $i$ set of $x$ values (in column vector form unless otherwise indicated) and
$$\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}, \qquad \underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \text{and} \qquad T = (X, Y)$$
Here we treat cases where at least one of N or p can be large and there is little
fundamental interest in model parameters or exactly how much we know about
them.
The ambivalence toward making statements concerning parameters of any probability models employed (including those that would describe the amount of variation in observables the model generates) is a fundamental difference between a machine learning point of view and that common in basic graduate statistics courses. This posture is perhaps sensible enough, as careful examination of a large training set will usually show that standard (tractable) probability models are highly imperfect descriptions of complex situations.
Standard versions of problems addressed here are:
small dataset), fail to really make full use of the available information. There is the possibility of either increasing "p" by (implicitly or explicitly) building additional features from existing ones and/or simply using more sophisticated and flexible forms for prediction (that go beyond, for example, the basic linear form in the input variables of multiple linear regression). But there is also the potential to "over-do" and effectively make p too large or the predictor too flexible. One must somehow match predictor complexity to the real information content of a (large) training set. It is this need and the challenge it represents that makes the area interesting and important.
$$\{x \in \Re^p \mid \|x\| \le 1\}$$
For one thing, "most" of these distributions are very near the surface of the solids. The cube $[-r, r]^p$ capturing (for example) half the volume of the cube (half the probability mass of the distribution) has
$$r = .5\,(.5)^{1/p}$$
which converges to $.5$ as $p$ increases. Essentially the same story holds for the uniform distribution on the unit ball. The radius capturing half the probability mass has
$$r = (.5)^{1/p}$$
which converges to $1$ as $p$ increases. Points uniformly distributed in these regions are mostly near the surface or boundary of the spaces.
Another interesting calculation concerns how large a sample must be in order for points generated from the uniform distribution on the ball or cube in an iid fashion to tend to "pile up" anywhere. Consider the problem of describing the distance from the origin to the closest of $N$ points drawn iid uniformly from the $p$-dimensional unit ball. With $R$ the distance from the origin of a single point uniform on the ball, $R$ has cdf
$$F(r) = \begin{cases} 0 & r < 0 \\ r^p & 0 \le r \le 1 \\ 1 & r > 1 \end{cases}$$
So if $R_1, R_2, \ldots, R_N$ are iid with this distribution, $M = \min\{R_1, R_2, \ldots, R_N\}$ has cdf
$$F_M(m) = \begin{cases} 0 & m < 0 \\ 1 - (1 - m^p)^N & 0 \le m \le 1 \\ 1 & m > 1 \end{cases}$$
This distribution has, for example, median
$$F_M^{-1}(.5) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$$
For, say, $p = 100$ and $N = 10^6$, the median of the distribution of $M$ is $.87$.
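A quick numerical check of this median (a sketch only, with $p$ and $N$ as above treated as adjustable) is:

```r
# Sketch: median of the minimum distance from the origin for N points
# iid uniform on the p-dimensional unit ball, formula vs. simulation
p <- 100; N <- 1e6

# closed-form median of M = min(R_1,...,R_N) from the cdf F_M above
median_formula <- (1 - 0.5^(1 / N))^(1 / p)

# each R_i has cdf r^p on [0,1], so R_i can be simulated as U^(1/p)
sim_min_dist <- function(N, p) min(runif(N)^(1 / p))
set.seed(1)
sims <- replicate(20, sim_min_dist(N, p))   # 20 simulated minima

c(formula = median_formula, simulated_median = median(sims))
```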
In retrospect, it's not really all that hard to understand that even "large" sets of points in $\Re^p$ must be sparse. After all, it's perfectly possible for $p$-vectors $u$ and $v$ to agree perfectly on all but one coordinate and be far apart in $p$-space. There simply are a lot of ways for two $p$-vectors to differ!
In addition to these kinds of considerations of sparsity, there is the fact that the potential complexity of functions of $p$ variables explodes exponentially in $p$. CFZ point this out and go on to note that for large $p$, all datasets exhibit multicollinearity (or its generalization to non-parametric fitting) and its accompanying problems of reliable fitting and extrapolation. These and related issues together constitute what is often called the curse of dimensionality.
The curse implies that even "large" $N$ doesn't save one and somehow make practical large-$p$ prediction only a trivial application of standard parametric or non-parametric regression methodology. And when $p$ is large, it is essentially guaranteed that if one uses a method that is "too" flexible in terms of the relationships between $x$ and $y$ that it permits, one will be found, real/fundamental/reproducible or not. That is, the (common for large $p$) possibility that a dataset is (sparse and) not really adequate to support the use of a (flexible) supervised statistical learning method can easily lead to overfitting. This is the presence of what appears to be a strong pattern in a (sparse) training set that generalizes/extrapolates poorly to cases outside the training set.
In light of the foregoing, one standard way of choosing among various "big data" statistical procedures for a given dataset is to define both 1) a reliable measure of estimated/predicted performance (like an estimated prediction mean square error) and 2) a measure of complexity (like an "effective number of fitted parameters") for a predictor. Then one attempts to optimize (by choice of complexity) the predicted performance. In light of the overfitting issue, the method predicting performance almost always employs some form of "holdout" sample, whereby performance is evaluated using data not employed in fitting/predictor development.²
²This approach potentially addresses the detection of both overfitting and "model bias" (where a fitted form is simply not adequate to represent the relationship between input variables and a target).
1.3 Some Initial Generalities About Prediction
1.3.1 Representing What is Known: Creating a Training Set for Prediction
We began exposition with an $N \times (p + 1)$ data matrix conceptually already in hand. It is important to say that in real predictive analytics problems, the reduction of all information available and potentially relevant to explaining $y$ to values of $p$ predictor variables³ (that encode relevant "features" of the $N$ cases) is an essential and highly critical activity. If one defines good features/variables (ones that effectively and parsimoniously represent the $N$ cases), then sound statistical methodology has a chance of being practically helpful. Poor initial choice of features limits how well one can hope to do in prediction.
This is particularly important to bear in mind where information from many disparate databases or sources is used to create the training set/data matrix $T$ available for statistical analysis. In this way, in many applications of modern data analytics the hard work begins substantially before the formal technical subjects addressed in these notes come into play, and the quality of the work in those initial steps is critical to ultimate success. All that follows in these notes takes the particular form of training set adopted by a data analyst as given, and that choice governs and limits what is possible in terms of effective prediction.
We should also note that in a typical analytics problem, variables represented by the columns of a data matrix are in different units and often represent conceptually different kinds of quantities (e.g., one might represent a voltage while another represents a distance and another represents a temperature). In some kinds of analyses this is completely natural and causes no logical problems. But in others (particularly ones based on inner products of data vectors or distances between them and/or where sizes of multipliers of particular variables in a linear combination of those variables are important) one gets fundamentally different results depending upon the scales used.
One surely doesn't want to be in the position of having ultimate predictions depend upon whether a distance (represented by a coordinate of $x$) is expressed in km or in nm. And the whole notion of the $\Re^2$ distance between two data vectors where the first coordinate of each is a voltage and the second is a temperature seems less than attractive. (What is $\sqrt{(3\text{ kV})^2 + (2\text{ K})^2}$ supposed to mean?)
A sensible approach to eliminating logical difficulties that arise in using methods where scaling/units of variables matters is to standardize predictors $x$ (and center any quantitative response variable, $y$) before beginning analysis. That is, if a raw feature $x$ has in the training set a sample standard deviation⁴ $s_x$ and a sample mean $\bar{x}$, one replaces it with a feature
$$x' = \frac{x - \bar{x}}{s_x}$$
(thereby making all features unit-less). Conclusions about standardized input $x'$ and centered response $y' = y - \bar{y}$ then translate naturally to conclusions about the raw variables via
$$x = s_x x' + \bar{x} \quad \text{and} \quad y = y' + \bar{y}$$
³This is at least one common meaning of the term "data mining."
⁴While it doesn't really matter which one uses, the "$N$" divisor in place of the "$N - 1$" divisor seems slightly simpler as it makes the columns have $\Re^N$ norm $\sqrt{N}$ as opposed to norm $\sqrt{N-1}$.
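As a small illustration (the data below are fabricated purely for this sketch), standardization, centering, and the back-translation above might look like:

```r
# Sketch: standardize a raw predictor x, center a response y, and
# translate a conclusion about (x', y') back to the raw scales.
set.seed(2)
x <- rnorm(50, mean = 100, sd = 15)   # a raw feature
y <- 3 + 0.2 * x + rnorm(50)          # a raw response

xbar <- mean(x); sx <- sd(x); ybar <- mean(y)  # sd() uses the N-1 divisor;
x_std  <- (x - xbar) / sx                      # per the footnote, either is fine
y_cent <- y - ybar

# a conclusion (a fitted slope) on the (x', y') scale ...
b <- coef(lm(y_cent ~ x_std - 1))[1]

# ... translates back via x = sx*x' + xbar and y = y' + ybar,
# i.e. y = ybar + (b/sx)*(x - xbar) on the raw scales
c(std_scale_slope = unname(b), raw_scale_slope = unname(b / sx))
```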
$$f(x) = \mathrm{E}[y|x]$$
Classification  In a classification context, where $y$ takes values in $\mathcal{G} = \{1, 2, \ldots, K\}$ (or, completely equivalently, $\mathcal{G} = \{0, 1, \ldots, K-1\}$), one might use the (0-1) loss function
$$L(\hat{y}, y) = I[\hat{y} \ne y]$$
An optimal $f$ corresponding to form (1) is then
$$f(x) = \arg\min_a \sum_{v \ne a} P[y = v\,|\,x]$$
1.3.3 Nearest Neighbor Rules
One idea that creates a spectrum of predictor flexibilities (including extremely high ones) is to operate completely non-parametrically and to think that if $N$ is "big enough," something like
$$\frac{1}{\#\text{ of } i \text{ with } x_i = x} \sum_{i \text{ with } x_i = x} y_i \tag{5}$$
could serve as a predictor of $y$ at $x$. Since exact matches of training inputs $x_i$ to $x$ are rare, one instead lets $n_k(x)$ stand for the set of $k$ training inputs $x_i$ closest to $x$, takes $m(x)$ to be the average of the corresponding $y_i$, and uses
$$\hat{f}(x) = m(x)$$
One might hope that upon allowing $k$ to increase with $N$ (provided that $P$ is not too bizarre; one is counting, for example, on the continuity of $\mathrm{E}[y|x]$ in $x$) these could be effective predictors. They are surely (for small $k$) highly flexible predictors, and they and things like them often fail to be effective because of the curse of dimensionality. (In high dimensions, $k$-neighborhoods are almost always huge in terms of their extent. There are simply too many ways that a pair of training inputs $x_i$ can differ.)
It is worth noting that
$$m_a(x) = \frac{1}{k} \sum_{i \text{ with } x_i \in n_k(x)} I[y_i = a]$$
estimates
$$\mathrm{E}\left[I[y = a]\,|\,x\right] = P[y = a\,|\,x]$$
in a $K$-class model, and for some purposes knowing this is more useful than knowing the 0-1 loss $k$-nn classification rule.
Ultimately, one should view the $k$-nn idea as an important, almost deceptively simple, and highly useful one. $k$-nn rules are approximately optimal predictors (for both SEL and 0-1 loss problems) that span a full spectrum of complexities/flexibilities specified by the simple parameter $k$ (the neighborhood size). Whether or not they can be effective in a given application depends upon the size of $p$ and $N$ and the extent to which there is some useful structure latent in the distribution of $x$s in the input space (mitigating the effects of the curse of dimensionality).
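A minimal sketch of such a $k$-nn SEL predictor (plain R, Euclidean distance on standardized inputs assumed; the data and names below are invented for illustration):

```r
# Sketch: k-nearest-neighbor prediction under squared error loss.
# X is an N x p matrix of (standardized) training inputs, y the responses.
knn_predict <- function(x0, X, y, k = 5) {
  d <- sqrt(colSums((t(X) - x0)^2))   # Euclidean distances from x0 to each x_i
  nbrs <- order(d)[1:k]               # indices of the k nearest training inputs
  mean(y[nbrs])                       # m(x0): average response over n_k(x0)
}

# toy illustration with fabricated data
set.seed(3)
X <- matrix(rnorm(200 * 3), ncol = 3)
y <- X[, 1]^2 + X[, 2] + rnorm(200, sd = 0.2)
knn_predict(c(0, 0, 0), X, y, k = 10)
```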
Figure 1: Optimal, Restricted Optimal, and Fitted Predictors
we have
This says that Err for the training-set-dependent predictor is the sum of three terms. The first is the minimum possible error. The second is the non-negative difference between the best that is possible using a predictor constrained to be an element of $S$ and the absolute best that is possible. The third is the non-negative difference between Err (that involves averaging over the training-set-directed random choices of elements from $S$, none of which can have average loss over $(x, y)$ better than that of $g^*$) and the best that is possible using a predictor constrained to be an element of $S$ (namely the average loss of $g^*(x)$). So relationship (8) might be rewritten as
$$\text{Err} = \text{minimum possible error} + \text{modeling penalty} + \text{fitting penalty}$$
Err can be inflated because $S$ is too small (inducing model bias) or because the sample size and/or fitting method are inadequate to make $g_T$ consistently approximate $g^*$.
1.3.5 A More Detailed Decomposition for Err in SEL Prediction and Variance-Bias Trade-off
In the context of squared error loss, a more detailed decomposition of Err provides additional insight into the difficulty faced in building effective predictors. Note that a measure of the effectiveness of the predictor $\hat{f}$ at $x$ (under squared error loss) is what we might call
$$\text{Err}(x) \equiv \mathrm{E}_T\, \mathrm{E}\!\left[\left(\hat{f}(x) - y\right)^2 \Big|\, x\right] \tag{9}$$
For some purposes, other conditional versions of Err might be useful and appropriate. For example
$$\text{Err}_T \equiv \mathrm{E}_{(x,y)}\left(\hat{f}(x) - y\right)^2$$
A standard decomposition of the quantity (9) is
$$\text{Err}(x) = \mathrm{Var}_T\, \hat{f}(x) + \left(\mathrm{E}_T \hat{f}(x) - \mathrm{E}[y|x]\right)^2 + \mathrm{Var}[y|x]$$
The first quantity in this decomposition, $\mathrm{Var}_T\, \hat{f}(x)$, is the variance of the prediction at $x$. The second term, $\left(\mathrm{E}_T \hat{f}(x) - \mathrm{E}[y|x]\right)^2$, is a kind of squared bias of prediction at $x$. And $\mathrm{Var}[y|x]$ is an unavoidable variance in outputs at $x$. Highly flexible prediction forms may give small prediction biases at the expense of large prediction variances. One may need to balance the two off against each other when looking for a good predictor.
Now from expressions (10) and (11)
$$\text{Err} = \mathrm{E}_x\, \mathrm{Var}_T\, \hat{f}(x) + \mathrm{E}_x\left(\mathrm{E}_T \hat{f}(x) - \mathrm{E}[y|x]\right)^2 + \mathrm{E}_x\, \mathrm{Var}[y|x] \tag{12}$$
The first term on the right here is the average (according to the marginal of $x$) of the prediction variance at $x$. The second is the average squared prediction bias. And the third is the average conditional variance of $y$ (and is not under the control of the analyst choosing $\hat{f}(x)$). Consider a further decomposition of the second term.
Suppose that $T$ is used to select a function (say $g_T$) from some linear subspace, say $S = \{g\}$, of the space of functions $h$ with $\mathrm{E}_x (h(x))^2 < \infty$, and that ultimately one uses as a predictor
$$\hat{f}(x) = g_T(x)$$
Since linear subspaces are convex,
$$\bar{g} \equiv \mathrm{E}_T\, g_T = \mathrm{E}_T \hat{f} \in S$$
Further, suppose that
$$g^* \equiv \arg\min_{g \in S}\ \mathrm{E}_x\left(g(x) - \mathrm{E}[y|x]\right)^2$$
is the projection of (the function of $x$) $\mathrm{E}[y|x]$ onto the space $S$. Then write
$$h(x) = \mathrm{E}[y|x] - g^*(x)$$
so that
$$\mathrm{E}[y|x] = g^*(x) + h(x)$$
Then, it's a consequence of the facts that $\mathrm{E}_x\left(h(x)\, g(x)\right) = 0$ for all $g \in S$ and therefore that $\mathrm{E}_x\left(h(x)\, \bar{g}(x)\right) = 0$ and $\mathrm{E}_x\left(h(x)\, g^*(x)\right) = 0$, that
$$\begin{aligned}
\mathrm{E}_x\left(\mathrm{E}_T \hat{f}(x) - \mathrm{E}[y|x]\right)^2 &= \mathrm{E}_x\left(\bar{g}(x) - \left(g^*(x) + h(x)\right)\right)^2 \\
&= \mathrm{E}_x\left(\bar{g}(x) - g^*(x)\right)^2 + \mathrm{E}_x\left(h(x)\right)^2 - 2\,\mathrm{E}_x\left(\left(\bar{g}(x) - g^*(x)\right) h(x)\right) \\
&= \mathrm{E}_x\left(\mathrm{E}_T \hat{f}(x) - g^*(x)\right)^2 + \mathrm{E}_x\left(\mathrm{E}[y|x] - g^*(x)\right)^2
\end{aligned} \tag{13}$$
The first term on the right in the last line of display (13) is an average squared fitting bias, measuring how well the average (over $T$) predictor function approximates the element of $S$ that best approximates the conditional mean function. This is a measure of how appropriately the training data are used to pick out elements of $S$. The second term on the right is an average squared model bias, measuring how well it is possible to approximate the conditional mean function $\mathrm{E}[y|x]$ by an element of $S$. This is controlled by the size of $S$, or effectively the flexibility allowed in the form of $\hat{f}$. Average squared prediction bias can thus be large because the form fit is not flexible enough, or because a poor fitting method is employed.
Then using expressions (12) and (13)
$$\text{Err} = \mathrm{E}_x\, \mathrm{Var}[y|x] + \mathrm{E}_x\left(\mathrm{E}[y|x] - g^*(x)\right)^2 + \mathrm{E}_x\left(\mathrm{E}_T \hat{f}(x) - g^*(x)\right)^2 + \mathrm{E}_x\, \mathrm{Var}_T\, \hat{f}(x)$$
So this SEL decomposition of Err is related to the general one in display (8) in that
$$\text{modeling penalty} = \text{expected (across } x\text{) squared model bias} = \mathrm{E}_x\left(\mathrm{E}[y|x] - g^*(x)\right)^2$$
and
$$\text{fitting penalty} = \begin{array}{c}\text{expected (across } x\text{)} \\ \text{squared fitting bias}\end{array} + \begin{array}{c}\text{expected (across } x\text{)} \\ \text{prediction variance}\end{array} = \mathrm{E}_x\left(\mathrm{E}_T \hat{f}(x) - g^*(x)\right)^2 + \mathrm{E}_x\, \mathrm{Var}_T\, \hat{f}(x)$$
Note that:
1. what is under the control of a data analyst, namely the modeling and fitting penalties, has elements of both bias and variance and
2. complex predictors tend to have low bias and high variance in comparison to simple ones
The most obvious/elementary means of approximating Err is the so-called "training error"
$$\text{err} = \frac{1}{N}\sum_{i=1}^N L\left(\hat{f}(x_i),\ y_i\right) \tag{14}$$
The problem is that err is no good estimator of Err (or any other sensible quantification of predictor performance). It typically decreases with increased complexity (without an increase for large complexity), and fails to reliably indicate performance outside the training sample. The situation is like that portrayed in Figure 2.
The fundamental point here is that one cannot both "fit" and "test" on the same dataset and arrive at a reliable assessment of predictor efficacy. Behaving in such manner will almost always suggest use of a predictor that is too complex and has a relatively large "test error" Err.
The existing practical options for evaluating likely performance of a predictor (and guiding choice of appropriate complexity) then include the following.
1. One might employ some function of err that is a better indicator of likely predictor performance, like Mallows' $C_p$, "AIC," and "BIC."
2. In genuinely large $N$ contexts, one might hold back some random sample of the training data to serve as a "test set," fit to produce $\hat{f}$ on the remainder, and use
$$\frac{1}{\text{size of the test set}}\sum_{i\, \in\, \text{the test set}} L\left(\hat{f}(x_i),\ y_i\right)$$
3. One might employ sample re-use methods to estimate Err and guide choice of complexity. Cross-validation and bootstrap ideas are used here.
We’ll say more about these possibilities later, but here describe the most
important of them, so-called cross-validation. K-fold cross-validation consists
of
a problem only if it is not constant across choices of predictors and their complexities.
Notice that unless $K = N$, even for fixed training set $T$, $CV(\hat{f})$ is random, owing to its dependence upon random assignment of training cases to folds. It is thus highly attractive in cases where $K < N$ is used, to replace $CV(\hat{f})$ with
This is sometimes called the "one standard error rule of thumb" and is presumably motivated by recognition of the uncertainty involved in cross-validation (deriving from the randomness of 1) the selection of the training set and 2) the partitioning of it into folds) and the desire to avoid overfitting. But (in light of the dependence of the $CV_k(\hat{f})$'s) the validity of the supposed standard error is at best quite approximate, and then the appropriateness of a "one standard error rule" is not at all obvious.
The most obvious, aggressive, and logically defensible way of using $CV(\hat{f})$ (or $\overline{CV}(\hat{f})$) to choose a predictor is to simply use the $\hat{f}$ minimizing the function $CV(\cdot)$ (or $\overline{CV}(\cdot)$). We will call this way of operating a "pick-the-(cross-validation error)-winner rule."
It is an important and somewhat subtle point that if
$$\tilde{f} = \arg\min_{\hat{f}}\ CV(\hat{f})$$
then although $CV(\hat{f})$ (or $\overline{CV}(\hat{f})$) can legitimately guide the choice of $\hat{f}$, its use is then actually part of a larger program of "predictor development" than that represented by any single argument of $CV(\cdot)$ (or $\overline{CV}(\cdot)$). That being the case, in order to assess the likely performance of $\tilde{f}$ via cross-validation, inside each remainder $T - T_k$ one must
1. split into $K$ folds,
2. fit on the $K$ remainders,
3. predict on the folds and make a cross-validation error,
4. pick a winner for the function in 3., say $\tilde{f}_k$, and
5. then predict on $T_k$ using $\tilde{f}_k$.
It is the values $\tilde{f}_{k(i)}(x_i)$ that are used in form (15) to predict the performance of a predictor derived from optimizing a cross-validation error across a set of predictors.
The basic principle at work here (and always) in making valid cross-validation errors is that whatever one will ultimately do in the entire training set to make a predictor must be redone (in its entirety!) in every remainder and applied to the corresponding fold.
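The maxim is perhaps easiest to absorb in code. Below is a rough sketch (not the notes' own implementation) of making an honest cross-validation error for a pick-the-(cross-validation error)-winner rule, here over a toy family of polynomial regressions; note that the entire inner cross-validation and winner-picking is redone inside every remainder.

```r
# Sketch: honest ("nested") cross-validation error for a pick-the-winner rule.
# Candidate predictors: polynomial regressions of degree 1..5 (toy family).
set.seed(4)
N <- 200
x <- runif(N, -2, 2); y <- sin(2 * x) + rnorm(N, sd = 0.3)
dat <- data.frame(x = x, y = y)
degrees <- 1:5

cv_err <- function(d, degree, K = 5) {          # ordinary K-fold CV error
  fold <- sample(rep(1:K, length.out = nrow(d)))
  mean(sapply(1:K, function(k) {
    fit <- lm(y ~ poly(x, degree), data = d[fold != k, ])
    mean((d$y[fold == k] - predict(fit, d[fold == k, ]))^2)
  }))
}

K <- 5
outer_fold <- sample(rep(1:K, length.out = N))
outer_errs <- sapply(1:K, function(k) {
  remainder <- dat[outer_fold != k, ]
  # redo the ENTIRE predictor-development process inside the remainder:
  inner_cv <- sapply(degrees, cv_err, d = remainder)
  winner <- degrees[which.min(inner_cv)]        # winner picked in the remainder
  fit <- lm(y ~ poly(x, winner), data = remainder)
  mean((dat$y[outer_fold == k] - predict(fit, dat[outer_fold == k, ]))^2)
})
mean(outer_errs)   # honest CV error for the pick-the-winner rule
```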
winner predictor.
Call the function optimizing this objective (over choices of $g$) for a given $\lambda$ by the name $\hat{f}_\lambda$. The smaller is $\lambda$, the more complex will be $\hat{f}_\lambda$.
As a simple example, consider $p = 1$ SEL prediction on $\Re$ with standardized input $x$. With $S = \left\{\beta_1 x + \beta_2 x^2 + \beta_3 x^3 \mid \beta_1, \beta_2, \beta_3 \text{ are all real}\right\}$, using $J[g] = \beta_2^2 + \beta_3^2$ penalizes lack of linearity in a fitted cubic. Small $\lambda$ produces essentially least squares fitting of a cubic and large $\lambda$ produces least squares fitting of a line.
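Under the assumption that the penalized criterion is residual sum of squares plus $\lambda J[g]$ (and writing the penalty weight as $\lambda$), the minimizer has the closed ridge-type form $(X'X + \lambda P)^{-1}X'Y$ with $P = \mathrm{diag}(0, 1, 1)$, as in the following sketch:

```r
# Sketch: penalized least squares fit of g(x) = b1*x + b2*x^2 + b3*x^3
# minimizing  sum (y_i - g(x_i))^2 + lambda * (b2^2 + b3^2).
# (Criterion form and the symbol lambda are assumptions about the text.)
penalized_cubic <- function(x, y, lambda) {
  X <- cbind(x, x^2, x^3)              # basis for S
  P <- diag(c(0, 1, 1))                # J[g] = b2^2 + b3^2
  beta <- solve(t(X) %*% X + lambda * P, t(X) %*% y)
  function(x0) cbind(x0, x0^2, x0^3) %*% beta
}

set.seed(5)
x <- scale(runif(100, -1, 1))[, 1]     # standardized input
y <- 2 * x + 0.5 * x^3 + rnorm(100, sd = 0.3); y <- y - mean(y)

f_small <- penalized_cubic(x, y, lambda = 1e-6)   # essentially LS cubic
f_big   <- penalized_cubic(x, y, lambda = 1e6)    # essentially LS line
c(f_small(0.5), f_big(0.5))
```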
Applying this penalized fitting to each remainder $T - T_k$ to produce $K$ predictors $\hat{f}_\lambda^{\,k}$, one can as in display (15) derive a cross-validation error corresponding to $\lambda$ as
$$CV(\lambda) = \frac{1}{N}\sum_{i=1}^N L\left(\hat{f}_\lambda^{\,k(i)}(x_i),\ y_i\right)$$
(There is no loss of generality here. These could be densities with respect to the simple arithmetic average of the $K$ class-conditional distributions.) There is important statistical theory concerning minimal sufficiency that promises that regardless of the original dimensionality of $x$ (namely, $p$) there is a $(K-1)$-dimensional feature that carries all available information about $y$ encoded in $x$.
For $K = 2$ the 1-dimensional likelihood ratio statistic
$$L(x) = \frac{p(x|1)}{p(x|0)} \tag{17}$$
is "minimal sufficient." If one knew the value of $L(x)$ one would know all $x$ has to say about $y$. An optimal single feature is $L(x)$. In a practical problem, the closer that one can come to engineering features "like" $L(x)$, the more efficiently/parsimoniously one represents the input vector $x$. Of course, any monotone transform of $L(x)$ is equally as good as $L(x)$.
For $K > 2$, roughly speaking the $K - 1$ ratios $p(x|k)/p(x|0)$ (taken together) form a minimal sufficient statistic for the model. This potentially isn't quite true because of possible problems where $p(x|0) = 0$. But it is true that with $s(x) = \sum_{k=0}^{K-1} p(x|k)$ the vector
$$\left(\frac{p(x|1)}{s(x)},\ \frac{p(x|2)}{s(x)},\ \ldots,\ \frac{p(x|K-1)}{s(x)}\right)$$
(and many variants of it) is (are) minimal sufficient. To the extent that one can engineer features approximating these $K - 1$ ratios⁹, one can parsimoniously represent the input vector.
Let $N_{x,y}$ be the number of training cases with $x_i = x$ and $y_i = y$, let $N_{\cdot,y} = \sum_x N_{x,y}$ be the number of training cases with $y_i = y$, and let $N_{x,\cdot} = \sum_y N_{x,y}$ be the number of training cases with $x_i = x$. The vector function of $x$
$$\widehat{P}(y|x) = \frac{1}{N_{x,\cdot}}\left(N_{x,1},\ N_{x,2},\ \ldots,\ N_{x,K-1}\right) \tag{19}$$
corresponding empirical mean output
$$\bar{y}(x) = \frac{1}{N_x}\sum_{i \text{ with } x_i = x} y_i$$
A very clever and practically powerful development in machine learning has been the realization that for some purposes, it is not necessary to map from $\Re^p$ to a Euclidean space, but that mapping to a linear space of functions may be helpful. That is, creation of new numerical features based on input vector $x$ can be thought of as a transformation
$$T: \Re^p \to \Re^q$$
or, more generally, as a transformation
$$T: \Re^p \to \mathcal{A}$$
is non-negative definite. Then the space of functions that are linear combinations of "slices" of $K(x, z)$, i.e. functions of $x$ of the form
$$\sum_{j=1}^M c_j K(x, z_j),$$
can be made an (abstract) inner product space $\mathcal{A}$ by defining
$$\langle K(\cdot, z_1),\ K(\cdot, z_2)\rangle_{\mathcal{A}} \equiv K(z_1, z_2) \tag{22}$$
and using the bilinearity of any inner product to see that then of necessity
$$\left\langle \sum_{j=1}^{M} c_{1j} K(\cdot, z_j),\ \sum_{j=1}^{M} c_{2j} K(\cdot, z_j)\right\rangle_{\mathcal{A}} = \sum_{j=1}^{M}\sum_{j'=1}^{M} c_{1j}\, c_{2j'}\, K(z_j, z_{j'}) = c_1' K c_2$$
for $c_1' = (c_{11}, \ldots, c_{1M})$, $c_2' = (c_{21}, \ldots, c_{2M})$, and $M \times M$ matrix $K$ with entries $K(z_i, z_j)$. This has the important special case that for $c = c_1 = c_2$
$$\left\|\sum_{j=1}^M c_j K(\cdot, z_j)\right\|_{\mathcal{A}}^2 = \left\langle \sum_{j=1}^M c_j K(\cdot, z_j),\ \sum_{j=1}^M c_j K(\cdot, z_j)\right\rangle_{\mathcal{A}} = c' K c$$
Of course, since $K$ defines the inner product in $\mathcal{A}$ it also defines the distance between $\sum_{j=1}^M c_{1j} K(\cdot, z_j)$ and $\sum_{j=1}^M c_{2j} K(\cdot, z_j)$,
$$d_{\mathcal{A}}\left(\sum_{j=1}^M c_{1j} K(\cdot, z_j),\ \sum_{j=1}^M c_{2j} K(\cdot, z_j)\right) = \sqrt{(c_1 - c_2)' K (c_1 - c_2)}$$
¹⁰See Section 2.1 for more concerning the meaning of this language.
Kernel Mechanics  A direct way of producing a kernel function is through a Euclidean inner product of vectors of "features." That is, if $\phi: \mathcal{X} \to \Re^m$ (so that component $j$ of $\phi$, $\phi_j$, creates the univariate real feature $\phi_j(x)$) then for $\langle\cdot, \cdot\rangle$ the usual Euclidean inner product (dot product),
$$K(x, z) = \langle \phi(x),\ \phi(z)\rangle$$
is a kernel function.
There are several probabilistic and statistical arguments that can lead to forms for kernel functions. For example, a useful fact from probability theory (Bochner's Theorem) says that characteristic functions for $p$-dimensional distributions are non-negative definite complex-valued functions of $s \in \Re^p$. So if $\psi(s)$ is a real-valued characteristic function, then
$$K(x, z) = \psi(x - z)$$
is a kernel function on $\Re^p \times \Re^p$. Related to this line of thinking are lists of standard characteristic functions (that in turn produce kernel functions) and theorems about conditions sufficient to guarantee that a real-valued function is a characteristic function. For example, each of the following is a real characteristic function for a univariate random variable (that can lead to a kernel on $\Re^1 \times \Re^1$):
And one theorem about sufficient conditions for a real-valued function $\psi$ on $\Re^1$ to be a characteristic function says that if $\psi$ is symmetric ($\psi(-t) = \psi(t)$), $\psi(0) = 1$, and $\psi$ is decreasing and convex on $[0, \infty)$, then $\psi$ is the characteristic function of some distribution on $\Re^1$. (See Chung's A Course in Probability Theory, page 191.)
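As a small sketch of this route to kernels, the standard normal characteristic function $\psi(t) = \exp(-t^2/2)$ yields the Gaussian kernel $K(x, z) = \exp(-(x-z)^2/2)$ on $\Re^1 \times \Re^1$, and the non-negative definiteness of its Gram matrix at a set of points can be checked numerically:

```r
# Sketch: a kernel built from a real characteristic function.
# psi is the N(0,1) characteristic function, so K(x, z) = psi(x - z)
# is the (one-dimensional) Gaussian kernel.
psi <- function(t) exp(-t^2 / 2)
K   <- function(x, z) psi(x - z)

x <- seq(-3, 3, length.out = 25)
G <- outer(x, x, K)        # Gram matrix of K at these points

# non-negative definiteness: all eigenvalues >= 0 (up to rounding error)
min(eigen(G, symmetric = TRUE, only.values = TRUE)$values)
```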
Bishop points out two constructions motivated by statistical modeling that yield kernels that have been used in the machine learning literature. One is this. For a parametric model on (a potentially completely abstract) $\mathcal{X}$, consider densities $p(x|\theta)$ that when treated as functions of $\theta$ are likelihood functions (for various possible observed $x$). Then for a distribution $G$ for $\theta \in \Theta$,
$$K(x, z) = \int p(x|\theta)\, p(z|\theta)\, dG(\theta)$$
is a kernel. This is the inner product in the space of square integrable functions on the probability space with measure $G$ of the two likelihood functions. In this space, the distance between the functions (of $\theta$) $p(x|\theta)$ and $p(z|\theta)$ is
$$\sqrt{\int \left(p(x|\theta) - p(z|\theta)\right)^2 dG(\theta)}$$
Then obviously, the corresponding kernel functions are
$$K^0(x, z) = \int \ell_x^0(\theta)\,\ell_z^0(\theta)\, dG(\theta) \quad \text{or} \quad K'(x, z) = \int \ell_x'(\theta)\,\ell_z'(\theta)\, dG(\theta) \quad \text{or} \quad K''(x, z) = \int \ell_x''(\theta)\,\ell_z''(\theta)\, dG(\theta)$$
(Of these three possibilities, centering alone is probably the most natural from a statistical point of view. It is the "shape" of a loglikelihood that is important in statistical context, not its absolute level. Two loglikelihoods that differ by a constant are equivalent for most statistical purposes. Centering perfectly lines up two loglikelihoods that differ by a constant.)
In a regular statistical model for $x$ taking values in $\mathcal{X}$ with Euclidean parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, the $k \times k$ Fisher information matrix, say $I(\theta)$, is non-negative definite. Then with score function
$$\nabla \ln p(x|\theta) = \begin{pmatrix} \dfrac{\partial}{\partial \theta_1} \ln p(x|\theta) \\ \dfrac{\partial}{\partial \theta_2} \ln p(x|\theta) \\ \vdots \\ \dfrac{\partial}{\partial \theta_k} \ln p(x|\theta) \end{pmatrix}$$
(for any fixed $\theta$) the function
$$K(x, z) = \left(\nabla \ln p(x|\theta)\right)' \left(I(\theta)\right)^{-1} \nabla \ln p(z|\theta)$$
has been called the "Fisher kernel" in the machine learning literature. (It follows from Bishop's 7. and 8. that this is indeed a kernel function.) Note that $K(x, x)$ is essentially the score test statistic for a point null hypothesis about $\theta$. The implicit feature vector here is the $k$-dimensional score function (evaluated at some fixed $\theta$, a basis for testing about $\theta$), and rather than Euclidean norm, the norm $\|u\| \equiv \sqrt{u'\left(I(\theta)\right)^{-1} u}$ is implicitly in force for judging the size of differences in feature vectors.
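A minimal sketch for one very simple model of our own choosing (a single $N(\theta, 1)$ observation, so the score is $x - \theta$ and $I(\theta) = 1$):

```r
# Sketch: the Fisher kernel for the model x ~ N(theta, 1).
# Here d/dtheta log p(x|theta) = x - theta and I(theta) = 1, so
# K(x, z) = (x - theta) * 1 * (z - theta), evaluated at a fixed theta.
fisher_kernel <- function(x, z, theta = 0) (x - theta) * (z - theta)

x <- rnorm(10)
G <- outer(x, x, fisher_kernel, theta = 0)    # Gram matrix (rank 1 here)
min(eigen(G, symmetric = TRUE, only.values = TRUE)$values)  # >= 0
```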
in an implicit feature space (and subsequently clustering or deriving classifiers, and so on).
If one treats documents as simply sets of words (ignoring spaces and punctuation and any kind of order of words), one simple set of features for documents $d_1, d_2, \ldots, d_N$ is a set of counts of word frequencies. That is, for a set of $p$ words appearing in at least one document, one might take
$$u = b_1 b_2 \cdots b_n \quad \text{where each } b_i \in \mathcal{A}$$
will be huge and then $X$ huge and sparse (for ordinary $N$ and $|s|$). And in many contexts, sequence/order structure is not so "local" as to be effectively expressed by only frequencies of $n$-grams for small $n$.
One idea that seems to be currently popular is to define a set of interesting strings, say $U = \{u_i\}_{i=1}^p$, and look for their occurrence anywhere in a document, with the understanding that they may be realized as substrings of longer strings. That is, when looking for string $u$ (of length $n$) in a document $s$, we count every different substring of $s$ (say $s' = s_{i_1} s_{i_2} \cdots s_{i_n}$) for which
$$s' = u$$
It further seems that it's common to normalize the rows of $X$ by the usual Euclidean norm, producing in place of $x_{ij}$ the value
$$\frac{x_{ij}}{\|x_i\|} \tag{25}$$
This notion of using features (24) or normalized features (25) looks attractive, but potentially computationally prohibitive, particularly since the "interesting set" of strings $U$ is often taken to be $\mathcal{A}^n$. One doesn't want to have to compute all features (24) directly and then operate with the very large matrix $X$. But just as we were reminded in Section 2.4.2, it is only $XX'$ that is required to find principal components of the features (or to define SVM classifiers or any other classifiers or clustering algorithms based on principal components). So if there is a way to efficiently compute or approximate inner products for rows of $X$ defined by form (24), namely (with $s$ and $t$ standing for documents $i$ and $i'$)
$$\langle x_i, x_{i'}\rangle = \sum_{u \in \mathcal{A}^n} \left(\sum_{s_{l_1} s_{l_2} \cdots s_{l_n} = u} \lambda^{l_n - l_1 + 1}\right)\left(\sum_{t_{m_1} t_{m_2} \cdots t_{m_n} = u} \lambda^{m_n - m_1 + 1}\right)$$
it might be possible to employ this idea. And if the inner products $\langle x_i, x_{i'}\rangle$ can be computed efficiently, then so can the inner products
$$\left\langle \frac{1}{\|x_i\|}\, x_i,\ \frac{1}{\|x_{i'}\|}\, x_{i'}\right\rangle = \frac{\langle x_i, x_{i'}\rangle}{\sqrt{\langle x_i, x_i\rangle\, \langle x_{i'}, x_{i'}\rangle}}$$
needed to employ $XX'$ for the normalized features (25). For what it is worth, it is in vogue to call the function of documents $s$ and $t$ defined by
$$K(s, t) = \sum_{u \in \mathcal{A}^n}\ \sum_{s_{l_1} s_{l_2} \cdots s_{l_n} = u}\ \sum_{t_{m_1} t_{m_2} \cdots t_{m_n} = u} \lambda^{\,l_n - l_1 + m_n - m_1 + 2}$$
the String Subsequence Kernel and then call the matrix $XX' = \left(\langle x_i, x_j\rangle\right) = \left(K(s_i, s_j)\right)$ the Gram matrix for that "kernel." The good news is that there are fairly simple recursive methods for computing $K(s, t)$ exactly in $O(n|s||t|)$ time and that there are approximations that are even faster (see the 2002 Journal of Machine Learning Research paper of Lodhi et al.). That makes the implicit use of features (24) or normalized features (25) possible in many text processing problems.
which is, for example, not expressible in terms of two regions in $\Re^2$ with linear boundaries. However, if one defines the nonlinear transform $T: \Re^2 \to \Re^5$ by
$$T(x) = \left(x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2\right)'$$
¹¹The theory of statistical sufficiency is concerned with what non-one-to-one transforms do about the signal. But it does replace the signal with a set of variables that are potentially more convenient than the original signal itself in terms of existing signal processing methodology.
then a very small amount of algebra shows that the classifier can be written in terms of a linear combination of coordinates of $T(x)$ as
$$(-4, -4, 1, 1, 0)\, T(x) \ge -7$$
That is, thought of as defined in terms of $T(x) \in \Re^5$ (in terms of the input transformed to the higher-dimension space $\Re^5$) the classifier is defined by a very simple linear (inner product) operation.
The toy example is instructive because it has characteristics of a strategy that is commonly effective in practice. That is one where a nonlinear transform is employed to map training cases into a linear space in which simple operations are used to define a predictor. (It should be noted that in the event that $x$ takes values in a linear space like $\Re^2$, a linear transform has no potential to provide the kind of advantage seen in the hypothetical example. That is because a linear transform can only map the training set to a linear subspace of dimension no more than that spanned by the original training set.)
Notice that the thinking here substantially blurs any perceived line between "feature engineering" and "predictor fitting." They are both really parts of a single process and one cannot be treated as inconsequential to the production of the test error, Err (nor ignored in attempts to represent it empirically through cross-validation).
It is also important to think clearly about what goes into the making of a transformed feature $T(x)$. The intent of the notation $T(x)$ is that the form of the function $T(\cdot)$ does not depend upon the training set. But sometimes data "pre-processing" effectively violates this understanding, making the form of the function training-set-dependent. One might use notation like $T(T, \cdot)$ to represent this, and this issue must be carefully handled in cross-validation.
That is, if one is contemplating use of a predictor built upon a training set $(T(T, x_1), y_1), (T(T, x_2), y_2), \ldots, (T(T, x_N), y_N)$ and hopes to use $K$-fold cross-validation to reliably predict predictor performance, fitting on remainder $k$ must be done using not values $(T(T, x_i), y_i)$ for cases in remainder $k$, but rather values $(T(T - T_k, x_i), y_i)$. For example, as mentioned in Section 1.3.6, when building predictors based on standardized inputs, standardization must be done afresh for each new remainder! If the training set will be used to choose a parameter of a kernel for use in defining abstract features associated with input vectors, the same kind of choice must be made one remainder at a time, etc. Failure to do so breaks the cross-validation paradigm and the basic maxim that whatever is ultimately going to be done to make predictions must be done in each individual remainder, i.e. must be done $K$ times. Typically, failure to follow this maxim will produce unduly optimistic (and substantially wrong) supposed "cross-validation errors."
This matter seems particularly important to recognize in cases (like those where a training set will be used to make approximate likelihood ratios per Section 1.4.2) where the responses in the training set or remainder (not the inputs only) are involved in the making of new features. The issue also raises the question of exactly how best to use a training set (or remainder) to both 1) choose $T$ as a function of the training set (or remainder) and then 2) build a predictor. Two possibilities are to 1) use the entire training set (or remainder) in both steps, or to 2) randomly split the training set (or remainder) into two parts, the first for use in choosing the form of $T$ and the other for use in subsequently building the prediction algorithm. Which of these (or some other version of them) is likely to be most effective is not clear. What is clear is that care must be taken to "separately do in each remainder in a cross-validation all that will be ultimately done with the full training set" if one is to produce reliable cross-validation errors.
By far, the most important version of this is the K = 2 case. And for this case,
there are some very important additional general insights that we proceed to
discuss.
prove useful. For the time being, employ the first and abbreviate $P[y = 1]$ as $\pi$ (so that $P[y = 0] = 1 - \pi$), and write $p(x|1)$ and $p(x|0)$ for the two class-conditional densities for $x$. Then
$$P[y = 1|x] = \frac{\pi\, p(x|1)}{\pi\, p(x|1) + (1-\pi)\, p(x|0)} \quad \text{and} \tag{26}$$
$$P[y = 0|x] = \frac{(1-\pi)\, p(x|0)}{\pi\, p(x|1) + (1-\pi)\, p(x|0)}$$
An optimal classifier is then
makes connection to classical statistical theory and identifies the optimal classifier as a Neyman-Pearson test of the simple hypotheses $H_0: y = 0$ versus $H_a: y = 1$ with "cut-point" the ratio $(1-\pi)/\pi$.
As a slight generalization of this development, note that for $l_0 \ge 0$ and $l_1 \ge 0$ and an asymmetric loss
$$L(\hat{y}, y) = l_y\, I[\hat{y} \ne y]$$
an optimal classifier is
$$f(x) = I\left[L(x) > \frac{(1-\pi)\, l_0}{\pi\, l_1}\right]$$
In fact, for a completely general choice of four losses $L(\hat{y}, y)$ in a 2-class classification model, it is easy enough to argue that for $\Delta \equiv L(1,0) - L(0,0) - L(1,1) + L(0,1)$, $\delta \equiv L(1,0) - L(0,0)$, and $R = \left|\delta/\Delta\right|$, an optimal classifier is
$$f(x) = I\left[P[y = 1|x] > R\right]$$
which for $R \in (0, 1)$ is
$$f(x) = I\left[L(x) > \frac{(1-\pi)\,R}{\pi\,(1-R)}\right]$$
and that
$$L(x) = \frac{1-\pi}{\pi}\cdot\frac{P[y = 1|x]}{1 - P[y = 1|x]}$$
So, for the time being subscripting $P$ with either $\pi$ or $\tilde{\pi}$ depending upon which marginal probability of $y = 1$ is operating (in models with the same class-conditional densities $p(x|1)$ and $p(x|0)$),
$$P_\pi[y = 1|x] = \frac{\pi\,(1-\tilde{\pi})\,\dfrac{P_{\tilde{\pi}}[y = 1|x]}{1 - P_{\tilde{\pi}}[y = 1|x]}}{\pi\,(1-\tilde{\pi})\,\dfrac{P_{\tilde{\pi}}[y = 1|x]}{1 - P_{\tilde{\pi}}[y = 1|x]} + \tilde{\pi}\,(1-\pi)} \tag{29}$$
¹³The terminology of "extreme class imbalance" is commonly used.
from which it is obvious how to translate an estimate of $P_{\tilde{\pi}}[y = 1|x]$ made from a synthetically balanced training set to one for the real situation described by $\pi$. Further, an optimal classifier (27) or (28) is
$$I\left[\frac{P_{\tilde{\pi}}[y = 1|x]}{1 - P_{\tilde{\pi}}[y = 1|x]} > \frac{\tilde{\pi}\,(1-\pi)}{\pi\,(1-\tilde{\pi})}\right]$$
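As a small sketch, display (29) is easy to package as a function (the argument names below are invented):

```r
# Sketch: translate an estimated P_pitilde[y=1|x] (built from a synthetically
# balanced training set with P[y=1] = pi_tilde) into P_pi[y=1|x] for the
# real class probability pi, per display (29).
translate_prob <- function(p_tilde, pi_real, pi_tilde = 0.5) {
  odds <- p_tilde / (1 - p_tilde)
  num  <- pi_real * (1 - pi_tilde) * odds
  num / (num + pi_tilde * (1 - pi_real))
}

# e.g., a "balanced" estimate of .9 corresponds to a much smaller
# probability when the real prevalence of class 1 is only 1%
translate_prob(0.9, pi_real = 0.01)
```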
$$L(\hat{y}, y) = -y \ln \hat{y} - (1 - y)\ln(1 - \hat{y})$$
For reasons that will shortly become clear (in Section 1.5.3), it is sometimes convenient to use not 0-1 coding but rather $\pm 1$ coding in 2-class classification models, so that $y$ is in $\{-1, 1\}$. Suppose that $\hat{y}$ is allowed to be any real number; then three other (initially odd-looking) losses are sometimes considered, namely
$$L_1(\hat{y}, y) = \ln\left(1 + \exp(-y\hat{y})\right)/\ln(2),$$
$$L_2(\hat{y}, y) = \exp(-y\hat{y}), \quad \text{and}$$
$$L_3(\hat{y}, y) = (1 - y\hat{y})_+$$
For these losses, theoretically optimal predictors are respectively
$$f_1(x) = \ln\frac{P[y = 1|x]}{P[y = -1|x]} = \ln\left(\frac{\pi}{1-\pi}\,L(x)\right),$$
$$f_2(x) = \frac{1}{2}\ln\frac{P[y = 1|x]}{P[y = -1|x]} = \frac{1}{2}\ln\left(\frac{\pi}{1-\pi}\,L(x)\right), \quad \text{and}$$
$$f_3(x) = \mathrm{sign}\left(P[y = 1|x] - P[y = -1|x]\right)$$
$$\left(\frac{j}{M_0},\ \hat{p}_{1j}\right) \quad \text{for } j = 1, 2, \ldots, M_0 - 1, M_0 \tag{31}$$
where if the test cases are arranged left to right as judged least-to-most likely to have $y_i = 1$,
$$\hat{p}_{1j} = \text{the fraction of } y_i = 1 \text{ cases to the right of the } j\text{th left-most } y_i = 0 \text{ case}$$
If one then makes a step function from the plotted points (constant at the vertical of a plotted point over the interval of length $1/M_0$ to its left) and then computes the area under that "curve" one obtains an "AUC" (a figure of merit often used in predictive analytics contests). If the ordering of cases comes from $O$, this area is
$$AUC = \frac{1}{M_0}\sum_{j=1}^{M_0} \hat{p}_{1j} = \frac{1}{M_0}\sum_{i \text{ s.t. } y_i = 0}\left(\frac{1}{M_1}\sum_{j \text{ s.t. } y_j = 1} I\left[O(x_i) < O(x_j)\right]\right) \tag{32}$$
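A brute-force sketch of the right-most expression in display (32) for a generic score/ordering $O$ (the data below are fabricated):

```r
# Sketch: AUC computed as the fraction of (class 0, class 1) pairs that the
# score O orders correctly, per the double sum in display (32).
auc_pairwise <- function(score, y) {    # y in {0, 1}, score = O(x)
  s0 <- score[y == 0]; s1 <- score[y == 1]
  mean(outer(s0, s1, "<"))              # (1/(M0*M1)) * sum I[O(x_i) < O(x_j)]
}

set.seed(6)
y <- rbinom(200, 1, 0.3)
score <- y + rnorm(200)                 # a noisy but informative score
auc_pairwise(score, y)
```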
$I_P$ is exactly the criterion (30) and in the event that $G_0(t)$ is continuous and increasing (and thus has an inverse) this is
$$I_P = 1 - \int_0^1 G_1\left(G_0^{-1}(u)\right) du$$
Notice too that if for each $t$ one builds from $O$ a classifier of the form
1.5.3 "Voting Functions," Losses for Them, and Expected 0-1 Loss
The fact that empirical search for a good 2-class classifier is essentially search for a good approximation to the likelihood ratio function $L(x)$ raises another kind of consideration for 2-class problems. That is the possibility of focusing on the building of a good "voting function" $g(x)$ to underlie a classifier.
For the time being, it's now convenient to employ the $\pm 1$ coding of class labels (use $\mathcal{G} = \{-1, 1\}$) and to without much loss of generality consider classifiers defined for an arbitrary voting function $g(x)$ by
$$f(x) = \mathrm{sign}\left(g(x)\right)$$
(except for the possibility that $g(x) = 0$, that typically has 0 probability for both classes). Then an optimal voting function for 0-1 loss is
$$g^{\mathrm{opt}}(x) = \frac{p(x|1)\,P[y = 1]}{p(x|-1)\,P[y = -1]} \tag{35}$$
With this notation, a classifier $f(x) = \mathrm{sign}(g(x))$ produces 0-1 loss neatly written as
$$L(\hat{y}, y) = I[y\,g(x) < 0]$$
(a loss of 1 is incurred when $y$ and $g(x)$ have opposite signs). So the 0-1 loss expected loss/error rate has the useful representation
$$\mathrm{E}\,I[y\,g(x) < 0] \tag{36}$$
We have seen that a function $g$ optimizing the average value (36) is $g^{\mathrm{opt}}(x)$ defined in (35). But the indicator function $I[u < 0]$ involved in (36) is discontinuous (and thus non-differentiable), and for some purposes it would be more convenient to work with a continuous (even differentiable) one in making an empirical choice of voting function.
If $I[u < 0] \le h(u)$, it is obvious that
$$\mathrm{E}\,I[y\,g(x) < 0] \le \mathrm{E}\,h(y\,g(x)) \tag{37}$$
So the right hand side of display (37) functions as an upper bound for the 0-1 loss error rate and an approximate (data-based) minimizer of that right hand side used as a voting function can be expected to control 0-1 loss error rate. Several different continuous choices of "loss" $h(u)$ can be viewed as motivating popular methods of (voting function and) classifier development. These include
$$h_1(u) = \ln\left(1 + \exp(-u)\right)/\ln(2), \qquad h_2(u) = \exp(-u), \qquad \text{and} \qquad h_3(u) = (1 - u)_+$$
For reference, the indicator function $I[u < 0]$ and the functions $h_1(u)$, $h_2(u)$, and $h_3(u)$ are plotted together in Figure 4.
One reason why this line of argument proves effective is that not only does bound (37) hold, but minimizers of $\mathrm{E}\,h(y\,g(x))$ over choice of function $g$ for
Figure 4: "Losses" $I[u < 0]$ in black, $h_1(u)$ in red, $h_2(u)$ in blue, and $h_3(u)$ in green.
standard choices of $h$ with $h(u) \ge I[u < 0]$ are directly related to the likelihood ratio. This can be seen using the results concerning optimal predictors in 2-class classification models from Section 1.5.2. That is,
$$\mathrm{E}\,h_1(y\,g(x)) = \mathrm{E}\,L_1(g(x), y) \ \text{ has optimizer } \ g_1^{\mathrm{opt}}(x) = \ln\frac{P[y = 1|x]}{P[y = -1|x]},$$
$$\mathrm{E}\,h_2(y\,g(x)) = \mathrm{E}\,L_2(g(x), y) \ \text{ has optimizer } \ g_2^{\mathrm{opt}}(x) = \frac{1}{2}\ln\frac{P[y = 1|x]}{P[y = -1|x]}, \ \text{ and}$$
$$\mathrm{E}\,h_3(y\,g(x)) = \mathrm{E}\,L_3(g(x), y) \ \text{ has optimizer } \ g_3^{\mathrm{opt}}(x) = \mathrm{sign}\left(\frac{P[y = 1|x]}{P[y = -1|x]} - 1\right)$$
The first two functions are monotone transformations of the likelihood ratio and when used as a voting function produce a (0-1 loss) optimal classifier. The third is the optimal classifier itself.
So empirical search for optimizers of (an empirical version of) the risk $\mathrm{E}\,h(y\,g(x))$ can produce good classifiers. This has the fascinating effect of making SEL prediction and classification look very much alike. Ultimately, in development of a predictor, one is searching among some class of functions, $S$, for a real-valued $g$ making an appropriate empirical approximation of a risk measure small.
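For reference, a short sketch that reproduces a comparison like Figure 4, taking $h_1$, $h_2$, $h_3$ to be the functions implied by the losses $L_1$, $L_2$, $L_3$ above:

```r
# Sketch: the 0-1 "loss" I[u < 0] and three continuous upper bounds h(u)
u  <- seq(-2, 2, length.out = 400)
h0 <- as.numeric(u < 0)                 # I[u < 0]
h1 <- log(1 + exp(-u)) / log(2)         # h_1(u)
h2 <- exp(-u)                           # h_2(u)
h3 <- pmax(1 - u, 0)                    # h_3(u) = (1 - u)_+

matplot(u, cbind(h0, h1, h2, h3), type = "l", lty = 1, lwd = 2,
        col = c("black", "red", "blue", "green"), ylab = "loss", ylim = c(0, 3))
```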
in the optimal form (2) to produce
Initially suppose that $p = 1$. For $g(\cdot)$ some fixed pdf (like, for example, the standard normal pdf), invent a location-scale family of densities on $\Re$ by defining (for "bandwidth" $\lambda > 0$)
$$h(\cdot\,|\,\mu, \lambda) = \frac{1}{\lambda}\, g\!\left(\frac{\cdot - \mu}{\lambda}\right)$$
Figure 5: A $p = 1$ density, a corresponding sample of $N = 100$ values $x$, and three density estimates based on different bandwidths.
Figure 7: 6 samples of size N = 100 from the bivariate density of Figure 6
and density estimates made using the kde2d function in the MASS package with
default choice of "bandwidth" covariance matrix.
Figure 8: Two $N = 100$ density estimates and their ratio for classifications between Uniform $[-3, 3]^2$ and the distribution of Figure 6.
That is, consider the case where one uses as a multivariate density estimate
$$\widehat{q}(x) = \frac{1}{N}\sum_{i=1}^N \phi\!\left(x \mid x_i,\ \lambda^2 I\right)$$
(where $\phi(\cdot \mid \mu, \Sigma)$ is the MVN$_p$ density with mean vector $\mu$ and covariance matrix $\Sigma$). A bit of algebra shows that with this kind of multivariate estimate of the class-conditional densities (based on the parts of the training set with $y = k$) (and using training set relative frequencies to estimate class probabilities) the approximately Bayes classifier is
$$\hat{f}(x) = \arg\max_k\ \sum_{i \text{ s.t. } y_i = k} \exp\!\left(-\frac{1}{2\lambda^2}\|x - x_i\|^2\right) \tag{38}$$
relatively many training inputs from class $k$. The bandwidth $\lambda$ might be chosen based on cross-validation of classifier performance.
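A sketch of classifier (38) (function and argument names are invented; the bandwidth $\lambda$ would in practice come from cross-validation as just described):

```r
# Sketch: the Gaussian-kernel "approximately Bayes" classifier of display (38).
# X: N x p matrix of training inputs, y: class labels, lambda: bandwidth.
kde_classify <- function(x0, X, y, lambda) {
  d2 <- colSums((t(X) - x0)^2)           # squared distances from x0 to each x_i
  w  <- exp(-d2 / (2 * lambda^2))
  scores <- tapply(w, y, sum)            # per-class vote totals
  names(scores)[which.max(scores)]
}

# toy two-class illustration with fabricated data
set.seed(7)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 2), ncol = 2))
y <- rep(c("A", "B"), each = 50)
kde_classify(c(2, 2), X, y, lambda = 0.5)
```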
The statistical folklore is that this kind of classifier can work poorly in high dimensions because of the imprecisions (large variances) of the density estimators. The "estimated density" approximations to the optimal rule are based on what are usually low-bias-but-high-variance estimators. As such, the corresponding classifiers are very flexible, but can perform poorly for small training sets. Less flexible classification methods will often perform much better in practical problems (although those methods may be incapable of approximating the optimal rule for all cases, even if $N$ is huge).
There is a variant of form (38) that is thought to sometimes be effective even when $p$ is not small (and $p$-dimensional density estimation is hopeless). The basic idea is to estimate 1-dimensional marginals of the $p(x|k)$s and use their products in place of the $\widehat{p(x|k)}$s. That is, if for each $k$ the density $p(x|k): \Re^p \to \Re^+$ has marginal densities $p_1(x_1|k), p_2(x_2|k), \ldots, p_p(x_p|k)$ (each mapping $\Re \to \Re^+$), while it may not be feasible to estimate $p(x|k)$, it could be possible to effectively estimate $p_1(x_1|k), p_2(x_2|k), \ldots, p_p(x_p|k)$. If this is the case, the classifier
$$\hat{f}(x) = \arg\max_k\ \widehat{P[y = k]}\,\prod_{j=1}^p \widehat{p_j}(x_j|k)$$
might be employed. (That is, one might treat elements $x_j$ of $x$ as if they were independent for every $k$, and multiply together kernel estimates of marginal densities.) This has been called a "naive Bayes" classifier.
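A rough sketch of such a rule using R's density() for the univariate kernel estimates of the marginals (all names below are invented):

```r
# Sketch: "naive Bayes" with kernel estimates of the 1-d marginals p_j(x_j|k).
# X: N x p training inputs, y: class labels, x0: a new input vector.
naive_bayes_kde <- function(x0, X, y, ...) {
  classes <- unique(y)
  scores <- sapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    prior <- mean(y == k)                              # P-hat[y = k]
    marg <- sapply(seq_along(x0), function(j) {        # p_j-hat(x0_j | k)
      d <- density(Xk[, j], ...)
      approx(d$x, d$y, xout = x0[j], rule = 2)$y
    })
    prior * prod(marg)
  })
  classes[which.max(scores)]
}

set.seed(8)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 2), ncol = 2))
y <- rep(c("A", "B"), each = 50)
naive_bayes_kde(c(0, 0), X, y)
```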
The method seems to have a reputation for often being useful. But there
will certainly be situations where it doesn’t work very well because of failure
to account for strong dependencies between input variables. Figure 9 shows
the common marginal for x1 and x2 corresponding to the distribution of Figure
6. Figure 10 then shows the original bivariate density and the distribution of
independence with the same marginals. The product density is clearly quite
different from the original and estimation of the marginals alone can at best
only reproduce the product form.
Figure 9: The common marginal pdf for both x1 and x2 for the bivariate distri-
bution of Figure 6.
Figure 10: Original bivariate density from Figure 6 and a product density based
on the marginal(s) (as pictured in Figure 9).
This might be plotted (e.g. in contour plot fashion) and the plot called a partial dependence plot for the variables $x_1$ and $x_2$. HTF's language is that this function details the dependence of the predictor on $(x_1, x_2)$ "after accounting for the average effects of the other variables." This thinking amounts to a version of the kind of thing one does in ordinary factorial linear models, where main effects are defined in terms of average (across all levels of all other factors) means for individual levels of a factor, two-factor interactions are defined in terms of average (again across all levels of all other factors) means for pairs of levels of two factors, etc.
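A sketch of the empirical partial dependence computation itself, i.e. averaging a fitted predictor over the training values of the excluded inputs (the fitted model and data below are fabricated):

```r
# Sketch: empirical partial dependence of a fitted predictor on (x1, x2),
# i.e. the average of f(x1, x2, x_i3, ..., x_ip) over training rows i.
partial_dependence <- function(fit, data, x1, x2) {
  grid <- expand.grid(x1 = x1, x2 = x2)
  pd <- apply(grid, 1, function(g) {
    d <- data
    d$x1 <- g["x1"]; d$x2 <- g["x2"]      # fix (x1, x2), keep the other columns
    mean(predict(fit, d))                 # average over the training cases
  })
  matrix(pd, nrow = length(x1))           # ready for contour()
}

set.seed(9)
data <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
data$y <- data$x1 * data$x2 + data$x3 + rnorm(200, sd = 0.2)
fit <- lm(y ~ x1 * x2 + x3, data = data)
pd <- partial_dependence(fit, data, seq(-2, 2, 0.25), seq(-2, 2, 0.25))
contour(seq(-2, 2, 0.25), seq(-2, 2, 0.25), pd)
```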
Something different raised by HTF is consideration of
(This is, by the way, the function of $(x_1, x_2)$ closest to $f(x)$ in $L_2(P)$.) This is obtained by averaging not against the marginal of the excluded variables, but against the conditional distribution of the excluded variables given $x_1$ and $x_2$. No workable empirical version of $\tilde{f}_{12}(x_1, x_2)$ can typically be defined. And it should be clear that this is not the same as $\bar{f}_{12}(x_1, x_2)$. HTF say this in some sense describes the effects of $(x_1, x_2)$ on the prediction "ignoring the impact of the other variables." (In fact, if $f(x)$ is a good approximation for $\mathrm{E}[y|x]$, this conditioning produces essentially $\mathrm{E}[y|x_1, x_2]$ and $\tilde{f}_{12}$ is just a predictor of $y$ based on $(x_1, x_2)$.)
The difference between $\bar{f}_{12}$ and $\tilde{f}_{12}$ is easily seen through resort to a simple example. If, for example, $f$ is additive of the form
$$f(x) = h_1(x_1, x_2) + h_2(x_3, \ldots, x_p)$$
then
$$\bar{f}_{12}(x_1, x_2) = h_1(x_1, x_2) + \mathrm{E}\left[h_2(x_3, \ldots, x_p)\right]$$
while
$$\tilde{f}_{12}(x_1, x_2) = h_1(x_1, x_2) + \mathrm{E}\left[h_2(x_3, \ldots, x_p)\,|\,x_1, x_2\right]$$
and the $\tilde{f}_{12}$ "correction" to $h_1(x_1, x_2)$ is not necessarily constant in $(x_1, x_2)$.
The upshot of all this is that partial dependence plots are potentially helpful, but that one needs to remember that they are produced by averaging according to the marginal of the set of variables not under consideration.
vector spaces are the Euclidean spaces $\Re^p$ where elements are "ordinary" $p$-dimensional vectors. But other kinds of vector spaces are useful in statistical machine learning as well, including function spaces. Take for example the set of functions on $[0, 1]$ that have finite integrals of their squares. (This space is sometimes known as $L_2([0, 1])$.) More or less obviously, if $g: [0, 1] \to \Re$ with $\int_0^1 (g(x))^2\, dx < \infty$ and $a \in \Re$, then $a\,g(x)$ makes sense, maps $[0, 1]$ to $\Re$ and has $\int_0^1 (a\,g(x))^2\, dx = a^2 \int_0^1 (g(x))^2\, dx < \infty$. Further, if $g: [0, 1] \to \Re$ with $\int_0^1 (g(x))^2\, dx < \infty$ and $h: [0, 1] \to \Re$ with $\int_0^1 (h(x))^2\, dx < \infty$, then the function $g(x) + h(x)$ makes sense, maps $[0, 1]$ to $\Re$ and has finite integral of its square.
The notion of an inner product (of pairs of elements of a vector space $\mathcal{V}$) is that of a symmetric (bi-)linear positive definite function $\langle v, w\rangle$ mapping $\mathcal{V} \times \mathcal{V} \to \Re$. That is, $\langle v, w\rangle$ is an inner product on the vector space $\mathcal{V}$ if it satisfies
1. $\langle v, w\rangle = \langle w, v\rangle$ for all $v, w \in \mathcal{V}$ (symmetry),
2. $\langle a v + b u, w\rangle = a\,\langle v, w\rangle + b\,\langle u, w\rangle$ for all $u, v, w \in \mathcal{V}$ and $a, b \in \Re$ (linearity), and
3. $\langle v, v\rangle \ge 0$ with $\langle v, v\rangle = 0$ only for $v = 0$ (positive definiteness).
Of course Euclidean $p$-space is a vector space with inner product defined as the "dot-product" of $p$-dimensional vectors $v$ and $w$, namely
$$\langle v, w\rangle = v'w = \sum_{j=1}^p v_j w_j$$
It is possible to argue that in the case of the $L_2([0, 1])$ function space, the integral of the product of two elements provides a valid inner product, that is
$$\langle g, h\rangle \equiv \int_0^1 g(x)\, h(x)\, dx$$
satisfies 1. through 3.
An inner product on a vector space $\mathcal{V}$ leads immediately to notions of size and distance in the space. The norm (i.e. the "size" or "length") of an element of $\mathcal{V}$ can be coherently defined as
$$\|v\| \equiv \sqrt{\langle v, v\rangle}$$
Then the distance between two elements of $\mathcal{V}$ can be taken to be the size of the difference between them. That is, the distance between $v$ and $w$ belonging to $\mathcal{V}$ (say $d(v, w)$) derived from the inner product is
$$d(v, w) = \|v - w\|$$
This satisfies all the properties necessary to qualify as a "metric" or "distance function," including the important triangle inequality.
In Euclidean $p$-space, the norm is the geometrical length of a $p$-vector (the root of the sum of the $p$ squared entries of the vector) and the associated distance is ordinary Euclidean distance. In the case of the $L_2([0, 1])$ function space, the norm/size of an element $g$ is
$$\|g\| = \sqrt{\int_0^1 (g(x))^2\, dx}$$
of fitted values
$$\underset{N \times 1}{\widehat{Y}} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}$$
For many purposes it would be convenient if the columns of a full rank (rank $= p$) matrix $X$ were orthogonal. In fact, it would be useful to replace the $N \times p$ matrix $X$ with an $N \times p$ matrix $Z$ with orthogonal columns and having the property that for each $l$, if $X_l$ and $Z_l$ are $N \times l$ matrices consisting of the first $l$ columns of respectively $X$ and $Z$, then $C(Z_l) = C(X_l)$. Such a matrix can in fact be constructed using the so-called Gram-Schmidt process. This process generalizes beyond the present application to $\Re^N$ to general inner product spaces, and in recognition of that important fact we'll first describe it in general terms and then consider its implications for a (rank $= p$) matrix $X$.
Consider $p$ vectors $x_1, x_2, \ldots, x_p$ (that could be $N$-vectors where $x_j$ is the $j$th column of $X$)¹⁴. The Gram-Schmidt process proceeds as follows:
1. Set
$$z_1 = x_1 \quad \text{and} \quad q_1 = \langle z_1, z_1\rangle^{-1/2} z_1 = \frac{1}{\|z_1\|}\, z_1$$
2. For $l = 2, 3, \ldots, p$, set
$$z_l = x_l - \sum_{j=1}^{l-1} \frac{\langle x_l, z_j\rangle}{\langle z_j, z_j\rangle}\, z_j \quad \text{and} \quad q_l = \frac{1}{\|z_l\|}\, z_l$$
Then for any $j < l$,
$$\langle z_l, z_j\rangle = \langle x_l, z_j\rangle - \langle x_l, z_j\rangle = 0$$
as at most one term of the sum in step 2. above is non-zero. Further, assume that the span of $\{z_1, z_2, \ldots, z_{l-1}\}$ is the same as the span of $\{x_1, x_2, \ldots, x_{l-1}\}$. $z_l$ is in the span of $\{x_1, x_2, \ldots, x_l\}$ so that the span of $\{z_1, z_2, \ldots, z_l\}$ is a subset of the span of $\{x_1, x_2, \ldots, x_l\}$. And since any element of the span of $\{x_1, x_2, \ldots, x_l\}$ can be written as a linear combination of an element of the span of $\{z_1, z_2, \ldots, z_{l-1}\}$ (span of $\{x_1, x_2, \ldots, x_{l-1}\}$) and $x_l$, we also have that the span of $\{x_1, x_2, \ldots, x_l\}$ is contained in the span of $\{z_1, z_2, \ldots, z_l\}$, so that the two spans are the same.
¹⁴Notice that this is in potential conflict with earlier notation that made $x_i$ the $p$-vector of inputs for the $i$th case in the training data. We will simply have to read the following in context and keep in mind the local convention.
Figure 11: A $p = 2$ illustration of the Gram-Schmidt construction of an orthonormal basis for the subspace spanned by $x_1$ and $x_2$.
The vectors $q_1, q_2, \ldots, q_l$ then comprise an orthonormal basis for the span of $\{x_1, x_2, \ldots, x_l\}$, and the projection of any vector $w$ onto that span is
$$\sum_{j=1}^l \frac{\langle w, z_j\rangle}{\langle z_j, z_j\rangle}\, z_j = \sum_{j=1}^l \langle w, q_j\rangle\, q_j \tag{39}$$
space. So the projection of a vector of outputs $Y$ onto $C(X_l)$ is
$$\sum_{j=1}^l \frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}\, z_j = \sum_{j=1}^l \langle Y, q_j\rangle\, q_j$$
and
$$\frac{\langle Y, z_p\rangle}{\langle z_p, z_p\rangle}$$
is the regression coefficient for $z_p$ and (since only $z_p$ involves it) the last variable in $X$, $x_p$. So, in constructing a vector of fitted values, fitted regression coefficients in multiple regression can be interpreted as weights to be applied to that part of the input vector that remains after projecting the predictor onto the space spanned by all the others.
The construction of the orthogonal variables $z_j$ can be represented in matrix form as
$$\underset{N \times p}{X} = \underset{N \times p}{Z}\ \underset{p \times p}{\Gamma}$$
Defining
$$D = \mathrm{diag}\!\left(\langle z_1, z_1\rangle^{1/2}, \ldots, \langle z_p, z_p\rangle^{1/2}\right) = \mathrm{diag}\!\left(\|z_1\|, \ldots, \|z_p\|\right)$$
and letting
$$Q = Z D^{-1} \quad \text{and} \quad R = D\,\Gamma$$
one may write
$$X = QR \tag{40}$$
that is the so-called QR decomposition of $X$.
Note that the notation used here is consistent, in that for $q_j$ the $j$th column of $Q$, $q_j = \langle z_j, z_j\rangle^{-1/2} z_j$ as was used in defining the Gram-Schmidt process. In display (40), $Q$ is $N \times p$ with
$$Q'Q = D^{-1} Z'Z D^{-1} = D^{-1}\,\mathrm{diag}\!\left(\langle z_1, z_1\rangle, \ldots, \langle z_p, z_p\rangle\right) D^{-1} = I$$
consistent with the fact that $Q$ has for columns perpendicular unit vectors that form a basis for $C(X)$. $R$ is upper triangular and that says that only the first $l$ of these unit vectors are needed to create $x_l$.
55
The decomposition is computationally useful in that the projection of a response vector Y onto $C(X)$ is
$$\widehat{Y} = \sum_{j=1}^{p}\langle Y, q_j\rangle\, q_j = QQ'Y \qquad (41)$$
and
$$\widehat{\beta}^{\,\mathrm{ols}} = R^{-1}Q'Y$$
(The fact that R is upper triangular implies that there are efficient ways to compute its inverse.)
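As an illustration of the preceding (not from the notes), a minimal numpy sketch that builds Z, Q, and R by Gram-Schmidt for a full rank X and recovers the OLS coefficient vector as $R^{-1}Q'Y$ might look like the following.

```python
import numpy as np

def gram_schmidt_qr(X):
    """Gram-Schmidt on the columns of X, returning Z (orthogonal), Q (orthonormal), R."""
    N, p = X.shape
    Z = np.zeros((N, p))
    for l in range(p):
        z = X[:, l].copy()
        for j in range(l):
            # subtract the projection of x_l onto each earlier z_j
            z -= (X[:, l] @ Z[:, j]) / (Z[:, j] @ Z[:, j]) * Z[:, j]
        Z[:, l] = z
    D = np.diag(np.linalg.norm(Z, axis=0))
    Q = Z @ np.linalg.inv(D)          # orthonormal columns spanning C(X)
    R = Q.T @ X                       # upper triangular, so that X = Q R
    return Z, Q, R

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y = rng.normal(size=20)
Z, Q, R = gram_schmidt_qr(X)
beta_ols = np.linalg.solve(R, Q.T @ Y)   # R^{-1} Q'Y
assert np.allclose(beta_ols, np.linalg.lstsq(X, Y, rcond=None)[0])
```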
$$d_1 \geq d_2 \geq \cdots \geq d_r > 0$$
Now using the singular value decomposition of a full rank (rank p) X,
$$X'X = VD'U'UDV' = VD^2V' \qquad (43)$$
which is the eigen (or spectral) decomposition of the symmetric and positive definite X'X. (The eigenvalues are the squares of the SVD singular values.)
The vector
$$z_1 \equiv Xv_1$$
is the product Xw with the largest squared length in $\Re^N$ subject to the constraint that $\|w\| = 1$. A second representation of $z_1$ is
$$z_1 = Xv_1 = UDV'v_1 = UD\,(1, 0, \ldots, 0)' = d_1u_1$$
and we see that this largest squared length is $d_1^2$ and the vector points in the direction of $u_1$. In general,
$$z_j \equiv Xv_j = \left(\langle x_1, v_j\rangle, \ldots, \langle x_N, v_j\rangle\right)' = d_ju_j \qquad (44)$$
is the vector of the form Xw with the largest squared length in $\Re^N$ subject to the constraints that $\|w\| = 1$ and $\langle w, z_l\rangle = 0$ for all $l < j$. The squared length is $d_j^2$ and the vector points in the direction of $u_j$.
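A small numpy sketch (an illustration, not from the notes) confirming these SVD facts, namely that $X'X = VD^2V'$ and that $Xv_j = d_ju_j$ with squared length $d_j^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))                       # full rank N x p matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(d) V'
V = Vt.T

# X'X = V D^2 V'  (display (43))
assert np.allclose(X.T @ X, V @ np.diag(d**2) @ V.T)

# z_j = X v_j = d_j u_j  (display (44)), with squared length d_j^2
for j in range(len(d)):
    z_j = X @ V[:, j]
    assert np.allclose(z_j, d[j] * U[:, j])
    assert np.isclose(z_j @ z_j, d[j]**2)
```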
X represents the vectors $w_1, w_2, \ldots, w_N$ in the sense that its rows give coefficients to be applied to the elements of the orthonormal basis (the $e_l$'s) in order to make linear combinations that are the $w_i$'s.
Now, as above, consider the SVD of X and some related elements of A. Begin with elements of A related to the right singular vectors $v_j \in \Re^r$. Corresponding to them are vectors
$$a_j = \sum_{l=1}^{r} v_{jl}\, e_l$$
(the real entries of $v_j$ supplying coefficients for the $e_l$'s in order to make up $a_j$ as a linear combination of the basis vectors). Notice that
$$\langle a_j, a_{j'}\rangle_A = \sum_{l=1}^{r} v_{jl}v_{j'l} = I[j = j']$$
$$= \left\|\left(\sum_{l=1}^{r} v_{1l}\,\langle w_i, e_l\rangle_A\right)_{i=1,2,\ldots,N}\right\|^2 = \|Xv_1\|^2$$
has the maximum value of
$$\sum_{i=1}^{N}\left\langle w_i, \sum_{l=1}^{r} c_le_l\right\rangle_A^2 = \left\|\left(\left\langle w_i, \sum_{l=1}^{r} c_le_l\right\rangle_A\right)_{i=1,2,\ldots,N}\right\|^2 = \left\|\left(\sum_{l=1}^{r} c_l\,\langle w_i, e_l\rangle_A\right)_{i=1,2,\ldots,N}\right\|^2 = \|Xc\|^2$$
possible for c a unit vector in $\Re^r$ and thus $\sum_{l=1}^{r} c_le_l$ a unit vector in A. That is,
$a_1$ is a unit vector in A pointing in a direction such that the projections of the $w_i$ onto the 1-dimensional subspace of multiples of it have the largest possible sum of squared norms. In general, $a_j$ is a unit vector in A perpendicular to all of $a_1, a_2, \ldots, a_{j-1}$ with maximum sum of squared norms for the projections of the $w_i$ onto the 1-dimensional subspace of multiples of it.
In a case where a transform T maps $\Re^p$ to an inner product space A, and one is interested in the subspace of A of dimension $r \leq N$ spanned by the image of the set of training input vectors
$$\{T(x_1), T(x_2), \ldots, T(x_N)\},$$
with $w_i = T(x_i)$, the foregoing then translates the SVD of the matrix (45) into abstract inner product space geometrical insights concerning transformed training vectors.
2.4 Matrices of Centered Columns and Principal Components
In the event that all the columns of X have been centered (each $1'x_j = 0$ for $x_j$ the jth column of X), there is additional terminology and insight associated with singular value decompositions as describing the structure of X. Note that centering is often sensible in unsupervised learning contexts because the object is to understand the internal structure of the data cases $x_i \in \Re^p$, not the location of the data cloud (that is easily represented by the sample mean vector). So accordingly, we first translate the data cloud to the origin.
Principal components ideas are then based on the singular value decomposition of X
$$\underset{N\times p}{X} = \underset{N\times r}{U}\;\underset{r\times r}{D}\;\underset{r\times p}{V'}$$
first right singular vector (i.e. points "at" the raw data). The blue arrow is in the first principal component direction of the standardized data (pointing in the direction of their greatest variation).
Figure 13: Example of a small p = 2 dataset (red dots) and standardized version (blue dots) and (multiples of) the first right singular vector of the dataset and the first principal direction of the standardized dataset.
a sum of rank 1 summands, producing for $X_l$ a matrix with each $x_i$ in X replaced by the transpose of its projection onto $C(V_l)$.
Since $z_j = d_ju_j$, $z_jv_j' = d_ju_jv_j'$. Then since the $u_j$'s and $v_j$'s are unit vectors, the sum of squared entries of both $z_j$ and $z_jv_j'$ is $d_j^2$. These are non-increasing in j. So the $z_j$ and $z_jv_j'$ decrease in "size" with j, and directions $v_1, v_2, \ldots, v_r$ are successively "less important" in describing variation in the $x_i$ and in reconstructing X. This agrees with the common interpretation of cases where a few singular values are much bigger than the others. There the "simple structure" in the data is that observations can be more or less reconstructed as linear combinations of a few orthonormal vectors.
Figure 14 portrays a hypothetical p = 3 dataset. Shown are the N = 9 data points, the rank = 1 approximation (black balls on the line defined by the first PC direction) and the rank = 2 approximation (black stars on the plane).
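A brief numpy sketch (illustrative only) of these centered-data facts, exhibiting the principal components $z_j = d_ju_j$ and the rank-l reconstruction of X built from the leading terms $z_jv_j'$:

```python
import numpy as np

rng = np.random.default_rng(2)
raw = rng.normal(size=(9, 3)) @ rng.normal(size=(3, 3))   # hypothetical small dataset
X = raw - raw.mean(axis=0)                                 # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d                     # N x r matrix of principal components (columns z_j = d_j u_j)

def rank_l_approx(l):
    # sum of the first l rank-1 summands z_j v_j'
    return Z[:, :l] @ Vt[:l, :]

# reconstruction error (squared Frobenius norm) equals the sum of the dropped d_j^2
for l in range(1, len(d) + 1):
    err = np.sum((X - rank_l_approx(l))**2)
    assert np.isclose(err, np.sum(d[l:]**2))
```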
single "nearly 0" singular value identifies a quadratic function of the functionally independent variables that must be essentially constant, a potentially useful insight about the dataset.
To summarize interpretation of principal components of a centered dataset, one can say the following:
$$\frac{1}{N}\, z_j'z_j = \frac{1}{N}\, d_ju_j'u_jd_j = \frac{d_j^2}{N}$$
The SVD of X also implies that
$$XX' = UDV'VDU' = UD^2U'$$
16 Notice that when X has standardized columns (i.e. each column of X, $x_j$, has $\langle 1, x_j\rangle = 0$ and $\langle x_j, x_j\rangle = N$), the matrix $\frac{1}{N}X'X$ is the sample correlation matrix for the p input variables $x_1, x_2, \ldots, x_p$.
and it's then clear that the columns of U are eigenvectors of $XX'$ and the squares of the diagonal elements of D are the corresponding eigenvalues. UD then produces the $N \times r$ matrix of principal components of the data. The principal component directions are unavailable (even indirectly) based only on this second eigen analysis.
$$\widetilde{J} = I - \frac{1}{N}\,11' = I - \frac{1}{N}J \qquad (46)$$
Then using the basic reproducing kernel fact that $\langle K(x,\cdot), K(z,\cdot)\rangle_A = K(x,z)$ and the notation K for the Gram matrix (21), it is easy enough to find the representation
$$C = K - \frac{1}{N}JK - \frac{1}{N}KJ + \frac{1}{N^2}JKJ \qquad (49)$$
for the symmetric non-negative definite C. Finally, an eigen analysis will produce principal components (N vectors of length N of scores) for the training data expressed in the abstract feature space.
To realize the entries in these eigenvectors of kernel principal component scores as inner products of the N functions (47) with "principal component directions" in the abstract feature space, A, one may return to Section 2.3.1 and begin with any orthonormal basis $E_1(\cdot), E_2(\cdot), \ldots, E_N(\cdot)$ for the span of the functions (47) (coming, for example, from use of the Gram-Schmidt process). Then the general inner product space argument beginning with an $N \times N$ matrix with entries $\left\langle K(x_i,\cdot) - \overline{K}(\cdot),\, E_j(\cdot)\right\rangle_A$ produces N basis functions $V_1(\cdot), V_2(\cdot), \ldots, V_N(\cdot)$ whose A inner products with functions (47) are (up to a sign for each $V_j(\cdot)$) the entries of the eigenvectors of C. In cases with small p it may be of interest to examine these abstract principal component direction functions via some plotting.
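The following numpy sketch (illustrative, with a Gaussian kernel and bandwidth chosen arbitrarily for the example) computes the centered matrix C of display (49) and extracts kernel principal component scores from its eigen analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))            # training inputs x_1, ..., x_N in R^p
gamma = 0.5                             # assumed bandwidth parameter for the illustration

# Gram matrix K with entries K(x_i, x_j) for a Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
K = np.exp(-gamma * sq_dists)

N = K.shape[0]
J = np.ones((N, N))                     # N x N matrix of 1s
C = K - J @ K / N - K @ J / N + J @ K @ J / N**2   # display (49)

# eigen analysis of the symmetric non-negative definite C
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# columns of "scores" are (up to the scaling convention adopted) kernel principal components
scores = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
```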
and
$$G = \mathrm{diag}(g_1, g_2, \ldots, g_N)$$
It is common to think of the points $x_1, x_2, \ldots, x_N$ in $\Re^p$ as nodes/vertices on a graph, with edges between nodes weighted by similarities $s_{ij}$, and the $g_i$ so-called node degrees, i.e. sums of weights of the edges connected to nodes i. In such thinking, $s_{ij} = 0$ indicates that there is no "edge" between case i and case j.
The matrix
$$L = G - S$$
is called the (unnormalized) graph Laplacian, and one standardized (with respect to the node degrees) version of this is
$$\widetilde{L} = G^{-1}L = I - G^{-1}S$$
and a second standardized version is
$$L^* = G^{-1/2}LG^{-1/2} = I - G^{-1/2}SG^{-1/2} \qquad (50)$$
Note that for any vector u,
$$u'Lu = \sum_{i=1}^{N} g_iu_i^2 - \sum_{i=1}^{N}\sum_{j=1}^{N} u_iu_js_{ij} = \frac{1}{2}\left(\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}u_i^2 + \sum_{j=1}^{N}\sum_{i=1}^{N} s_{ij}u_j^2\right) - \sum_{i=1}^{N}\sum_{j=1}^{N} u_iu_js_{ij} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\left(u_i - u_j\right)^2 \qquad (51)$$
$$\lambda_l = v_l'Lv_l = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\left(v_{li} - v_{lj}\right)^2 \geq 0 \qquad (52)$$
and points $x_i$ and $x_j$ with large adjacencies must have similar corresponding coordinates of the eigenvectors. HTF (at the bottom of their page 545) essentially argue that the number of "0 or nearly 0" eigenvalues of L is indicative of the number of connected structures in the original N data vectors. A series of points could be (in sequence) close to successive elements of the sequence but have very small adjacencies for points separated in the sequence. "Structures" by this methodology need NOT be "clumps" of points, but could also be serpentine "chains" of points in $\Re^p$.
A second version of this is easily built on the symmetric normalized Laplacian (50), $L^*$. Its eigenvalues are nonnegative and it has a 0 eigenvalue. Let $\lambda_1 \leq \cdots \leq \lambda_m$ be the 2nd through (m+1)st smallest eigenvalues and $v_1, \ldots, v_m$ be corresponding eigenvectors. Then for $\lambda_l$ such a small non-negative eigenvalue,
$$\lambda_l = v_l'L^*v_l = v_l'G^{-1/2}LG^{-1/2}v_l = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\left(\frac{v_{li}}{\sqrt{g_i}} - \frac{v_{lj}}{\sqrt{g_j}}\right)^2 \geq 0 \qquad (53)$$
and points $x_i$ and $x_j$ with large adjacencies must have similar corresponding coordinates of the vector $G^{-1/2}v_l$. So one might treat vectors $G^{-1/2}v_l$ (or perhaps normalized versions of them) as a second version of m graphical features.
It is also easy to see that
$$P \equiv G^{-1}S$$
is a stochastic matrix and thus specifies an N-state stationary Markov Chain. It is plausible that the standardized graph Laplacian $\widetilde{L} = I - P$ identifies groups of states such that transition by such a chain between the groups is relatively infrequent (the MCMC more typically moves within groups).
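As a concrete illustration (not from the notes), the sketch below builds a similarity matrix S with a Gaussian adjacency (an arbitrary choice here), forms the Laplacians above, and takes the eigenvectors associated with the smallest eigenvalues as graphical features.

```python
import numpy as np

rng = np.random.default_rng(4)
# two well-separated clumps of points in R^2
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(5, 0.3, size=(10, 2))])

sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
S = np.exp(-sq_dists)                    # adjacencies/similarities s_ij
np.fill_diagonal(S, 0.0)

g = S.sum(axis=1)                        # node degrees
G = np.diag(g)
L = G - S                                # unnormalized graph Laplacian
L_star = np.diag(g**-0.5) @ L @ np.diag(g**-0.5)   # symmetric normalized Laplacian (50)

eigvals, eigvecs = np.linalg.eigh(L)
# the number of (nearly) zero eigenvalues indicates the number of connected structures
print(np.round(eigvals[:4], 6))          # expect two near-zero values for two clumps
# eigenvectors for the smallest nonzero eigenvalues serve as graphical features
features = eigvecs[:, 1:3]
```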
Part II
Supervised Learning I: Basic Prediction Methodology
3 (Non-OLS) SEL Linear Predictors
There is more to say about the development of a linear predictor
$$\hat{f}(x) = x'\hat{\beta}$$
for an appropriate $\hat{\beta} \in \Re^p$ than what is said in books and courses on ordinary linear models (where ordinary least squares is used to fit the linear form to all p input variables or to some subset of M of them). We continue the basic notation of Section 2, where the (supervised learning) problem is prediction, and there is a vector of continuous outputs, Y, of interest.
3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods
An alternative to seeking to find a suitable level of complexity in a linear prediction rule through subset selection and least squares fitting of a linear form to the selected variables, is to employ a shrinkage method based on a penalized version of least squares to choose a vector $\hat{\beta} \in \Re^p$ to employ in a linear prediction rule. Here we consider several such methods, all of which have parameters that function as complexity measures and allow $\hat{\beta}$ to range between 0 and $\hat{\beta}^{\,\mathrm{ols}}$ depending upon complexity.
The implementation of these methods is not equivariant to the scaling used to express the input variables $x_j$. So that we can talk about properties of the methods that are associated with a well-defined scaling, we assume here that the output variable has been centered (i.e. that $\langle Y, 1\rangle = 0$) and that the columns of X have been standardized (and if originally X had a constant column, it has been removed).
Here $\lambda$ is a penalty/complexity parameter that controls how much $\hat{\beta}^{\,\mathrm{ols}}$ is shrunken towards 0. The unconstrained minimization problem expressed in (54) has an equivalent constrained minimization description as
for an appropriate t > 0. (Corresponding to $\lambda$ used in form (54), is $t = \big\|\hat{\beta}^{\,\mathrm{ridge}}_{\lambda}\big\|^2$ used in display (55). Conversely, corresponding to t used in form (55), one may use a value of $\lambda$ in display (54) producing the same error sum of squares.) Figure 15 is a representation of the constrained version of the ridge optimization problem for p = 2. Pictured are a contour plot for the quadratic error sum of squares $(Y - X\beta)'(Y - X\beta)$ function of $\beta$, the constraint region for $\beta$, $\hat{\beta}^{\,\mathrm{ols}}$, and $\hat{\beta}^{\,\mathrm{ridge}}_t$.
The unconstrained form (54) calls upon one to minimize
$$(Y - X\beta)'(Y - X\beta) + \lambda\,\beta'\beta$$
$$\hat{\beta}^{\,\mathrm{ridge}} = \left(X'X + \lambda I\right)^{-1}X'Y$$
Figure 15: Cartoon representing the constrained version of ridge optimization for p = 2.
$$0 < \frac{d_{j+1}^2}{d_{j+1}^2 + \lambda} \leq \frac{d_j^2}{d_j^2 + \lambda} < 1$$
we see that the coefficients of the orthonormal basis vectors $u_j$ employed to get $\widehat{Y}^{\,\mathrm{ridge}}$ are shrunken versions of the coefficients applied to get $\widehat{Y}^{\,\mathrm{ols}}$. The most severe shrinking is enforced in the directions of the smallest principal components of X (the $u_j$ least important in making up low rank approximations to X). Since from representation (56)
$$\left\|\widehat{Y}^{\,\mathrm{ridge}}\right\|^2 = \sum_{j=1}^{r}\left(\frac{d_j^2}{d_j^2 + \lambda}\right)^2\langle Y, u_j\rangle^2$$
the "size" of the ridge prediction vector for the N centered responses is decreasing in $\lambda$.
Notice also from representation (56) that
$$\widehat{Y}^{\,\mathrm{ridge}} = \sum_{j=1}^{r}\left(\frac{1}{d_j^2 + \lambda}\right)\langle Y, Xv_j\rangle\, Xv_j = X\left(\sum_{j=1}^{r}\left(\frac{1}{d_j^2 + \lambda}\right)\langle Y, Xv_j\rangle\, v_j\right)$$
so that
$$\hat{\beta}^{\,\mathrm{ridge}} = \sum_{j=1}^{r}\left(\frac{1}{d_j^2 + \lambda}\right)\langle Y, Xv_j\rangle\, v_j$$
and
$$\left\|\hat{\beta}^{\,\mathrm{ridge}}\right\|^2 = \sum_{j=1}^{r}\left(\frac{1}{d_j^2 + \lambda}\right)^2\langle Y, Xv_j\rangle^2$$
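A short numpy sketch (illustrative only) of ridge regression computed through the SVD, exhibiting the shrinkage factors $d_j^2/(d_j^2+\lambda)$ and the effective degrees of freedom $\sum_j d_j^2/(d_j^2+\lambda)$ discussed below:

```python
import numpy as np

def ridge_via_svd(X, Y, lam):
    """Ridge coefficients, fitted values, and effective degrees of freedom via the SVD of X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)                    # shrinkage factors on the u_j coordinates
    beta = Vt.T @ ((d / (d**2 + lam)) * (U.T @ Y))
    Y_hat = U @ (shrink * (U.T @ Y))
    return beta, Y_hat, shrink.sum()                # last value is df(lambda)

rng = np.random.default_rng(12)
X = rng.normal(size=(50, 5)); X = (X - X.mean(0)) / X.std(0)
Y = X @ rng.normal(size=5) + rng.normal(size=50); Y -= Y.mean()

beta, Y_hat, df = ridge_via_svd(X, Y, lam=2.0)
# agrees with the direct formula (X'X + lam I)^{-1} X'Y
assert np.allclose(beta, np.linalg.solve(X.T @ X + 2.0 * np.eye(5), X.T @ Y))
```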
$$\beta_jx_j + \beta_{j'}x_{j'} = \alpha\left(\beta_j + \beta_{j'}\right)x_j + (1-\alpha)\left(\beta_j + \beta_{j'}\right)x_{j'}$$
$$\alpha^2\left(\beta_j + \beta_{j'}\right)^2 + (1-\alpha)^2\left(\beta_j + \beta_{j'}\right)^2 = \left(\alpha^2 + (1-\alpha)^2\right)\left(\beta_j + \beta_{j'}\right)^2$$
which is minimum at $\alpha = 1/2$, where the coefficients for $x_j$ and $x_{j'}$ are the same.
The function
$$df(\lambda) \equiv \mathrm{tr}\left(X\left(X'X + \lambda I\right)^{-1}X'\right) = \mathrm{tr}\left(UD\left(D^2 + \lambda I\right)^{-1}DU'\right) = \mathrm{tr}\left(\sum_{j=1}^{r}\left(\frac{d_j^2}{d_j^2 + \lambda}\right)u_ju_j'\right) = \mathrm{tr}\left(\sum_{j=1}^{r}\left(\frac{d_j^2}{d_j^2 + \lambda}\right)u_j'u_j\right) = \sum_{j=1}^{r}\left(\frac{d_j^2}{d_j^2 + \lambda}\right)$$
is called the "effective degrees of freedom" associated with the ridge regression. In regard to this choice of nomenclature, note that if $\lambda = 0$ ridge regression is ordinary least squares and this is r, the usual degrees of freedom associated with projection onto C(X), i.e. the trace of the projection matrix onto this column space.
As $\lambda \to \infty$, the effective degrees of freedom goes to 0 as (the centered) $\widehat{Y}^{\,\mathrm{ridge}}$ goes to 0 (corresponding to a constant predictor). Notice also (for future reference) that since $\widehat{Y}^{\,\mathrm{ridge}} = X\hat{\beta}^{\,\mathrm{ridge}} = X\left(X'X + \lambda I\right)^{-1}X'Y = M_\lambda Y$ for $M_\lambda = X\left(X'X + \lambda I\right)^{-1}X'$, if one assumes that
$$\mathrm{Cov}\, Y = \sigma^2 I$$
(conditioned on the $x_i$ in the training data, the outputs are uncorrelated and have constant variance $\sigma^2$) then
$$\text{effective degrees of freedom} = \mathrm{tr}\left(M_\lambda\right) = \frac{1}{\sigma^2}\sum_{i=1}^{N}\mathrm{Cov}\left(\hat{y}_i, y_i\right) \qquad (57)$$
and the terms $\mathrm{Cov}(\hat{y}_i, y_i)$ are the diagonal elements of the upper right block of this covariance matrix. This suggests that $\mathrm{tr}(M_\lambda)$ is a plausible general definition for effective degrees of freedom for any linear fitting method $\widehat{Y} = MY$, and that more generally, the last form in form (57) might be used in situations where $\widehat{Y}$ is other than a linear form in Y. Further (reasonably enough) the last form is a measure of how strongly the outputs in the training set can be expected to be related to their predictions.
Further, in the linear case with $\widehat{Y} = MY$,
$$\text{effective degrees of freedom} = \mathrm{tr}(M) = \sum_{i=1}^{N}\frac{\partial\hat{y}_i}{\partial y_i}$$
and we see that the effective degrees of freedom is some total measure of how sensitive predictions are at the training inputs $x_i$ to the corresponding training values $y_i$. This raises at least the possibility that in nonlinear cases, an approximate/estimated value of the general effective degrees of freedom (57) might be the random variable
$$\sum_{i=1}^{N}\left.\frac{\partial\hat{y}_i}{\partial y_i}\right|_{Y}$$
$$\hat{\beta}^{\,q}_t = \underset{\beta\ \text{with}\ \sum_{j=1}^{p}|\beta_j|^q \leq t}{\arg\min}\;(Y - X\beta)'(Y - X\beta) \qquad (59)$$
generalizing form (55). The so-called "lasso" is the q = 1 case of form (58) and form (59) and in general, these have been called the "bridge regression" problems. That is, for t > 0
$$\hat{\beta}^{\,\mathrm{lasso}}_t = \underset{\beta\ \text{with}\ \sum_{j=1}^{p}|\beta_j| \leq t}{\arg\min}\;(Y - X\beta)'(Y - X\beta) \qquad (60)$$
Figure 16: Cartoon representing the constrained version of lasso optimization for p = 2.
(in particular its sharp corners at coordinate axes) some coordinates of $\hat{\beta}^{\,\mathrm{lasso}}_t$ are often 0, and the lasso automatically provides simultaneous shrinking of $\hat{\beta}^{\,\mathrm{ols}}$ toward 0 and rational subset selection. (The same is true of cases of form (59) with q < 1.)
Figure 16 is a representation of the constrained version of the lasso optimization problem for p = 2. Pictured are a contour plot for the quadratic error sum of squares $(Y - X\beta)'(Y - X\beta)$ function of $\beta$, the constraint region for $\beta$, $\hat{\beta}^{\,\mathrm{ols}}$, and $\hat{\beta}^{\,\mathrm{lasso}}_t$.
For comparison purposes, Figure 17 provides representations of p = 2 bridge regression constraint regions for t = 1. For q < 1 the regions not only have "corners," but are not convex.
for the lasso. But Zou, Hastie, and Tibshirani in 2007 (AOS) argued that
this is the mean number of non-zero components of $\hat{\beta}^{\,\mathrm{lasso}}_\lambda$. Obviously then, the random variable
$$\widehat{df}(\lambda) = \text{the number of non-zero components of } \hat{\beta}^{\,\mathrm{lasso}}_\lambda$$
is an unbiased estimator of the effective degrees of freedom.
There are a number of modifications of the ridge/lasso idea. One is the "elastic net" idea. This is a compromise between the ridge and lasso methods. For an $\alpha \in (0,1)$ and some t > 0, this is defined by
(The constraint is a compromise between the ridge and lasso constraints.) For comparison purposes, Figure 18 provides some representations of p = 2 elastic net constraint regions for t = 3 (made using some code of Prof. Huaiqing Wu) that clearly show the compromise nature of the elastic net. The constraint regions have "corners" like the lasso regions but are otherwise more rounded than the lasso regions.
Several sources (including a 2005 JRSSB paper of Zou and Hastie) suggest that a modification of the elastic net idea, namely
$$(1 + \lambda_2)\,\hat{\beta}^{\,\mathrm{enet}}_{\lambda_1, \lambda_2} \qquad (61)$$
(for $d_j$'s the singular values of X). The modified form (61) has estimated effective degrees of freedom $(1 + \lambda_2)$ times this value (62).
Breiman proposed a different shrinkage methodology he called the nonnegative garotte that attempts to find "optimal" reweightings of the elements of $\hat{\beta}^{\,\mathrm{ols}}$. That is, for $\lambda > 0$ Breiman considered the vector optimization problem defined by
$$\hat{c} = \underset{c\,\in\,\Re^p\ \text{with}\ c_j \geq 0,\ j=1,\ldots,p}{\arg\min}\left\{\left(Y - X\,\mathrm{diag}(c)\,\hat{\beta}^{\,\mathrm{ols}}\right)'\left(Y - X\,\mathrm{diag}(c)\,\hat{\beta}^{\,\mathrm{ols}}\right) + \lambda\sum_{j=1}^{p}c_j\right\}$$
Ridge Regression: $\hat{\beta}^{\,\mathrm{ols}}_j\,\dfrac{1}{1 + \lambda}$
Lasso and $(1+\lambda_2)\,\hat{\beta}^{\,\mathrm{enet}}_{\lambda_1,\lambda_2}$: $\mathrm{sign}\big(\hat{\beta}^{\,\mathrm{ols}}_j\big)\left(\big|\hat{\beta}^{\,\mathrm{ols}}_j\big| - \dfrac{\lambda}{2}\right)_+$
Elastic Net: $\dfrac{1}{1 + \lambda_2}\,\mathrm{sign}\big(\hat{\beta}^{\,\mathrm{ols}}_j\big)\left(\big|\hat{\beta}^{\,\mathrm{ols}}_j\big| - \dfrac{\lambda_1}{2}\right)_+$
These formulas show that best subset regression provides a kind of "hard thresholding" of the least squares coefficients (setting all but the M largest to 0) and ridge regression provides (the same) shrinking of all coefficients toward 0. Both the lasso and the nonnegative garotte provide a kind of "soft thresholding" of the coefficients (typically "zeroing out" some small ones). The elastic net provides both the ridge type shrinkage of all the coefficients and the lasso soft thresholding. Note that in this "orthonormal columns" case, modification of the elastic net coefficient vector as in formula (61) simply reduces it to a corresponding lasso coefficient vector. (When the predictors are not orthogonal, i.e. uncorrelated, one can expect the modified elastic net to be something other than a lasso.)
For comparison purposes, Figure 19 provides plots of the functions (in the previous table) of OLS coefficients giving ridge (blue), lasso (red), and nonnegative garotte (green) coefficients for the "orthonormal predictors" case. (Solid lines are $\lambda = 1$ plots and dotted ones are for $\lambda = 3$.)
Figure 19: Plots of shrunken coefficients for the "orthonormal inputs $x_j$" case. Ridge is (blue), lasso is (red), and nonnegative garotte is (green).
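For the orthonormal-inputs case summarized in the table above, a short sketch (illustrative only) of the ridge, lasso, and elastic net coefficient maps applied to an OLS coefficient might be the following.

```python
import numpy as np

def ridge_coef(b_ols, lam):
    # ridge: uniform shrinkage of each OLS coefficient
    return b_ols / (1.0 + lam)

def lasso_coef(b_ols, lam):
    # lasso: soft thresholding of each OLS coefficient
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def enet_coef(b_ols, lam1, lam2):
    # elastic net: soft thresholding followed by ridge-type shrinkage;
    # (1 + lam2) times this recovers the lasso map, as noted in the text
    return lasso_coef(b_ols, lam1) / (1.0 + lam2)

b = np.linspace(-3, 3, 7)
print(ridge_coef(b, 1.0))
print(lasso_coef(b, 1.0))
print(enet_coef(b, 1.0, 1.0))
```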
Of course, there can be more than 2 groups, and in the event that each group is of size 1 this reduces to the simple lasso.
Looking at the geometry of the kind of constraint regions that are associated with this methodology, it's plausible (and correct) that it tends to "zero-out" coefficients in groups associated with the penalty. Figure 20 provides a representation of a p = 3 constraint region associated with a grouped lasso where coordinates 1 and 2 of x are grouped separate from coordinate 3. The corresponding lasso region is shown for comparison purposes.
The development of the lasso and related predictors has been built on minimization of a penalized version of the error sum of squares, N err for SEL. All of the theory and representations here are special to this case. But as a practical matter, as long as one has an effective/appropriate optimization algorithm there is nothing to prevent consideration of other losses. Possibilities include at least
LAR regression parameters $\hat{\beta}^{\,\mathrm{LAR}}$ (for the case of X with each column centered and with norm 1 and centered Y) follows. (This is some kind of amalgam of the descriptions of Izenman, CFZ, and the presentation in the 2003 paper Least Angle Regression by Efron, Hastie, Johnstone, and Tibshirani.)
Note that for $\widehat{Y}$ a vector of predictions, the vector
$$\hat{c} \equiv X'\big(Y - \widehat{Y}\big)$$
has elements that are proportional to the correlations between the columns of X (the $x_j$) and the residual vector $R = Y - \widehat{Y}$. We'll let
$$\widehat{C} = \max_j |\hat{c}_j| \quad\text{and}\quad s_j = \mathrm{sign}(\hat{c}_j)$$
(the index of the predictor $x_j$ most strongly correlated with y) and add $j_1$ to an (initially empty) "active set" of indices, A.
2. Move $\widehat{Y}$ from $\widehat{Y}_0$ in the direction of the projection of Y onto the space spanned by $x_{j_1}$ (namely $\langle x_{j_1}, Y\rangle\, x_{j_1}$) until there is another index $j_2 \neq j_1$ with
$$|\hat{c}_{j_2}| = \left|\left\langle x_{j_2}, Y - \widehat{Y}\right\rangle\right| = \left|\left\langle x_{j_1}, Y - \widehat{Y}\right\rangle\right| = |\hat{c}_{j_1}|$$
At that point, call the current vector of predictions $\widehat{Y}_1$ and the corresponding current parameter vector $\hat{\beta}_1$ and add index $j_2$ to the active set A. As it turns out, for
$$\gamma_1 = \min_{j\neq j_1}{}^{+}\left\{\frac{\widehat{C}_0 - \hat{c}_{0j}}{1 - \langle x_j, x_{j_1}\rangle},\; \frac{\widehat{C}_0 + \hat{c}_{0j}}{1 + \langle x_j, x_{j_1}\rangle}\right\}$$
(where the "+" indicates that only positive values are included in the minimization) $\widehat{Y}_1 = \widehat{Y}_0 + s_{j_1}\gamma_1x_{j_1} = s_{j_1}\gamma_1x_{j_1}$ and $\hat{\beta}_1$ is a vector of all 0s except for $s_{j_1}\gamma_1$ in the $j_1$ position. Let $R_1 = Y - \widehat{Y}_1$.
ward the projection of Y onto the sub-space of $\Re^N$ spanned by $\{x_{j_1}, \ldots, x_{j_l}\}$.
This is (as it turns out) in the direction of a unit vector $u_l$ "making equal angles less than 90 degrees with all $x_j$ with $j \in A$" until there is an index $j_{l+1} \notin A$ with
$$|\hat{c}_{l-1, j_{l+1}}| = |\hat{c}_{l-1, j_1}| \;\left(= |\hat{c}_{l-1, j_2}| = \cdots = |\hat{c}_{l-1, j_l}|\right)$$
At that point, with $\widehat{Y}_l$ the current vector of predictions, let $\hat{\beta}_l$ (with only l non-zero entries) be the corresponding coefficient vector, take $R_l = Y - \widehat{Y}_l$ and $\hat{c}_l = X'R_l$. It can be argued that with
$$\gamma_l = \min_{j\notin A}{}^{+}\left\{\frac{\widehat{C}_{l-1} - \hat{c}_{l-1,j}}{1 - \langle x_j, u_l\rangle},\; \frac{\widehat{C}_{l-1} + \hat{c}_{l-1,j}}{1 + \langle x_j, u_l\rangle}\right\}$$
This continues until there are r = rank(X) indices in A, and at that point $\widehat{Y}$ moves from $\widehat{Y}_{r-1}$ to $\widehat{Y}^{\,\mathrm{ols}}$ and $\hat{\beta}$ moves from $\hat{\beta}_{r-1}$ to $\hat{\beta}^{\,\mathrm{ols}}$ (the version of an OLS coefficient vector with non-zero elements only in positions with indices in A). This defines a piecewise linear path for $\widehat{Y}$ (and therefore $\hat{\beta}$) that could, for example, be parameterized by $\|\widehat{Y}\|$ or $\|Y - \widehat{Y}\|$.
There are several issues raised by the description above. For one, the standard exposition of this method seems to be that the direction vector $u_l$ is prescribed by letting $W_l = (s_{j_1}x_{j_1}, \ldots, s_{j_l}x_{j_l})$ and taking
$$u_l = \frac{1}{\left\|W_l\left(W_l'W_l\right)^{-1}1\right\|}\, W_l\left(W_l'W_l\right)^{-1}1$$
It's clear that $W_l'u_l = \frac{1}{\left\|W_l\left(W_l'W_l\right)^{-1}1\right\|}\,1$, so that each of $s_{j_1}x_{j_1}, \ldots, s_{j_l}x_{j_l}$ has the same inner product with $u_l$. What is not immediately clear (but is argued in Efron, Hastie, Johnstone, and Tibshirani) is why one knows that this prescription agrees with a prescription of a unit vector giving the direction from $\widehat{Y}_{l-1}$ to the projection of Y onto the sub-space of $\Re^N$ spanned by $\{x_{j_1}, \ldots, x_{j_l}\}$, namely (for $P_l$ the projection matrix onto that subspace)
$$\frac{1}{\left\|P_lY - \widehat{Y}_{l-1}\right\|}\left(P_lY - \widehat{Y}_{l-1}\right)$$
correspondence between the two points of view is probably correct, but is again not absolutely obvious.
At any rate, the LAR algorithm traces out a path in $\Re^p$ from 0 to $\hat{\beta}^{\,\mathrm{ols}}$. One might think of the point one has reached along that path (perhaps parameterized by $\|\widehat{Y}\|$) as being a complexity parameter governing how flexible a fit this algorithm has allowed, and be in the business of choosing it (by cross-validation or some other method) in exactly the same way one might, for example, choose a ridge parameter $\lambda$.
What is not at all obvious but true, is that a very slight modification of this LAR algorithm produces the whole set of lasso coefficients (60) as its path. One simply needs to enforce the requirement that if a non-zero coefficient hits 0, its index is removed from the active set and a new direction of movement is set based on one less input variable. At any point along the modified LAR path, one can compute $t = \sum_{j=1}^{p}|\beta_j|$, and think of the modified-LAR path as parameterized by t. (While it's not completely obvious, this turns out to be monotone non-decreasing in "progress along the path," or $\|\widehat{Y}\|$.)
A useful graphical representation of the lasso path is one in which all coefficients $\hat{\beta}^{\,\mathrm{lasso}}_{tj}$ are plotted against t on the same set of axes. Something similar is often done for the LAR coefficients (where the plotting is against some measure of progress along the path defined by the algorithm).
$$z_j = Xv_j = d_ju_j$$
$$\widehat{Y}^{\,\mathrm{pcr}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}\, z_j = \sum_{j=1}^{M}\langle Y, u_j\rangle\, u_j \qquad (63)$$
Comparing this to displays (42) and (56) we see that ridge regression shrinks the coefficients of the principal components $u_j$ according to their importance in making up X, while principal components regression "zeros out" those least important in making up X. Further, since the $u_j$ constitute an orthonormal basis for C(X), for rank(X) = r,
$$\left\|\widehat{Y}^{\,\mathrm{pcr}}\right\|^2 = \sum_{j=1}^{M}\langle Y, u_j\rangle^2 \leq \sum_{j=1}^{r}\langle Y, u_j\rangle^2 = \left\|\widehat{Y}^{\,\mathrm{ols}}\right\|^2 \qquad (64)$$
Notice too, that $\widehat{Y}^{\,\mathrm{pcr}}$ can be written in terms of the original inputs as
$$\widehat{Y}^{\,\mathrm{pcr}} = \sum_{j=1}^{M}\frac{1}{d_j}\langle Y, u_j\rangle\, Xv_j = X\left(\sum_{j=1}^{M}\frac{1}{d_j}\langle Y, u_j\rangle\, v_j\right) = X\left(\sum_{j=1}^{M}\frac{1}{d_j^2}\langle Y, Xv_j\rangle\, v_j\right)$$
so that
$$\hat{\beta}^{\,\mathrm{pcr}} = \sum_{j=1}^{M}\frac{1}{d_j^2}\langle Y, Xv_j\rangle\, v_j \qquad (65)$$
and $\hat{\beta}^{\,\mathrm{ols}}$ is the M = r = rank(X) version of $\hat{\beta}^{\,\mathrm{pcr}}$. As the $v_j$ are orthonormal, as in relationship (64) above
$$\left\|\hat{\beta}^{\,\mathrm{pcr}}\right\| \leq \left\|\hat{\beta}^{\,\mathrm{ols}}\right\|$$
and principal components regression shrinks both $\widehat{Y}^{\,\mathrm{ols}}$ toward 0 in $\Re^N$ and $\hat{\beta}^{\,\mathrm{ols}}$ toward 0 in $\Re^p$.
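A compact numpy sketch (illustrative) of principal components regression as in displays (63) and (65), keeping the first M principal components:

```python
import numpy as np

def pcr_fit(X, Y, M):
    """Principal components regression using the first M right singular vectors of X.
    Assumes the columns of X and the output Y have already been centered."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    # beta_pcr = sum_{j<=M} (1/d_j^2) <Y, X v_j> v_j   (display (65))
    coefs = (U[:, :M].T @ Y) / d[:M]        # equals <Y, X v_j>/d_j^2 since X v_j = d_j u_j
    beta = V[:, :M] @ coefs
    Y_hat = X @ beta                        # equals sum_{j<=M} <Y, u_j> u_j (display (63))
    return beta, Y_hat

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=6) + rng.normal(size=40); Y -= Y.mean()
beta_pcr, Y_hat = pcr_fit(X, Y, M=3)
```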
$$= XX'Y$$
It is possible to argue that for $w_1 = X'Y/\|X'Y\|$, $Xw_1 = z_1/\|X'Y\|$ is a linear combination of the columns of X maximizing
$$|\langle Y, Xw\rangle|$$
(which is essentially the absolute sample covariance between the variables y and $x'w$) subject to the constraint that $\|w\| = 1$.17 This follows because
$$\langle Y, Xw\rangle^2 = w'X'YY'Xw$$
and a maximizer of this quadratic form subject to the constraint is the eigenvector of $X'YY'X$ corresponding to its single non-zero eigenvalue. It's then easy to verify that $w_1$ is such an eigenvector corresponding to the non-zero eigenvalue $Y'XX'Y$.
Then define $X^1$ by orthogonalizing the columns of X with respect to $z_1$. That is, define the jth column of $X^1$ by
$$x_j^1 = x_j - \frac{\langle x_j, z_1\rangle}{\langle z_1, z_1\rangle}\, z_1$$
and take
$$z_2 = \sum_{j=1}^{p}\left\langle Y, x_j^1\right\rangle x_j^1 = X^1X^{1\prime}Y$$
$$\left\langle Y, X^1w\right\rangle$$
$$x_j^l = x_j^{l-1} - \frac{\langle x_j^{l-1}, z_l\rangle}{\langle z_l, z_l\rangle}\, z_l$$
and let
$$z_{l+1} = \sum_{j=1}^{p}\left\langle Y, x_j^l\right\rangle x_j^l = X^lX^{l\prime}Y$$
17 Note that upon replacing $|\langle Y, Xw\rangle|$ with $|\langle Xw, Xw\rangle|$ one has the kind of optimization
Partial least squares regression uses the first M of these variables $z_j$ as input variables.
The PLS predictors $z_j$ are orthogonal by construction. Using the first M of these as regressors, one has the vector of fitted output values
$$\widehat{Y}^{\,\mathrm{pls}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}\, z_j$$
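A small numpy sketch (illustrative only) of the PLS construction just described: take $z_1 = XX'Y$, orthogonalize the columns of X against $z_1$, repeat, and regress Y on $z_1, \ldots, z_M$.

```python
import numpy as np

def pls_fit(X, Y, M):
    """Partial least squares as described above; columns of X and Y assumed centered."""
    Xl = X.copy()
    Z = []
    for _ in range(M):
        z = Xl @ (Xl.T @ Y)                           # z_{l+1} = X^l X^l' Y
        Z.append(z)
        # orthogonalize the current columns with respect to z
        Xl = Xl - np.outer(z, z @ Xl) / (z @ z)
    Z = np.column_stack(Z)
    # PLS predictors are orthogonal, so fitted values are a sum of 1-d projections
    Y_hat = sum((Y @ Z[:, j]) / (Z[:, j] @ Z[:, j]) * Z[:, j] for j in range(M))
    return Z, Y_hat

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 5)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=5) + rng.normal(size=30); Y -= Y.mean()
Z, Y_hat = pls_fit(X, Y, M=2)
```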
$$X'X = NI$$
so that
$$\widehat{Y}^{\,\mathrm{ols}} = X\left(X'X\right)^{-1}X'Y = \frac{1}{N}XX'Y = \frac{1}{N}\,z_1$$
i.e.
$$z_1 = N\,\widehat{Y}^{\,\mathrm{ols}}$$
so that
$$\widehat{Y}^{\,\mathrm{pls}}_1 = \widehat{Y}^{\,\mathrm{ols}}$$
and thus $\hat{\beta}^{\,\mathrm{pls}}_1 = \hat{\beta}^{\,\mathrm{pls}}_2 = \cdots = \hat{\beta}^{\,\mathrm{pls}}_p = \hat{\beta}^{\,\mathrm{ols}}$. All steps of partial least squares after the first are simply providing a basis for the orthogonal complement of the 1-dimensional subspace of C(X) generated by $\widehat{Y}^{\,\mathrm{ols}}$ (without improving fitting at all). That is, here changing M doesn't change flexibility of the fit at all. (Presumably, when the $x_j$ are nearly orthogonal, something similar happens.)
This observation about PLS in cases where predictors are orthogonal has another related implication. That is that there will be no naive form for effective degrees of freedom for PLS. Since with $z_j$ the jth principal component of X and, say,
$$Z_M = (z_1, z_2, \ldots, z_M)$$
we have
$$\widehat{Y}^{\,\mathrm{pcr}} = Z_M\left(Z_M'Z_M\right)^{-1}Z_M'Y$$
PLS, PCR, and OLS Partial least squares is a kind of compromise between principal components regression and ordinary least squares. To see this, note that maximizing $|\langle Y, Xw\rangle|$ subject to the constraint that $\|w\| = 1$ is equivalent to maximizing the absolute sample covariance between Y and Xw i.e.
$$\left(\text{sample standard deviation of } y\right)\left(\text{sample standard deviation of } x'w\right)\left|\text{sample correlation between } y \text{ and } x'w\right|$$
or equivalently
$$\left(\text{sample variance of } x'w\right)\left(\text{sample correlation between } y \text{ and } x'w\right)^2 \qquad (67)$$
subject to the constraint. Now if only the first term (the sample variance of $x'w$) were involved in product (67), a first principal component direction would be an optimizing $w_1$, and $z_1 = \|X'Y\|\,Xw_1$ a multiple of the first principal component of X. On the other hand, if only the second term were involved, $\hat{\beta}^{\,\mathrm{ols}}/\|\hat{\beta}^{\,\mathrm{ols}}\|$ would be an optimizing $w_1$, and $z_1 = \widehat{Y}^{\,\mathrm{ols}}\,\|X'Y\|/\|\hat{\beta}^{\,\mathrm{ols}}\|$ a multiple of the vector of ordinary least squares fitted values. The use of the product of two terms can be expected to produce a compromise between these two.
Note further that this logic applied at later steps in the PLS algorithm then produces for $z_l$ a compromise between a first principal component of $X^{l-1}$ and a suitably constrained multiple of the vector of least squares fitted values based on the matrix of inputs $X^{l-1}$. The matrices $X^l$ have columns that are the projections of the corresponding columns of X onto the orthogonal complement in C(X) of the span of $\{z_1, z_2, \ldots, z_l\}$ (i.e. are corresponding columns of X minus their projections onto the span of $\{z_1, z_2, \ldots, z_l\}$) and
$$C(X) \supseteq C\left(X^1\right) \supseteq C\left(X^2\right) \supseteq \cdots$$
of functions, whereby any function of interest can be represented (or practically speaking, at
$\{h_m\}$ and predictors of the form
$$\hat{f}(x) = \sum_{m=1}^{p}\hat{\beta}_m h_m(x) = h(x)'\hat{\beta} \qquad (68)$$
for $h(x) = (h_1(x), \ldots, h_p(x))'$. (The general notation used in Section 1.4.5 was T(x) rather than the h(x) being used here. The slight specialization here is to the case where the components of the vector-valued h(x) are "basis" functions.)
We next consider some flexible methods employing this idea. Notice that fitting of form (68) can be done using any of the methods just discussed based on the $N \times p$ matrix of inputs
$$X = \left(h_j(x_i)\right) = \begin{pmatrix} h(x_1)'\\ h(x_2)'\\ \vdots\\ h(x_N)'\end{pmatrix}$$
For example, using $M \leq N/2$ sin-cos pairs and the constant, one could consider fitting the forms
$$f(x) = \beta_0 + \sum_{m=1}^{M}\beta_{1m}\sin(m2\pi x) + \sum_{m=1}^{M}\beta_{2m}\cos(m2\pi x) \qquad (69)$$
If one has training $x_i$ on an appropriate regular grid, the use of form (69) leads to orthogonality in the $N \times (2M+1)$ matrix of values of the basis functions X and simple/fast calculations.
least approximated) as a linear combination of the "basis" elements. Periodic functions of a single variable can be approximated by linear combinations of sine (basis) functions of various frequencies. General differentiable functions can be approximated by polynomials (linear combinations of monomial basis functions). Etc.
Unless, however, one believes that E[y|x = u] is periodic in u, form (69) has its serious limitations. In particular, unless M is very large, a trigonometric series like (69) will typically provide a poor approximation for a function that varies at different scales on different parts of [0,1], and in any case, the coefficients necessary to provide such localized variation at different scales have no obvious simple interpretations/connections to the irregular pattern of variation being described. So-called "wavelet bases" are much more useful in providing parsimonious and interpretable approximations to such functions. The simplest wavelet basis for $L_2[0,1]$ is the Haar basis that we proceed to describe.
Define the so-called Haar "father" wavelet
Linear combinations of these functions provide all elements of $L_2[0,1]$ that are constant on $\left(0, \frac{1}{2}\right]$ and on $\left(\frac{1}{2}, 1\right]$. Write
$$\Psi_0 = \{\varphi, \psi\}$$
Next, define
$$\psi_{1,0}(x) = \sqrt{2}\left(I\left[0 < x \leq \tfrac{1}{4}\right] - I\left[\tfrac{1}{4} < x \leq \tfrac{1}{2}\right]\right) \quad\text{and}\quad \psi_{1,1}(x) = \sqrt{2}\left(I\left[\tfrac{1}{2} < x \leq \tfrac{3}{4}\right] - I\left[\tfrac{3}{4} < x \leq 1\right]\right)$$
and let
$$\Psi_1 = \{\psi_{1,0}, \psi_{1,1}\}$$
Using the set of functions $\Psi_0 \cup \Psi_1$ one can build (as linear combinations) all elements of $L_2[0,1]$ that are constant on $\left(0, \frac{1}{4}\right]$ and on $\left(\frac{1}{4}, \frac{1}{2}\right]$ and on $\left(\frac{1}{2}, \frac{3}{4}\right]$ and on $\left(\frac{3}{4}, 1\right]$.
The story then goes on as one should expect. One defines
$$\psi_{2,0}(x) = 2\left(I\left[0 < x \leq \tfrac{1}{8}\right] - I\left[\tfrac{1}{8} < x \leq \tfrac{1}{4}\right]\right) \quad\text{and}\quad \psi_{2,1}(x) = 2\left(I\left[\tfrac{1}{4} < x \leq \tfrac{3}{8}\right] - I\left[\tfrac{3}{8} < x \leq \tfrac{1}{2}\right]\right) \quad\text{and}$$
$$\psi_{2,2}(x) = 2\left(I\left[\tfrac{1}{2} < x \leq \tfrac{5}{8}\right] - I\left[\tfrac{5}{8} < x \leq \tfrac{3}{4}\right]\right) \quad\text{and}\quad \psi_{2,3}(x) = 2\left(I\left[\tfrac{3}{4} < x \leq \tfrac{7}{8}\right] - I\left[\tfrac{7}{8} < x \leq 1\right]\right)$$
and lets
$$\Psi_2 = \{\psi_{2,0}, \psi_{2,1}, \psi_{2,2}, \psi_{2,3}\}$$
Figure 21 shows the sets of basis functions $\Psi_0$, $\Psi_1$, and $\Psi_2$.
Figure 21: Sets of Haar basis functions $\Psi_0$ (blue), $\Psi_1$ (red), and $\Psi_2$ (green).
In general,
$$\psi_{m,j}(x) = \sqrt{2^m}\,\psi\!\left(2^m\!\left(x - \frac{j}{2^m}\right)\right) \quad\text{for } j = 0, 1, 2, \ldots, 2^m - 1$$
and
$$\Psi_m = \{\psi_{m,0}, \psi_{m,1}, \ldots, \psi_{m,2^m-1}\}$$
The Haar basis of $L_2[0,1]$ is then
$$\bigcup_{m=0}^{\infty}\Psi_m$$
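A short sketch (illustrative) of the Haar functions just defined, evaluating the father wavelet, the mother wavelet, and $\psi_{m,j}$ on a grid and checking orthonormality numerically:

```python
import numpy as np

def father(x):
    # Haar "father" wavelet: the indicator of (0, 1]
    return ((x > 0) & (x <= 1)).astype(float)

def mother(x):
    # Haar "mother" wavelet: +1 on (0, 1/2], -1 on (1/2, 1]
    return ((x > 0) & (x <= 0.5)).astype(float) - ((x > 0.5) & (x <= 1)).astype(float)

def psi(m, j, x):
    # psi_{m,j}(x) = sqrt(2^m) * mother(2^m (x - j/2^m)), for j = 0, ..., 2^m - 1
    return np.sqrt(2.0**m) * mother(2.0**m * (x - j / 2.0**m))

x = np.linspace(0.001, 1, 1000)
# the functions in Psi_0, Psi_1, Psi_2 are orthonormal in L2[0,1] (checked approximately)
basis = [father(x), mother(x)] + [psi(m, j, x) for m in (1, 2) for j in range(2**m)]
G = np.array([[np.mean(f * g) for g in basis] for f in basis])   # approximate inner products
assert np.allclose(G, np.eye(len(basis)), atol=0.05)
```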
Then, one might entertain use of the Haar basis functions through order M in constructing a form
$$f(x) = \beta_0 + \sum_{m=0}^{M}\sum_{j=0}^{2^m-1}\beta_{mj}\,\psi_{m,j}(x) \qquad (70)$$
(with the understanding that $\psi_{0,0} = \psi$), a form that in general allows building of functions that are constant on consecutive intervals of length $1/2^{M+1}$. This form can be fit by any of the various regression methods (especially involving thresholding/selection, as a typically very large number, $2^{M+1}$, of basis functions is employed in form (70)). (See HTF Section 5.9.2 for some discussion of using the lasso with wavelets.) Large absolute values of coefficients $\beta_{mj}$ encode scales at which important variation occurs (in the value of the index m) and locations in [0,1] where that variation occurs (in the value $j/2^m$). Where (perhaps after model selection/thresholding) only a relatively few fitted coefficients are important, the corresponding scales and locations provide an informative and compact summary of the fit. A nice visual summary of the results of the fit can be made by plotting for each m (plots arranged vertically, from M through 0, aligned and to the same scale) spikes of length $|\beta_{mj}|$ pointed in the direction of $\mathrm{sign}(\beta_{mj})$ along an "x" axis at positions (say) $(j/2^m) + 1/2^{m+1}$.
In special situations where $N = 2^K$ and
$$x_i = \frac{i}{2^K} \quad\text{for } i = 1, 2, \ldots, 2^K$$
and one uses the Haar basis functions through order $K - 1$, the fitting of form (70) is computationally clean, since the vectors
$$\left(\psi_{m,j}(x_1), \ldots, \psi_{m,j}(x_N)\right)'$$
and forms for f(x) that are
Further, one can enforce continuity and differentiability (at the knots) conditions on a form $f(x) = \sum_{m=1}^{(M+1)(K+1)}\beta_m h_m(x)$ by enforcing some linear relations between appropriate ones of the $\beta_m$. While this is conceptually simple, it is messy. It is much cleaner to simply begin with a set of basis functions that are tailored to have the desired continuity/differentiability properties.
A set of M + 1 + K basis functions for piecewise polynomials of degree M with derivatives of order M - 1 at all knots is easily seen to be
$$1, x, x^2, \ldots, x^M, (x - \xi_1)_+^M, (x - \xi_2)_+^M, \ldots, (x - \xi_K)_+^M$$
(since the value and first M - 1 derivatives of $(x - \xi_j)_+^M$ at $\xi_j$ are all 0). The choice of M = 3 is fairly standard.
Since extrapolation with polynomials typically gets worse with order, it is common to impose a restriction that outside $(\xi_1, \xi_K)$ a form f(x) be linear. For the case of M = 3 this can be accomplished by beginning with basis functions $1, x, (x - \xi_1)_+^3, (x - \xi_2)_+^3, \ldots, (x - \xi_K)_+^3$ and imposing restrictions necessary to force 2nd and 3rd derivatives to the right of $\xi_K$ to be 0. Notice that (considering $x > \xi_K$)
$$\frac{d^2}{dx^2}\left(\beta_0 + \beta_1 x + \sum_{j=1}^{K}\theta_j(x - \xi_j)_+^3\right) = 6\sum_{j=1}^{K}\theta_j(x - \xi_j) \qquad (71)$$
and
$$\frac{d^3}{dx^3}\left(\beta_0 + \beta_1 x + \sum_{j=1}^{K}\theta_j(x - \xi_j)_+^3\right) = 6\sum_{j=1}^{K}\theta_j \qquad (72)$$
So, linearity for large x requires (from equation (72)) that $\sum_{j=1}^{K}\theta_j = 0$. Further, substituting this into relationship (71) means that linearity also requires that $\sum_{j=1}^{K}\theta_j\xi_j = 0$. Using the first of these to conclude that $\theta_K = -\sum_{j=1}^{K-1}\theta_j$ and substituting into the second yields
$$\theta_{K-1} = -\sum_{j=1}^{K-2}\theta_j\,\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}$$
and then
$$\theta_K = \sum_{j=1}^{K-2}\theta_j\,\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}} - \sum_{j=1}^{K-2}\theta_j$$
These then suggest the set of basis functions consisting of 1, x and for $j = 1, 2, \ldots, K - 2$
$$(x - \xi_j)_+^3 - \frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}(x - \xi_{K-1})_+^3 + \frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}(x - \xi_K)_+^3 - (x - \xi_K)_+^3 \qquad (73)$$
$$= (x - \xi_j)_+^3 - \frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}(x - \xi_{K-1})_+^3 + \frac{\xi_{K-1} - \xi_j}{\xi_K - \xi_{K-1}}(x - \xi_K)_+^3$$
(These are essentially the basis functions that HTF call their $N_j$.) Their use produces so-called "natural" (linear outside $(\xi_1, \xi_K)$) cubic regression splines.
There are other (harder to motivate, but in the end more pleasing and computationally more attractive) sets of basis functions for natural polynomial splines. See the B-spline material at the end of HTF Chapter 5.
The biggest problem with this potential method is the explosion in the size of a tensor product basis as p increases. For example, using K knots for cubic regression splines in each of p dimensions produces $(4 + K)^p$ basis functions for the p-dimensional problem. Some kind of forward selection algorithm or shrinking of coefficients will be needed to produce any kind of workable fits with such large numbers of basis functions. For example, the multivariate smoothing routines provided in the mgcv R package of Wood allow for quadratically penalized (ridge regression type) fitting of forms like (74). The following discussion of "MARS" concerns one kind of forward selection algorithm using (data-dependent) linear regression spline basis functions and products of them for building predictors
($x_{ij}$ is the jth coordinate of the ith input training vector and both $h_{ij1}(x)$ and $h_{ij2}(x)$ depend on x only through the jth coordinate of x) portrayed in Figure 22.
and set
$$\hat{f}_1(x) = \hat{\beta}_0 + \hat{\beta}_{11}g_{11}(x) + \hat{\beta}_{12}g_{12}(x)$$
19 Notice that in the framework of Section 1.4.5 these functions of the input x are of the
2. At stage l of the predictor-building process, with predictor
$$\hat{f}_{l-1}(x) = \hat{\beta}_0 + \sum_{m=1}^{l-1}\left(\hat{\beta}_{m1}g_{m1}(x) + \hat{\beta}_{m2}g_{m2}(x)\right)$$
in hand, consider for addition to the model pairs of functions that are either of the basic form (75) or of the form
or of the form
where M(l) is some kind of degrees of freedom figure. One must take account of both the fitting of the coefficients in this and the fact that knots (values $x_{ij}$) have been chosen. The HTF recommendation is to use
(where presumably the knot count refers to different $x_{ij}$ appearing in at least one $g_{m1}(x)$ or $g_{m2}(x)$).
Other versions of "MARS" algorithms potentially remove the constraint that
no xj appear in any candidate product more than once (eliminating the piece-
wise linearity of sections of the predictor), consider not pairs but single hinge
functions at each stage of feature addition, and/or follow a forward-selection
search for features with a backwards-elimination phase (these guided by signif-
icant "change in SSE" or "F/t test" criteria). All of these variants amount
91
to the "special sauce" of a particular MARS implementation set by its de-
signer/programmer. Particular implementations have user-selectable parame-
ters like the maximum number of terms in a forward selection phase, the maxi-
mum order of (pure and mixed) terms considered, the "signi…cance level" used
for guiding forward and backward phases of selection of "features," etc. In
practical application, one should select these parameters via cross-validation,
more or less thinking of whatever choices the developer has made in his or her
implementation as simply de…ning some …tting/predictor-building "black box."
A routine like the train() function in caret is invaluable in making these
choices.
Figure 23 portrays a simple predictor (of home sales price) of the kind that
a MARS algorithm can produce.
Figure 23: An example of the kind of prediction surface that can be generated
by a MARS algorithm. "Price" varies with two predictors.
Amazingly enough, this optimization problem has a solution that can be fairly simply described. $\hat{f}$ is a natural cubic spline with knots at the distinct values $x_i$ in the training set. That is, for a set of (now data-dependent, as the knots come from the training data) basis functions for such splines
$$h_1, h_2, \ldots, h_N$$
(here we're tacitly assuming that the N values of the input variable in the training set are all different)
$$\hat{f}(x) = \sum_{j=1}^{N}\hat{\beta}_j h_j(x) \qquad (76)$$
and so
$$\left(g''(x)\right)^2 = \sum_{j=1}^{N}\sum_{l=1}^{N}\beta_j\beta_l\, h_j''(x)\, h_l''(x)$$
Then, for $\beta = (\beta_1, \beta_2, \ldots, \beta_N)'$ and20
$$\underset{N\times N}{\Omega} = \left(\int_a^b h_j''(t)\, h_l''(t)\, dt\right)$$
relatively simple formulas for the entries of $\Omega$. See the exercises for this section for details.
Corresponding to coefficient vector (78) is a vector of smoothed output values
$$\widehat{Y} = H\left(H'H + \lambda\Omega\right)^{-1}H'Y$$
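As a generic illustration (not the exact spline basis of the notes), the penalized least squares smoother matrix $S_\lambda = H\left(H'H + \lambda\Omega\right)^{-1}H'$ and its effective degrees of freedom $\mathrm{tr}(S_\lambda)$ can be computed as in the following sketch, here with an arbitrary polynomial basis matrix H and a crude stand-in for the roughness penalty matrix $\Omega$.

```python
import numpy as np

def penalized_smoother(H, Omega, lam):
    """Return the smoother matrix S_lambda = H (H'H + lam*Omega)^{-1} H'."""
    return H @ np.linalg.solve(H.T @ H + lam * Omega, H.T)

rng = np.random.default_rng(7)
N = 25
x = np.sort(rng.uniform(size=N))
H = np.vander(x, 6, increasing=True)                  # columns 1, x, ..., x^5
dd = np.array([[j * (j - 1) * t**(j - 2) if j >= 2 else 0.0 for j in range(6)] for t in x])
Omega = dd.T @ dd / N                                 # crude stand-in for the integral penalty

y = np.sin(4 * x) + rng.normal(scale=0.2, size=N)
for lam in (0.0, 0.1, 10.0):
    S = penalized_smoother(H, Omega, lam)
    y_hat = S @ y
    print(lam, np.trace(S))                           # effective degrees of freedom tr(S_lambda)
```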
$$P_BP_B = P_B$$
i.e. $P_B$ is idempotent,
$$S_\lambda S_\lambda \preceq S_\lambda$$
meaning that $S_\lambda - S_\lambda S_\lambda$ is non-negative definite. $P_B$ is of rank $p = \mathrm{tr}(P_B)$, while $S_\lambda$ is of rank N.
In a manner similar to what is done in ridge regression we might define an "effective degrees of freedom" for $S_\lambda$ (or for smoothing) as
$$df(\lambda) \equiv \mathrm{tr}\left(S_\lambda\right) \qquad (79)$$
We proceed to develop motivation and a formula for this quantity and for $\widehat{Y}$. Notice that for
$$K \equiv \left(H'\right)^{-1}\Omega H^{-1}$$
one has
$$S_\lambda = H\left(H'H + \lambda\Omega\right)^{-1}H' = H\left(H'\left(I + \lambda\left(H'\right)^{-1}\Omega H^{-1}\right)H\right)^{-1}H' = HH^{-1}\left(I + \lambda\left(H'\right)^{-1}\Omega H^{-1}\right)^{-1}\left(H'\right)^{-1}H' = (I + \lambda K)^{-1} \qquad (80)$$
so that this matrix K can be thought of as defining a "penalty" in fitting a smoothed version of Y.
Then, since $S_\lambda$ is symmetric non-negative definite, it has an eigen decomposition as
$$S_\lambda = UDU' = \sum_{j=1}^{N}d_ju_ju_j' \qquad (82)$$
$$\det(K - \rho I) = 0$$
Now
$$\det(K - \rho I) \propto \det\left[(I + \lambda K) - (1 + \lambda\rho)\,I\right]$$
for
$$\rho_1 \geq \rho_2 \geq \cdots \geq \rho_{N-2} \geq \rho_{N-1} = \rho_N = 0$$
the eigenvalues of K (that themselves do not depend upon $\lambda$). So, for example,
in light of facts (79), (82), and (83), the smoothing effective degrees of freedom are
$$df(\lambda) = \mathrm{tr}\left(S_\lambda\right) = \sum_{j=1}^{N}d_j = 2 + \sum_{j=1}^{N-2}\frac{1}{1 + \lambda\rho_j}$$
which is clearly decreasing in $\lambda$ (with minimum value 2 in light of the fact that $S_\lambda$ has two eigenvalues that are 1).
Further, consider $u_j$, the eigenvector of $S_\lambda$ corresponding to eigenvalue $d_j$. $S_\lambda u_j = d_ju_j$ so that
$$u_j = S_\lambda^{-1}d_ju_j = (I + \lambda K)\,d_ju_j$$
so that
$$u_j = d_ju_j + \lambda d_jKu_j$$
and thus
$$Ku_j = \frac{1 - d_j}{\lambda d_j}\,u_j = \rho_{N-j+1}\,u_j$$
That is, $u_j$ is an eigenvector of K corresponding to the $(N-j+1)$st largest eigenvalue. That is, for all $\lambda$ the eigenvectors of $S_\lambda$ are eigenvectors of K and thus do not depend upon $\lambda$.
Then, for any $\lambda$
$$\widehat{Y} = S_\lambda Y = \left(\sum_{j=1}^{N}d_ju_ju_j'\right)Y = \sum_{j=1}^{N}d_j\,\langle u_j, Y\rangle\, u_j = \langle u_1, Y\rangle u_1 + \langle u_2, Y\rangle u_2 + \sum_{j=3}^{N}\frac{\langle u_j, Y\rangle}{1 + \lambda\rho_{N-j+1}}\, u_j \qquad (84)$$
U = (uN ; uN 1 ; : : : ; u1 )
= (u1 ; u2 ; : : : ; uN )
and criterion (81) can be written as
$$\underset{v\,\in\,\Re^N}{\text{minimize}}\;(Y - v)'(Y - v) + \lambda\, v'U\,\mathrm{diag}\left(\rho_1, \rho_2, \ldots, \rho_N\right)U'v$$
or equivalently as
$$\underset{v\,\in\,\Re^N}{\text{minimize}}\;\left((Y - v)'(Y - v) + \lambda\sum_{j=1}^{N-2}\rho_j\,\langle u_j, v\rangle^2\right) \qquad (85)$$
for
$$J[h] \equiv \int\!\!\int_{\Re^2}\left(\left(\frac{\partial^2 h}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2 h}{\partial x_1\partial x_2}\right)^2 + \left(\frac{\partial^2 h}{\partial x_2^2}\right)^2\right)dx_1\, dx_2$$
An optimizing $\hat{f}: \Re^2 \to \Re$ can be identified and is called a "thin plate spline." As $\lambda \to 0$, $\hat{f}$ becomes an interpolator; as $\lambda \to \infty$ it defines the OLS plane through the data in 3-space. In general, it can be shown to be of the form
$$f(x) = \beta_0 + \beta'x + \sum_{i=1}^{N}\alpha_i g_i(x) \qquad (86)$$
where $g_i(x) = \eta\left(\|x - x_i\|\right)$ for $\eta(z) = z^2\ln z^2$. The $g_i(x)$ are "radial basis functions" (radially symmetric basis functions) and fitting is accomplished much as for the p = 1 case. The form (86) is plugged into the optimization criterion and a discrete penalized least squares problem emerges (after taking account of some linear constraints that are required to keep $J[f] < \infty$). HTF seem to indicate that in order to keep computations from exploding with N, it usually suffices to replace the N functions $g_i(x)$ in form (86) with $K \ll N$ functions $g_i(x) = \eta\left(\|x - x_i\|\right)$ for K potential input vectors $x_i$ placed on a rectangular grid covering the convex hull of the N training data input vectors $x_i$.
For large p, one might simply declare that attention is going to be limited to predictors of some restricted form, and for h in that restricted class, seek to optimize
$$\sum_{i=1}^{N}\left(y_i - h(x_i)\right)^2 + \lambda J[h]$$
and invent an appropriate penalty function. It seems like a sum of 1-d smoothing spline penalties on the $g_j$ and 2-d thin plate spline penalties on the $g_{jk}$ is the most obvious starting point. Details of fitting are a bit murky (though I am sure that they can be found in books on generalized additive models). Presumably one cycles through the summands in display (87) iteratively fitting functions to sets of residuals defined by the original $y_i$ minus the sums of all other current versions of the components until some convergence criterion is satisfied. Function (87) has a kind of "main effects plus 2-factor interactions" form, but it is (at least in theory) possible to also consider higher order terms in this kind of expansion.
5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in $\Re^N$
In abstraction of the smoothing spline development, suppose that $\{u_j\}$ is a set of $M \leq N$ orthonormal N-vectors, $\rho_j \geq 0$ for $j = 1, 2, \ldots, M$, and consider the optimization problem
$$\underset{v\,\in\,\mathrm{span}\{u_j\}}{\text{minimize}}\;\left((Y - v)'(Y - v) + \sum_{j=1}^{M}\rho_j\,\langle u_j, v\rangle^2\right)$$
For $v = \sum_{j=1}^{M}c_ju_j \in \mathrm{span}\{u_j\}$, the penalty is $\sum_{j=1}^{M}\rho_j\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\rho_jc_j^2$ and in this penalty, $\rho_j$ is a multiplier of the squared length of the component of v in the direction of $u_j$. The optimization criterion is then
$$(Y - v)'(Y - v) + \sum_{j=1}^{M}\rho_j\,\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\left(\langle u_j, Y\rangle - c_j\right)^2 + \sum_{j=1}^{M}\rho_jc_j^2$$
$$c_j^{\mathrm{opt}} = \frac{\langle u_j, Y\rangle}{1 + \rho_j}$$
i.e.
$$\widehat{Y} = v^{\mathrm{opt}} = \sum_{j=1}^{M}\frac{\langle u_j, Y\rangle}{1 + \rho_j}\, u_j$$
From this it's clear how the penalty structure dictates optimally shrinking the components of the projection of Y onto $\mathrm{span}\{u_j\}$.
It is further worth noting that for a given set of penalty coefficients, $\widehat{Y}$ can be represented as SY for
$$S = \sum_{j=1}^{M}d_ju_ju_j' = U\,\mathrm{diag}\left(\frac{1}{1 + \rho_1}, \ldots, \frac{1}{1 + \rho_M}\right)U'$$
In this context, it would be very natural to penalize the later $u_j$ more severely than the early ones.
5.4 Graph-Based Penalized Fitting/Smoothing (and Semi-Supervised Learning)
Another interesting smoothing methodology related to the material of the three previous sections concerns use of fitting penalties based on the graph Laplacians introduced in Section 2.4.3.22 Consider then N complete data cases $(x_1, y_1), \ldots, (x_N, y_N)$ and $M \geq 0$ additional data cases where only inputs $x_{N+1}, \ldots, x_{N+M}$ are available. There is no necessity here that M > 0, but it can be so in the event that predictions are desired at $x_{N+1}, \ldots, x_{N+M}$ whose values might not be in the training set. Where there are M > 0 genuine "unlabeled cases" whose inputs are assumed to come from the same mechanism as the inputs $x_1, \ldots, x_N$ and might be used to more or less "fill in" the relevant part of the input space not covered by the complete/labeled data cases, the terminology semi-supervised learning is sometimes used to describe the building of a predictor for y at all N + M input vectors. The case M = 1 might be used to simply make a single prediction at a single input not exactly seen in a "usual" training set of N complete data pairs.
Suppose that following the development of Section 2.4.3 one can make an adjacency matrix based on the N + M input vectors,
$$S = \left(s_{ij}\right)_{\substack{i=1,\ldots,N+M\\ j=1,\ldots,N+M}} = \begin{pmatrix}\underset{N\times N}{S_L} & \underset{N\times M}{S_{LU}}\\ \underset{M\times N}{S_{UL}} & \underset{M\times M}{S_U}\end{pmatrix}$$
Then with
$$\underset{(N+M)\times 1}{Y} = \begin{pmatrix}\underset{N\times 1}{Y_L}\\ \underset{M\times 1}{Y_U}\end{pmatrix}$$
22 The material here is adapted from "Graph-Based Semi-Supervised Learning with BIG Data" by Banerjee, Culp, Ryan, and Michailidis, that appeared in Research on Applied Cybernetics and System Science in 2017.
consider the optimization problem in $\Re^{N+M}$
$$\underset{v\,\in\,\Re^{N+M}}{\text{minimize}}\;(Y_L - v_L)'(Y_L - v_L) + \lambda\, v'Lv \qquad (88)$$
for some $\lambda > 0$ (or the same with $L^*$ replacing L in the quadratic penalty term). The developments (52) and (53) of Section 2.4.3 show that upon expanding v in terms of the N + M (orthonormal) eigenvectors of L (or $L^*$) it follows that components of v that are multiples of late eigenvectors (ones with small eigenvalues) contribute little to the quadratic penalty. This strongly suggests that solutions to the optimization problem (88) will provide smoothed prediction vectors $\widehat{Y}$ where entries with corresponding inputs with large adjacencies are similar.
Recent work of Culp and Ryan provides theory, methods, and software for solving the problem (88) and many nice generalizations of it (including consideration of losses other than SEL that produce methods for classification problems). For purposes of exposition here, we will provide the explicit solution that is available for the SEL problem. It turns out that the problem (88) and generalizations of it separate nicely into two parts. That is
$$\widehat{Y}_U^{\,\mathrm{opt}} = -L_U^{-1}L_{UL}\,\widehat{Y}_L^{\,\mathrm{opt}} \qquad (89)$$
(or the same with $L^*$s replacing Ls) where $\widehat{Y}_L^{\,\mathrm{opt}} = v_L$ solving
$$\underset{v_L\,\in\,\Re^N}{\text{minimize}}\;(Y_L - v_L)'(Y_L - v_L) + \lambda\, v_L'\widetilde{L}_Lv_L \qquad (90)$$
for $\widetilde{L}_L = L_L - L_{LU}L_U^{-1}L_{UL}$ (or, again, the same with $L^*$s replacing Ls). (Generalizations of the development here replace SSE in displays (88) and (90) with other losses, but the form (89) is unchanged.) But the problem (90) is familiar and its solution a simple consequence of vector calculus
$$\widehat{Y}_L^{\,\mathrm{opt}} = \left(I + \lambda\widetilde{L}_L\right)^{-1}Y_L$$
This is exactly parallel to the displays (80) and (81) and the discussion around them. $\left(I + \lambda\widetilde{L}_L\right)^{-1}$ (and its starred version) is a smoother/shrinker matrix. Further, the matrix $-L_U^{-1}L_{UL}$ in display (89) and its starred version are stochastic matrices and entries of $\widehat{Y}_U^{\,\mathrm{opt}}$ are averages of the elements of $\widehat{Y}_L^{\,\mathrm{opt}}$.
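A small numpy sketch (illustrative only) of the SEL solution just displayed: smooth the labeled responses with $\left(I + \lambda\widetilde{L}_L\right)^{-1}$ and then propagate to the unlabeled cases via display (89). The Gaussian adjacency is an arbitrary choice for the example.

```python
import numpy as np

def graph_semisupervised_fit(S, Y_L, lam):
    """S: (N+M)x(N+M) adjacency matrix with the N labeled cases first; Y_L: labeled outputs."""
    N = len(Y_L)
    G = np.diag(S.sum(axis=1))
    L = G - S                                      # unnormalized graph Laplacian
    L_L, L_LU = L[:N, :N], L[:N, N:]
    L_UL, L_U = L[N:, :N], L[N:, N:]
    # Schur complement penalty matrix for the labeled block
    L_tilde = L_L - L_LU @ np.linalg.solve(L_U, L_UL)
    Y_L_hat = np.linalg.solve(np.eye(N) + lam * L_tilde, Y_L)      # display (90) solution
    Y_U_hat = -np.linalg.solve(L_U, L_UL @ Y_L_hat)                # display (89)
    return Y_L_hat, Y_U_hat

rng = np.random.default_rng(8)
X = rng.uniform(size=(12, 2))                      # 8 labeled + 4 unlabeled inputs
S = np.exp(-10 * np.sum((X[:, None] - X[None, :])**2, axis=2))
np.fill_diagonal(S, 0.0)
Y_L = np.sin(3 * X[:8, 0]) + rng.normal(scale=0.1, size=8)
Y_L_hat, Y_U_hat = graph_semisupervised_fit(S, Y_L, lam=1.0)
```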
6 Kernel and Local Regression Smoothing Methods and SEL Prediction
The central idea of this material is that when finding $\hat{f}(x_0)$ one might weight points in the training set according to how close they are to $x_0$, do some kind of fitting around $x_0$, and ultimately read off the value of the fit at $x_0$.
This typically smooths training outputs $y_i$ in a more pleasing way than does a k-nearest neighbor average, but it has obvious problems at the ends of the interval [0,1] and at places in the interior of the interval where training data are dense to one side of $x_0$ and sparse to the other, if the target E[y|x = z] has non-zero derivative at $z = x_0$. For example, at $x_0 = 1$ only $x_i \leq 1$ get weight, and if E[y|x = z] is decreasing at $z = x_0 = 1$, $\hat{f}(1)$ will be positively biased. That is, with usual symmetric kernels, predictor (92) will fail to adequately follow an obvious trend at 0 or 1 (or at any point between where there is a sharp change in the density of input values in the training set).
23 This is again a potentially different usage of the word "kernel" than that in Section 1.4.3
Figure 24: Three standard choices of D(t): Epanechnikov quadratic kernel (blue), tricube (black), and standard normal density (red).
Now the weighted least squares problem (93) has an explicit solution. Let
$$\underset{N\times 2}{B} = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N\end{pmatrix}$$
and take
$$\underset{N\times N}{W(x_0)} = \mathrm{diag}\left(K(x_0, x_1), \ldots, K(x_0, x_N)\right)$$
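Using B and $W(x_0)$, the locally weighted least squares fit and its value at $x_0$ can be computed as in this brief sketch (an illustration under the usual weighted least squares formula, with a Gaussian kernel and bandwidth chosen arbitrarily).

```python
import numpy as np

def local_linear_fit(x0, x, y, bandwidth=0.2):
    """Local linear regression value at x0 for 1-d inputs x and outputs y."""
    B = np.column_stack([np.ones_like(x), x])                 # N x 2 design matrix
    w = np.exp(-0.5 * ((x - x0) / bandwidth)**2)              # kernel weights K(x0, x_i)
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)          # weighted least squares
    return coef[0] + coef[1] * x0                             # read off the fit at x0

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(size=100))
y = np.sin(6 * x) + rng.normal(scale=0.2, size=100)
f_hat = np.array([local_linear_fit(x0, x, y) for x0 in np.linspace(0, 1, 11)])
```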
original kernel values and the least squares fitting operation to produce a kind of "equivalent kernel" (for a Nadaraya-Watson type weighted average).
Recall that for smoothing splines, smoothed values are
$$\widehat{Y} = S_\lambda Y$$
and
$$df(\lambda) = \mathrm{tr}\left(S_\lambda\right)$$
For a kernel smoother with smoother matrix $L_\lambda$, smoothed values are
$$\widehat{Y} = L_\lambda Y$$
and define
$$df(\lambda) = \mathrm{tr}\left(L_\lambda\right)$$
HTF suggest that matching degrees of freedom for a smoothing spline and a kernel smoother produces very similar equivalent kernels, smoothers, and predictions.
There is a famous theorem of Silverman that adds technical credence to this notion. Roughly the theorem says that for large N, if in the case p = 1 the inputs $x_1, x_2, \ldots, x_N$ are iid with density p(x) on [a,b], $\lambda$ is neither too big nor too small,
$$D_S(u) = \frac{1}{2}\exp\left(-\frac{|u|}{\sqrt{2}}\right)\sin\left(\frac{|u|}{\sqrt{2}} + \frac{\pi}{4}\right)$$
$$\lambda(x) = \left(\frac{\lambda}{N\,p(x)}\right)^{1/4}$$
and
$$G_\lambda(z, x) = \frac{1}{\lambda(x)\,p(x)}\,D_S\!\left(\frac{z - x}{\lambda(x)}\right)$$
then for $x_i$ not too close to either a or b,
$$\left(S_\lambda\right)_{ij} \approx \frac{1}{N}\,G_\lambda\left(x_i, x_j\right)$$
(in some appropriate probabilistic sense) and the smoother matrix for cubic spline smoothing has entries like those that would come from an appropriate kernel smoothing.
6.2 Local Regression Smoothing in p Dimensions
A direct generalization of 1-dimensional local regression smoothing to p dimensions might go roughly as follows. For D as before, and $x \in \Re^p$, one might set
$$K_\lambda(x_0, x) = D\left(\frac{\|x - x_0\|}{\lambda}\right) \qquad (96)$$
and fit linear forms locally by choosing $\alpha(x_0) \in \Re$ and $\beta(x_0) \in \Re^p$ to solve the optimization problem
$$\underset{\alpha\ \text{and}\ \beta}{\text{minimize}}\;\sum_{i=1}^{N}K_\lambda(x_0, x_i)\left(y_i - \left(\alpha + \beta'x_i\right)\right)^2$$
and predicting as
$$\hat{f}(x_0) = \alpha(x_0) + \beta(x_0)'x_0$$
This seems typically to be done only after standardizing the coordinates of x and can be effective as long as N is not too small and p is not more than 2 or 3. However for p > 3, the curse of dimensionality comes into play and N points usually just aren't dense enough in p-space to make direct use of kernel smoothing effective. If the method is going to be successful in $\Re^p$ it will need to be applied under appropriate structure assumptions.
One way to apply additional structure to the p-dimensional kernel smoothing problem is to essentially reduce input variable dimension by replacing the kernel (96) with the "structured kernel"
$$K_{\lambda, A}(x_0, x) = D\left(\frac{\sqrt{(x - x_0)'A(x - x_0)}}{\lambda}\right)$$
This amounts to using not x and $\Re^p$ distance from x to $x_0$ to define weights, but rather $D^{1/2}V'x$ and $\Re^p$ distance from $D^{1/2}V'x$ to $D^{1/2}V'x_0$. In the event that some entries of D are 0 (or are nearly so), this basically reduces dimension from p to the number of large eigenvalues of A and defines weights in a space of that dimension (spanned by eigenvectors corresponding to non-zero eigenvalues) where the curse of dimensionality may not preclude effective use of kernel smoothing. The "trick" is, of course, identifying the right directions into which to project. (Searching for such directions is part of the Friedman "projection pursuit" ideas discussed below.)
7 High-Dimensional Use of Low-Dimensional Smoothers and SEL Prediction
There are several ways that have been suggested for making use of fairly low-dimensional (and thus, potentially effective) smoothing in large p problems. One of them is the "structured kernels" idea just discussed. Two more follow.
1. fitting via some appropriate (often linear) operation (e.g., spline or kernel smoothing)
$$g_l\left(x^l\right)\ \text{to "data"}\ \left(x_i^l,\, y_i^l\right)_{i=1,2,\ldots,N}$$
for
$$y_i^l = y_i - \left(\hat{\beta} + \sum_{m\neq l}\hat{g}_m\left(x_i^m\right)\right)$$
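A minimal backfitting sketch (an illustration of the partial-residual idea just described, using a simple kernel-averaging smoother rather than the spline or kernel fits of the notes):

```python
import numpy as np

def simple_smoother(x, r, bandwidth=0.15):
    """Kernel-weighted average of partial residuals r as a function of the single input x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth)**2)
    g = (w @ r) / w.sum(axis=1)
    return g - g.mean()                      # center each fitted component

def backfit(X, y, n_iter=20):
    """Fit an additive model y ~ beta + sum_l g_l(x_l) by cycling over coordinates."""
    N, p = X.shape
    beta_hat = y.mean()
    g = np.zeros((N, p))
    for _ in range(n_iter):
        for l in range(p):
            partial_resid = y - beta_hat - g.sum(axis=1) + g[:, l]   # the y_i^l of the text
            g[:, l] = simple_smoother(X[:, l], partial_resid)
    return beta_hat, g

rng = np.random.default_rng(10)
X = rng.uniform(size=(100, 3))
y = 1.0 + np.sin(4 * X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.2, size=100)
beta_hat, g = backfit(X, y)
```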
A more principled SEL fitting methodology for additive forms like that in display (98) (e.g. implemented by Wood in his mgcv R package) is the simultaneous fitting of $\beta$ and all the functions $g_l$ via penalized least squares. That is, using an appropriate set of basis functions for smooth functions of $x^l$ (often a tensor product basis in the event that the dimension of $x^l$ is more than 1) each $g_l$ might be represented as a linear combination of those basis functions. Then form (98) is in fact a constant plus a linear combination of basis functions. So upon adopting a quadratic penalty for the coefficients, one has a kind of ridge regression problem and explicit forms for all fitted coefficients and $\hat{\beta}$. The practical details of making the various bases and picking ridge parameters, etc. are not trivial, but the basic idea is clear.
The simplest version of this line of development, based on form (97), might be termed fitting of a "main effects model." But the approach might as well be applied to fit a "main effects and two factor interactions model," using some $g_l$'s that are functions of only one coordinate of x and others that depend upon only two coordinates of the input vector. One may mix types of predictors (continuous, categorical) and types of functions of them in the additive form to produce all sorts of interesting models (including semi-parametric ones and ones with low order interactions).
$$\sum_{i=1}^{N}K\left((x_{30}, x_{40}), (x_{3i}, x_{4i})\right)\left(y_i - \left(\alpha(x_{30}, x_{40}) + \beta_1(x_{30}, x_{40})x_{1i} + \beta_2(x_{30}, x_{40})x_{2i}\right)\right)^2$$
7.2 Projection Pursuit Regression
For $w_1, w_2, \ldots, w_M$ unit p-vectors of parameters, we might consider as predictors fitted versions of the form
$$f(x) = \sum_{m=1}^{M}g_m\left(w_m'x\right) \qquad (99)$$
of non-linear functions of linear combinations of ... non-linear functions of linear combinations of coordinates of x. Figure 25 is a network diagram representation of a toy single hidden layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs.24 The constants $x_0 = 1$ and $z_0 = 1$ allow for "biases" (i.e. constant terms) in the linear combinations (technically making them "affine" transformations rather than linear ones). The $\alpha$ and $\beta$ parameters are sometimes called "weights."
and then
$$y_1 = g_1\left(\beta_{01}1 + \beta_{11}z_1 + \beta_{21}z_2,\; \beta_{02}1 + \beta_{12}z_1 + \beta_{22}z_2\right)$$
$$y_2 = g_2\left(\beta_{01}1 + \beta_{11}z_1 + \beta_{21}z_2,\; \beta_{02}1 + \beta_{12}z_1 + \beta_{22}z_2\right)$$
or the (completely equivalent in this context26 ) hyperbolic tangent function
exp (u) exp ( u)
(u) = tanh (u) =
exp (u) + exp ( u)
These functions are di¤erentiable at u = 0, so that for small s the functions
of x entering the gs in a single hidden layer network are nearly linear. For large
s the functions are nearly step functions. In light of the latter, it is not
surprising that there are universal approximation theorems that guarantee that
any continuous function on a compact subset of <p can be approximated to any
degree of …delity with a single layer feed-forward neural net with enough nodes
in the hidden layer. This is both a blessing and a curse. It promises that
these forms are quite ‡exible It also promises that there must be both over-
…tting and identi…ability issues inherent in their use (the latter in addition to the
identi…ability issues already inherent in the symmetric nature of the functional
forms assumed for the predictors).
More recently, sigmoidal forms for the activation function have declined in
popularity. Instead, the hinge or positive part function
is often used. In common parlance, this makes the hidden nodes "recti…ed
linear units" (ReLUs). Note that this choice makes functions of x entering an
output layer piece-wise linear and continuous (not at all an unreasonable form).
(where it is understood that the $k$th probability, $g_k$, depends upon the input
$x$ through the neural net compositions of functions and the final use of the
softmax function).
²⁶This is because $\tanh(u)=2\left(\frac{1}{1+\exp(-2u)}\right)-1$.
8.3 Fitting Neural Networks
8.3.1 The Back-Propagation Algorithm
The most common fitting algorithm for neural networks is something called the
"back-propagation algorithm" or the "delta rule." It is simply a gradient descent
algorithm for the entire set of weights involved in making the outputs (in the
simple case illustrated in Figure 25, the $\alpha$s and $\beta$s). Rather than labor through
the nasty notational issues required to completely detail such an algorithm, we
will here only lay out the heart of what is needed.
For a training set of size $N$, loss $L\left(\hat{f}(x_i),y_i\right)$ incurred for input case $i$ when
the $K$ predictions $\hat{f}_k(x_i)$ are made (corresponding to the $K$ output nodes), and
a sum of such losses to be minimized, if one can find the partial derivatives of
the coordinates of $\hat{f}(x)$ with respect to the weights, the chain rule will give the
partials of the total loss and allow iterative search in the direction of a negative
gradient of the total loss. So we begin with a description of how to find partials
for $\hat{f}_k(x)$, a coordinate of the fitted output vector.
Consider a neural network with $H$ layers of hidden nodes indexed by $h =
1, 2, \ldots, H$ beginning with the layer immediately before the output layer and
proceeding (right to left in a diagram like Figure 25) to the one that is built
from linear combinations of the coordinates of $x$. We'll use the notation $m_h$
for the number of nodes in layer $h$, including a node representing the "bias"
input 1 (represented by $x_0 = 1$ and $z_0 = 1$ in Figure 25). For a real-valued
activation function of a single real variable $\sigma$, define a vector-valued function
$\sigma_m:\Re^m\rightarrow\Re^m$ by coordinate-wise application of $\sigma$. Then set
$$z_H' = \sigma_{m_H}\left((1,x')A_H\right)$$
and
$$z_h' = \sigma_{m_h}\left(\left(1,z_{h+1}'\right)A_h\right) \qquad (101)$$
Then for $A_0$ an $m_1\times K$ matrix of parameters and $g_k$ a function of $K$ real
variables, the $k$th coordinate of the output is
$$g_k\left(z_1'A_0\right) \qquad (102)$$
Further, since using form (102) and the $h=1$ version of form (101) the $k$th
coordinate of the prediction is
$$g_k\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)$$
writing $a_{1ij}$ for the $(i,j)$ entry of $A_1$, the chain rule implies that (with $A_{0l}$ the
$l$th column of $A_0$ and $g_k^{(l)}$ the partial derivative of $g_k$ with respect to its $l$th
argument) the partial derivative of the $k$th coordinate of the prediction with
respect to $a_{1ij}$ is
$$\sum_{l=1}^{K} g_k^{(l)}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)\frac{\partial}{\partial a_{1ij}}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)_l$$
$$=\sum_{l=1}^{K} g_k^{(l)}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)\frac{\partial}{\partial a_{1ij}}\,\sigma\left(\left(1,z_2'\right)A_1\right)A_{0l}$$
$$=\sum_{l=1}^{K} g_k^{(l)}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)\sum_{r=1}^{m_1} a_{0rl}\frac{\partial}{\partial a_{1ij}}\,\sigma\left(\left(\left(1,z_2'\right)A_1\right)_r\right)$$
$$=\sum_{l=1}^{K} g_k^{(l)}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)a_{0jl}\frac{\partial}{\partial a_{1ij}}\,\sigma\left(\left(1,z_2'\right)A_{1j}\right)$$
$$=\sum_{l=1}^{K} g_k^{(l)}\left(\sigma\left(\left(1,z_2'\right)A_1\right)A_0\right)\sigma'\left(\left(1,z_2'\right)A_{1j}\right)a_{0jl}\,z_{2i}$$
(where $A_{1j}$ is the $j$th column of $A_1$ and $z_{2i}$ is the $i$th entry of $\left(1,z_2'\right)$). In
general, the $k$th coordinate of the prediction is
$$g_k\left(\sigma\left(\left(1,\sigma\left(\left(1,\cdots\sigma\left(\left(1,\sigma\left((1,x')A_H\right)\right)A_{H-1}\right)\cdots\right)A_2\right)\right)A_1\right)A_0\right)$$
made by successive compositions using the activation function and linear combinations
with coefficients in the matrices $A_h$, from which partials $\frac{\partial\hat{y}_k}{\partial a_{hij}}$ are
obtainable in the style above, by repeatedly using the chain rule. No doubt
some appropriate use of vector calculus and corresponding notation could improve
the looks of these expressions and recursions can be developed, but what
is needed should be clear. Further, in many contexts numerical approximation
of these partials may be the most direct and efficient means of obtaining them.
Then for loss $L\left(\hat{f},y\right)$ let
$$L_k\left(\hat{f},y\right)=\frac{\partial}{\partial\hat{f}_k}L\left(\hat{f},y\right)$$
For $a$ an element of one of the $A_h$ matrices, the partial derivative of the contribution
of case $i$ to a total loss with respect to it is
$$\sum_{k=1}^{K} L_k\left(\hat{f}(x_i),y_i\right)\frac{\partial}{\partial a}\,g_k\left(z_1'(x_i)A_0\right)$$
(for $z_1(x_i)$ the set of values from the final hidden nodes and partials found as
above) and the partial derivative of the total loss with respect to it is
$$D(a)=\sum_{i=1}^{N}\sum_{k=1}^{K} L_k\left(\hat{f}(x_i),y_i\right)\frac{\partial}{\partial a}\,g_k\left(z_1'(x_i)A_0\right)$$
The gradient of the total loss as a function of the matrices of weights then has
entries $D(a)$ and an iterative search to optimize total loss with a current set of
iterates $a^{\mathrm{current}}$ can produce new iterates
$$a^{\mathrm{new}}=a^{\mathrm{current}}-\epsilon D\left(a^{\mathrm{current}}\right) \qquad (103)$$
for some "learning rate" $\epsilon>0$.
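As a concrete (and hedged) illustration of update (103), the R sketch below does plain gradient descent with numerically approximated partials, in the spirit of the earlier remark that numerical approximation is often the most direct means of obtaining them; total_loss is assumed to be a user-supplied function of a vector holding all of the weights.

```r
# Sketch: generic gradient descent for a vector a of weights via update (103),
# using central-difference approximations to the partial derivatives D(a).
numeric_grad <- function(f, a, h = 1e-6) {
  sapply(seq_along(a), function(j) {
    ap <- a; am <- a
    ap[j] <- ap[j] + h; am[j] <- am[j] - h
    (f(ap) - f(am)) / (2 * h)
  })
}

gradient_descent <- function(total_loss, a0, eps = 0.01, iters = 200) {
  a <- a0
  for (it in 1:iters) {
    a <- a - eps * numeric_grad(total_loss, a)   # a_new = a_current - eps * D(a_current)
  }
  a
}
```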
Of course, in SEL/univariate regression contexts, it is common to have $K=1$
and take $L\left(\hat{f},y\right)=\left(\hat{f}-y\right)^2$. In $K$-class classification models, it seems most
common to use a $K$-dimensional output $\hat{g}=\left(g_1\left(z_1'A_0\right),g_2\left(z_1'A_0\right),\ldots,g_K\left(z_1'A_0\right)\right)$
with the "softmax" $g_k$ as defined in display (100) and to employ the cross-entropy
loss
$$L(\hat{g},y)=-\sum_{k=1}^{K} I[y=k]\ln g_k(x)$$
There are various possibilities for regularization of the ill-posed fitting problem
for neural nets, ranging from the fairly formal and rational to the very
informal and ad hoc. One possibility is to employ "stochastic gradient descent"
and newly choose a random subset of the training set for use at each
iteration of fitting. (It is popular to even go so far in this regard as to employ
only a single case at each iteration.) Another common approach is to simply
use an iterative fitting algorithm and "stop it before it converges." We proceed
to briefly discuss more formal regularization.
8.3.2 Formal Regularization of Fitting
Suppose that the various coordinates of the input vectors in the training set
have been standardized and one wants to regularize the fitting of a neural net.
One possible way of proceeding is to define a penalty function like
$$J(A)=\sum_{h=0}^{H}\sum_{i,j} a_{hij}^2 \qquad (104)$$
for $A$ standing for the entire set of weights in $A_0,A_1,\ldots,A_H$ (it is not absolutely
clear whether one really wants to include the weights on the "bias"
terms in the neural net sums in (104)) and seek not to partially optimize the
total training set loss $\sum_{i=1}^{N} L\left(\hat{f}_A(x_i),y_i\right)$ but rather to fully optimize
$$\sum_{i=1}^{N} L\left(\hat{f}_A(x_i),y_i\right)+\lambda J(A) \qquad (105)$$
In an SEL context there is a Bayesian way of thinking about this kind of regularization. Supposing that
$$y_i=f(x_i|A)+\epsilon_i$$
for the $\epsilon_i$ iid N$\left(0,\sigma^2\right)$, a likelihood is simply
$$l\left(A,\sigma^2\right)=\prod_{i=1}^{N} h\left(y_i|f(x_i|A),\sigma^2\right)$$
for $h\left(\cdot|\mu,\sigma^2\right)$ the normal pdf. If then $g\left(A,\sigma^2\right)$ specifies a prior distribution
for $A$ and $\sigma^2$, a posterior for $\left(A,\sigma^2\right)$ has density proportional to
$$l\left(A,\sigma^2\right)g\left(A,\sigma^2\right)$$
For example, one might well assume that a priori the $a$s are iid N$\left(0,\tau^2\right)$ (where
small $\tau^2$ will provide regularization and it is again unclear whether one wants
to include the $a$s corresponding to bias terms in such an assumption or to
instead provide more diffuse priors for them, like improper "Uniform$(-\infty,\infty)$"
or at least large variance normal ones). A standard improper prior for $\sigma^2$
is $\ln\sigma^2\sim$ Uniform$(-\infty,\infty)$. In any case, whether improper or proper, abuse
notation and write $g\left(\sigma^2\right)$ for a prior density for $\sigma^2$.
Then with independent mean 0 variance $\tau^2$ priors for all the weights (except
possibly the ones for bias terms that might be given Uniform$(-\infty,\infty)$ priors)
one has
$$\ln\left(l\left(A,\sigma^2\right)g\left(A,\sigma^2\right)\right)\propto -NK\ln(\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i-f(x_i|A)\right)^2-\frac{1}{2\tau^2}J(A)+\ln g\left(\sigma^2\right)$$
$$=-NK\ln(\sigma)+\ln g\left(\sigma^2\right)-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{N}\left(y_i-f(x_i|A)\right)^2+\frac{\sigma^2}{\tau^2}J(A)\right) \qquad (106)$$
(flat improper priors for the bias weights correspond to the absence of terms
for them in the sums for $J(A)$ in form (104)). This recalls display (105) and
suggests that a $\lambda$ appropriate for regularization can be thought of as a variance
ratio of "observation variance" and prior variance for the weights.
It's fairly clear how to define Metropolis-Hastings-within-Gibbs algorithms
for sampling from $l\left(A,\sigma^2\right)g\left(A,\sigma^2\right)$. But it seems that typically the high dimensionality
of the parameter space combined with the symmetry-derived multi-modality
of the posterior will prevent one from running an MCMC algorithm
long enough to fully detail the posterior. It also seems unlikely, however, that
detailing the posterior is really necessary or even desirable. Rather, one might
simply run the MCMC algorithm, monitoring the values of $l\left(A,\sigma^2\right)g\left(A,\sigma^2\right)$
corresponding to the successively randomly generated MCMC iterates. An
MCMC algorithm will spend much of its time where the corresponding posterior
density is large and we can expect that a long MCMC run will identify
a nearly modal value for the posterior. Rather than averaging neural nets
according to the posterior, one might instead use as a predictor a neural net
corresponding to a parameter vector (at least locally) maximizing the posterior.
Notice that one might even take the parameter vector in an MCMC run
with the largest $l\left(A,\sigma^2\right)g\left(A,\sigma^2\right)$ value and for a grid of $\sigma^2/\tau^2$ values around
the empirical maximizer use the back-propagation algorithm modified to fully
optimize
$$\sum_{i=1}^{N}\left(y_i-f(x_i|A)\right)^2+\frac{\sigma^2}{\tau^2}J(A)$$
over choices of $A$. This, in turn, could be used with relationship (106) to perhaps
improve somewhat the result of the MCMC "search."
Mathematically, a grey-scale image is typically represented by an $L\times M$
matrix $X=[x_{lm}]$ where each $x_{lm}\in\{0,1,2,\ldots,254,255\}$ represents a brightness
at location $(l,m)$. A color image is often represented by 3 matrices
$X^r=[x_{lm}^r]$, $X^g=[x_{lm}^g]$, and $X^b=[x_{lm}^b]$ (again all $L\times M$ with integer entries
in $\{0,1,2,\ldots,254,255\}$) representing intensities in red, green, and blue
"channels." The standard machine learning problem is to (based on a training
set of $N$ images $X_i$ or $\left(X^r,X^g,X^b\right)_i$ with corresponding class identities
$y_i\in\{1,2,\ldots,K\}$) produce a classifier. (For example, a standard test problem
is "automatic" recognition of hand-written digits 0 through 9.)
Simple convolutional neural networks with $H$ hidden layers and a softmax
output layer producing class probabilities are successive compositions of more
or less natural linear and non-linear operations that might be represented as
follows. For $\psi_H$ operating on $X$ using some set of real number parameters
$A_H$ to produce some multivariate output (we will describe below some kinds
of things that are popularly used) a "deepest layer" of the convolutional neural
net produces
$$Z_H=\psi_H\left(X;A_H\right) \qquad (107)$$
Then applying another set of operations $\psi_{H-1}$ to the result (107) using some
set of parameters $A_{H-1}$, the next layer of values in the convolutional neural net
is produced as
$$Z_{H-1}=\psi_{H-1}\left(Z_H;A_{H-1}\right)$$
and so on, with
$$Z_h=\psi_h\left(Z_{h+1};A_h\right) \qquad (108)$$
(more elaborate architectures allow a layer to use earlier layers or the original
image as well, via forms like $Z_h=\psi_h\left(Z_{h+1},Z_{h+j};A_h\right)$ or $Z_h=\psi_h\left(Z_{h+1},X;A_h\right)$).
Most of what we have said thus far in this section is not really special to
the problem of image classification (and could serve as a high-level introduction
to general neural net predictors). What sets the "convolutional" neural network
field apart from "generic" neural network practice is the image-processing-inspired
forms employed in the functions $\psi_h$. The most fundamental form is
one that applies "linear filters" to images followed by some nonlinear operation.
This creates what is commonly called a "convolution" layer.
To make the idea of a convolutional layer precise, consider the following.
Let $F$ be an $R\times C$ matrix. Typically this matrix is much smaller than the
image and square (at least when "horizontal" and "vertical" resolutions in the
images are the same), and $R$ and $C$ are often odd. One can then make from
$F$ and $X$ a new matrix $F*X$ of dimension $(L-R+1)\times(M-C+1)$ with
entries
$$(F*X)_{ij}=\sum_{a=1}^{R}\sum_{b=1}^{C} f_{ab}\,x_{(i+a-1),(j+b-1)} \qquad (109)$$
Figure 26: Illustration of the use of the $3\times 3$ filter matrix $F$ with $L\times M$ image
matrix $X$ to produce the $(L-2)\times(M-2)$ matrix $F*X$.
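A minimal R sketch (mine, not from the notes) of the "valid" convolution (109) follows; conv_valid is an illustrative name, and the example applies the vertical Sobel filter discussed below, followed by the hinge non-linearity.

```r
# Sketch of the valid convolution (109): filt is R x C, X is L x M, and the
# result is (L - R + 1) x (M - C + 1).
conv_valid <- function(filt, X) {
  R <- nrow(filt); C <- ncol(filt)
  out <- matrix(0, nrow(X) - R + 1, ncol(X) - C + 1)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      out[i, j] <- sum(filt * X[i:(i + R - 1), j:(j + C - 1)])
    }
  }
  out
}

# Example: vertical Sobel filter (columns (-1,-2,-1), (0,0,0), (1,2,1)) on a toy image.
S_vert <- matrix(c(-1, -2, -1, 0, 0, 0, 1, 2, 1), nrow = 3)
X <- matrix(runif(20 * 30), 20, 30)
edges <- pmax(conv_valid(S_vert, X), 0)   # convolution followed by the hinge non-linearity
```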
This convolution operation is linear and it is typical practice to introduce
non-linearity by following convolution operations in a layer with the hinge function
$\max(u,0)$ applied to each element $u$ of the resulting matrix. Sometimes
people (apparently wishing to not lose rows and columns in the convolution
process) "0 pad" an image with extra rows and columns of 0s before doing the
convolution–a practice that strikes this author as lacking sound rationale.
Multiple convolutions are typically created in a single convolution layer.
Sometimes the filter matrices are filled with parameters to be determined in
fitting (i.e. are part of $A_h$ in the representation (108)). But they can also be
fixed matrices created for specific purposes. For example the $3\times 3$ matrices
$$S^{\mathrm{vert}}=\begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1\end{bmatrix} \quad\text{and}\quad S^{\mathrm{horiz}}=\begin{bmatrix} -1 & -2 & -1\\ 0 & 0 & 0\\ 1 & 2 & 1\end{bmatrix}$$
are respectively the vertical and horizontal Sobel filter matrices, commonly used
in image processing when searching for edges of objects or regions. And various
"blurring" filters (ordinary arithmetic averaging across a square of pixels and
weighted averaging done according to values of a Gaussian density set at the
center of an integer grid) are common devices meant to suppress noise in an
image.
As multiple layers each with multiple new convolutions are created, there is
potential explosion of the total dimensionality of the sets of $Z_h$ and $A_h$. Two
devices for controlling that explosion are the notions of sampling and pooling
to reduce the size of a $Z$. First, instead of creating and subsequently using an
entire filtered image $F*X$, one can use only every $s$th row and column. In
such a "sampling" operation $s$ is colloquially known as the "stride." Roughly
speaking, this reduces the size of a $Z$ by a factor of $s^2$. Another possibility is
to choose some block size, of size say $s\times t$, and divide an $L\times M$ image into
roughly
$$\frac{L}{s}\times\frac{M}{t}$$
non-overlapping blocks, within a block applying a "pooling" rule like "simple
averaging" or "maximum value." One then uses the rectangular array of these
pooled values as a layer output. This, of course, reduces the size of a $Z$ by a
factor of roughly $st$. It seems common to apply one of these ideas after each one
or few convolution layers in a network, and especially before reaching the top
and final one or few layers. The final hidden layers of a convolutional neural net
are of the "ordinary" type described earlier and if the dimensionality of their
inputs are too large, numerical and fitting problems will typically ensue.
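The R sketch below (names mine) illustrates the $s\times t$ max-pooling operation just described.

```r
# Sketch of s x t "max pooling": divide an L x M matrix into non-overlapping
# s x t blocks and keep the maximum in each block.
max_pool <- function(Z, s, t) {
  out <- matrix(NA, floor(nrow(Z) / s), floor(ncol(Z) / t))
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      rows <- ((i - 1) * s + 1):(i * s)
      cols <- ((j - 1) * t + 1):(j * t)
      out[i, j] <- max(Z[rows, cols])    # or mean(...) for "simple averaging" pooling
    }
  }
  out
}

Z <- matrix(rnorm(24 * 36), 24, 36)
max_pool(Z, 2, 2)   # reduces the size of Z by a factor of roughly s * t = 4
```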
periods to be of help in predicting response at the current one. To give some
sense of what can be done, consider a generalization of the toy single hidden
layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs used
in Section 8.1. Where input/output pairs $(x,y)$ with $x\in\Re^3$ and $y\in\Re^2$ are
indexed by (time) integer $t$, the notion of recurrent neural network practice is
to allow values $z_{1t}$ and $z_{2t}$ at the hidden nodes to depend not only upon $x_t$ but
also upon $z_{1,t-1}$ and $z_{2,t-1}$ and/or $y_{t-1}$.
A so-called Elman Network replaces the basic expressions for moving from
input to hidden layer in Section 8.1 with
where each basis element has prototype parameter $\xi_j$ and scale parameter $\lambda_j$. A
common choice of $D$ for this purpose is the standard normal pdf.
A version of this with fewer parameters is obtained by restricting to cases
where $\lambda_1=\lambda_2=\cdots=\lambda_M=\lambda$. This restriction, however, has the potentially
unattractive effect of forcing "holes" or regions of $\Re^p$ where (in each) $f(x)\approx 0$,
including all "large" $x$. A way to replace this behavior with potentially differing
values in the former "holes" and directions of "large" $x$ is to replace the basis
functions
$$K_\lambda\left(x,\xi_j\right)=D\left(\frac{\left\|x-\xi_j\right\|}{\lambda}\right)$$
with (renormalized) versions $h_j(x)$, producing a form
$$f(x)=\beta_0+\sum_{j=1}^{M}\beta_j h_j(x) \qquad (111)$$
$$R=\left\{x\in\Re^p\,|\,a_1<x_1<b_1 \text{ and } a_2<x_2<b_2 \ldots \text{ and } a_p<x_p<b_p\right\}$$
for (possibly infinite) values $a_j<b_j$ for $j=1,2,\ldots,p$. The basic idea is that
if the values $a_j$ and $b_j$ can be chosen so that $y$s corresponding to vectors of
inputs $x$ in a training set in a particular rectangle are "homogeneous," then
a corresponding SEL predictor using training set "rectangle mean responses"
or a 0-1 loss classifier using training set "rectangle majority classes" might be
approximately optimal.²⁷
The search for good predictors constant on rectangles is fundamentally an
algorithmic matter, rather than something that will have a nice closed form
representation (it is not like ridge regression for example). But (provided
"fast" and "effective" algorithms can be identified) it has things that make it
very attractive. For one thing, there is complete invariance to monotone
transformation of numerical features. It is irrelevant to searches for good
²⁷This is essentially the same motivation provided for nearest neighbor rules in Section
1.3.3.
boundaries for rectangles whether a coordinate of the input x is expressed on an
"original" scale or a log scale or on another (monotone transform of the original
scale). The same predictor/predictions will result. This is a very attractive
and powerful feature and is no doubt partly responsible for the popularity of
rectangle-based predictors as building blocks for more complicated methods (like
"boosting trees").
The structure of predictors constant on rectangles is also an intuitively ap-
pealing one, easily explained and understood. This helps make them very
popular with non-technical consumers of predictive analytics.
In this section we consider two rectangle-based prediction methods, the …rst
(CART) using binary tree structures and the second (PRIM) employing a kind
of "bump-hunting" logic.
and look for an index $j_1$ and a value $a_{j_1}<s_1<b_{j_1}$ (with $s_1\ne x_{ij_1}$ for any $i$) so
that splitting the initial rectangle at $x_{j_1}=s_1$ (to produce the two sub-rectangles
$R\cap\{x\in\Re^p\,|\,x_{j_1}\le s_1\}$ and $R\cap\{x\in\Re^p\,|\,x_{j_1}>s_1\}$) makes the resulting two
rectangles minimize
$$SSE=\sum_{\mathrm{rectangles}}\;\sum_{i \text{ with } x_i \text{ in the rectangle}}\left(y_i-\bar{y}_{\mathrm{rectangle}}\right)^2$$
One then splits (optimally) one of the (now) two rectangles on some variable
$x_{j_2}$ at some $s_2$ (with $s_2\ne x_{ij_2}$ for any $i$), etc.
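A hedged R sketch of one greedy split search of the kind just described follows, assuming a numeric input matrix X and response vector y; names are mine.

```r
# Sketch: for each coordinate j and each allowable cut s (strictly between
# observed values), compute the two-rectangle SSE and keep the best split.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in 1:ncol(X)) {
    cuts <- sort(unique(X[, j]))
    if (length(cuts) < 2) next
    mids <- (cuts[-1] + cuts[-length(cuts)]) / 2   # candidate s with s != x_ij for any i
    for (s in mids) {
      left <- X[, j] <= s
      sse <- sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best
}
```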
Where $l$ rectangles in $\Re^p$ (say $R_1,R_2,\ldots,R_l$) have been created, and (for an
input $x$) $m(x)$ identifies the rectangle to which $x$ belongs,
the corresponding SEL tree predictor is
$$\hat{f}_l(x)=\frac{1}{\#\left\{\text{training input vectors } x_i \text{ in } R_{m(x)}\right\}}\sum_{i \text{ with } x_i \text{ in } R_{m(x)}} y_i$$
Figure 28: A third representation of the hypothetical p = 2 tree predictor
portrayed in Figure 27.
For a big training sample iid from this joint distribution, all branching will typi-
cally be done on the continuous variable x3 , completely missing the fundamental
fact that it is the (joint) behavior of (x1 ; x2 ) that drives the size of y. (This
example also supports the conventional wisdom that as presented the splitting
algorithm "favors" splitting on continuous variables over splitting on values of
discrete ones.)
the class that is most heavily represented in the rectangle to which $x$ belongs.²⁸
The empirical misclassification rate for this predictor (that can be used as a
rectangle-splitting criterion) is
$$\overline{\mathrm{err}}=\frac{1}{N}\sum_{i=1}^{N} I\left[y_i\ne\hat{f}_l(x_i)\right]=\frac{1}{N}\sum_{m=1}^{l} N_m\left(1-\hat{p}_{mk(m)}\right)$$
These latter two criteria are average (across rectangles) measures of "purity"
(near degeneracy) of training set response distributions in the rectangles. Upon
adopting one of these forms to replace SSE in the regression tree discussion,
one has a classification tree methodology. HTF suggest using the Gini index
or cross entropy for tree growing and any of the indices (but most typically the
empirical misclassification rate) for tree pruning according to cost-complexity
(to be discussed next).
contexts be more useful to have the $\hat{p}_{mk}$ values themselves than to have only the 0-1 loss
classifier derived from them.
(the total training error for the tree predictor based on $T$). For $\alpha>0$ define
the quantity
$$C_\alpha(T)=\alpha\left|T\right|+E(T)$$
(for, in the obvious way, $|T|$ the number of final nodes in the candidate tree).²⁹
Write
$$T(\alpha)=\arg\min_{\text{subtrees } T} C_\alpha(T)$$
Figure 29: Cartoon of functions of $\alpha$: $C_\alpha(T)$ for fixed $T$, and the optimized
version $C_\alpha(T(\alpha))$.
$$C_\alpha^k(T)=\alpha|T|+E^k(T)$$
(for $E^k(T)$ the error total for the corresponding tree predictor). Write
$$T^k(\alpha)=\arg\min_{\text{subtrees } T} C_\alpha^k(T)$$
Then (as in Section 1.3.6), letting $k(i)$ be the index of the fold $T_k$ containing
training case $i$, one computes the cross-validation error
$$CV(\alpha)=\frac{1}{N}\sum_{i=1}^{N} L\left(\hat{f}^{k(i)}(x_i),y_i\right)$$
9.1.4 Measuring the Importance of Inputs for Tree Predictors
Consider the matter of assigning measures of "importance" of input variables
for a tree predictor. In the spirit of ordinary linear models assessment of the
importance of a predictor in terms of some reduction it provides in some error
sum of squares, Breiman suggested the following. Suppose that in a regression
or classification tree, input variable $x_j$ provides the rectangle splitting criterion
for nodes $\mathrm{node}_{1j},\ldots,\mathrm{node}_{m(j)j}$ and that before splitting at $\mathrm{node}_{lj}$, the relevant
rectangle $R_{lj}$ has (for $\hat{y}_{lj}$ the prediction fit for that rectangle) associated sum
of training losses
$$E_{lj}=\sum_{i \text{ with } x_i\in R_{lj}} L\left(\hat{y}_{lj},y_i\right)$$
and that after splitting $R_{lj}$ on variable $x_j$ to create rectangles $R_{lj}^1$ and $R_{lj}^2$
(with respective fitted predictions $\hat{y}_{lj}^1$ and $\hat{y}_{lj}^2$) one has sums of training losses
associated with those two rectangles
$$E_{lj}^1=\sum_{i \text{ with } x_i\in R_{lj}^1} L\left(\hat{y}_{lj}^1,y_i\right) \quad\text{and}\quad E_{lj}^2=\sum_{i \text{ with } x_i\in R_{lj}^2} L\left(\hat{y}_{lj}^2,y_i\right)$$
One might then take
$$I_j=\sum_{l=1}^{m(j)}\left(E_{lj}-\left(E_{lj}^1+E_{lj}^2\right)\right)$$
(the total reduction in training loss provided by the splits on $x_j$) as
a measure of the importance of $x_j$ in fitting the tree and compare the various
$I_j$s (or perhaps the square roots, $\sqrt{I_j}$s).
Further, if a predictor is a (weighted) sum of regression trees (e.g. produced
by "boosting" or in a "random forest") and $I_{jm}$ measures the importance of $x_j$
in the $m$th tree, then
$$I_{j\cdot}=\frac{1}{M}\sum_{m=1}^{M} I_{jm}$$
is perhaps one measure of the importance of $x_j$ in the overall predictor. One
can then compare the various $I_{j\cdot}$ (or square roots) as a means of comparing the
importance of the input variables.
"predictor development" or perhaps "conjunctive rule development" from the
context of "market basket analysis." See Section 17.1 in regard to this latter
usage.
PRIM can be thought of as a type of "bump-hunting." For a series of
rectangles (or boxes) in p-space
R1 ; R2 ; : : : ; Rl
1. identify a rectangle
$$l_1\le x_1\le u_1,\quad l_2\le x_2\le u_2,\quad\ldots,\quad l_p\le x_p\le u_p$$
(the subsequent steps shrink and expand this rectangle, guided by the rectangle mean response $\bar{y}_{\mathrm{rectangle}}$)
This produces R1 . For what it is worth, step 2. is called "peeling" and step 4.
is called "pasting."
Upon producing R1 , one removes from consideration all training vectors
with xi 2 R1 and repeats 1. through 5. to produce R2 . This continues until a
desired number of rectangles has been created. One may pick an appropriate
number of rectangles (l is a complexity parameter) by cross-validation and then
apply the procedure to the whole training set to produce a set of rectangles and
predictor on p-space that is piece-wise constant on regions built from boolean
operations on rectangles.
PRIM is not anywhere near as common as classi…cation and regression trees,
but shares with them some of their attractive features, especially invariance to
monotone transformation of coordinates of an input vector.
For $b=1,2,\ldots,B$ bootstrap samples $T^b$ from the training set, let $\hat{f}^b$ be the corresponding
predictor based on $T^b$.
Rather than using these to estimate the prediction error as in Section 16.4,
consider using them to build a predictor.
The possibility considered in Section 8.7 of HTF is the use of bootstrap
aggregation, or "bagging" under SEL. This is use of the predictor
$$\hat{f}_{\mathrm{bag}}(x)\equiv\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(x)$$
Notice that even for fixed training set $T$ and input $x$, this is random (varying
with the selection of the bootstrap samples). One might let $E^*$ denote averaging
over the creation of a single bootstrap sample and $\hat{f}^*$ be the predictor derived
from such a bootstrap sample and think of
$$E^*\hat{f}^*(x)$$
as the "true" bagging predictor under SEL (that has the simulation-based approximation
$\hat{f}_{\mathrm{bag}}(x)$). One is counting on a law of large numbers to conclude
that $\hat{f}_{\mathrm{bag}}(x)\rightarrow E^*\hat{f}^*(x)$ as $B\rightarrow\infty$. Note too, that unless the operations
applied to a training set to produce $\hat{f}$ are linear, $E^*\hat{f}^*(x)$ will differ from the
predictor computed from the training data, $\hat{f}(x)$. The primary motivation
for SEL bagging is the hope of averaging (not-perfectly-correlated as they are
built on not-completely-overlapping bootstrap samples) low-bias/high-variance
predictors to reduce variance (while maintaining low bias).
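As an illustration only, the following R sketch bags B regression trees under SEL, assuming availability of the rpart package as a convenient tree fitter and a data frame with response column named y; any low-bias/high-variance base predictor could be substituted.

```r
# Sketch of SEL bagging: fit a tree to each bootstrap sample and average predictions.
library(rpart)

bag_trees <- function(dat, B = 200) {
  lapply(1:B, function(b) {
    boot <- dat[sample(nrow(dat), nrow(dat), replace = TRUE), ]  # bootstrap sample T^b
    rpart(y ~ ., data = boot)                                    # predictor built on T^b
  })
}

predict_bag <- function(fits, newdata) {
  preds <- sapply(fits, predict, newdata = newdata)  # matrix of per-tree predictions
  rowMeans(preds)                                    # f_bag(x): average over the B fits
}
```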
A bagged predictor in the 0-1 loss classification case is
$$\hat{f}_{\mathrm{bag}}(x)=\arg\max_k\sum_{b=1}^{B} I\left[\hat{f}^b(x)=k\right]$$
Since
$$\frac{1}{B}\sum_{b=1}^{B} I\left[\hat{f}^b(x)=k\right]\rightarrow P^*\left[\hat{f}^*(x)=k\right] \text{ as } B\rightarrow\infty$$
one then expects the convergence of OOB($B$), and plotting of OOB($B$) versus
$B$ is a standard way of trying to assess whether enough bootstrap samples have
³⁰The probability that a particular training case is missed in a bootstrap sample is
$\left(1-N^{-1}\right)^N\approx e^{-1}\approx .37$ for $N$ of any reasonable size.
been made to adequately represent the limiting predictor. In spite of the fact
that for small $B$ the (random) predictor $\hat{f}_B$ is built on a small number of sampled
trees and is fairly simple, $B$ is not really a complexity parameter, but is rather
a convergence parameter.
Where losses other than SEL or 0-1 loss are involved, exactly how to "bag"
bootstrapped versions of a predictor is not altogether obvious, and apparently
even what might look like sensible possibilities can do poorly.
(Note that no pruning is applied in this development.) Then let $\hat{f}^b(x)$ be the
corresponding tree-based predictor (taking values in $\Re$ in the regression case or
in $G=\{1,2,\ldots,K\}$ in the classification case). A random forest predictor in
the regression case is then
$$\hat{f}_B(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(x)$$
For the classification case, conditional class probabilities can also be estimated. With (for tree $b$)
$$I^b(x)=\text{the set of indices of training cases with } x_i \text{ in the same rectangle as } x$$
then
$$\hat{p}_k^b(x)=\frac{\sum_{x_i\in I^b(x)} I[y_i=k]}{\#\left[x_i\in I^b(x)\right]}$$
(the fraction of training cases with $x_i$ in the same rectangle as $x$ and $y_i=k$)
estimates this probability using tree $b$. Then one random forest estimate of
$P[y=k|x]$ is the simple average
$$\frac{1}{B}\sum_{b=1}^{B}\hat{p}_k^b(x)$$
The basic tuning parameters in the development of $\hat{f}_B(x)$ are then $m$, and
$n_{\min}$, and (if used) a maximum tree depth. Standard default values of parameters
are
random variable. Only $\hat{f}_{\mathrm{rf}}(x)$ is fixed.) The fact that the out-of-bag error will
increase if optimal allowable tree complexity (encoded in $n_{\min}$ and tree depth)
and/or optimal $m$ are exceeded means that a random forest $\hat{f}_{\mathrm{rf}}(x)$ can indeed
overfit (be too complex for the real information content of the training set).
There is also a fair amount of confusing discussion in the literature about
the role of the random selection of the m predictors to use at each node-splitting
(and the choice of m) in reducing "correlation between trees in the forest." The
Breiman/Cutler web site http://www.stat.berkeley.edu/~breiman/Random
Forests/cc_home.htm says that the "forest error rate" (presumably the error
rate for f^rf ) depends upon "the correlation between any two trees in the forest"
and the "strength of each tree in the forest." The meaning of "correlation" and
"strength" is not clear if anything technical/precise is intended. One possibility
for the …rst is some version of correlation between values of f^ 1 (x) and f^ 2 (x)
as one repeatedly selects the whole training set T in iid fashion from P and then
makes two bootstrap samples— Section 15.4 of HTF seems to use this meaning.32
A meaning of the second is presumably some measure of average e¤ectiveness of
a single f^ b . HTF Section 15.4 goes on to suggest that increasing m increases
both "correlation" and "strength" of the trees, the …rst degrading error rate
and the second improving it, and that the OOB estimate of error can be used
to guide choice of m (usually in a broad range of values that are about equally
attractive) if something besides the default is to be used.
Then in the OOB sample randomly permute the values of the $j$th coordinate of
the input vectors, producing, say, input vectors $\tilde{x}_i^j$. One can then define
$$\widetilde{\mathrm{err}}_b^j=\frac{1}{\#\{i \text{ s.t. case } i \text{ is not in the bootstrap sample } b\}}\sum_{i \text{ s.t. case } i \text{ is not in the bootstrap sample } b} L\left(\hat{f}^b\left(\tilde{x}_i^j\right),y_i\right)$$
fixed training set and a fixed $x$) between values $\hat{f}^1(x)$ and $\hat{f}^2(x)$.
as an indicator (for the $b$th bootstrap sample) of the importance of variable $j$
to prediction. These can then be averaged across the $B$ bootstrap samples to
produce
$$I_j=\frac{1}{B}\sum_{b=1}^{B} I_b^j \qquad (113)$$
With $S_j$ a corresponding standard error for the $I_b^j$, one might standardize to
$$Z_j=\frac{I_j}{S_j}$$
and compare these
for real inputs (i.e. for $j=1,\ldots,r$) against
$$\max_{j=r+1,\ldots,r+s} Z_j$$
The elimination process is intended to ultimately drop from consideration all
those predictors whose scores are not clearly bigger than those of (by construc-
tion useless) shadow predictors.
This is, of course, a heuristic and exact details vary with implementation.
But the central idea is above and makes sense. It can be applied to any bagging
context, and variants of it could be applied where one is not bagging, but other
forms of holding out a test set are employed. Typically, the prediction method
used is the random forest, because of its reputation for broad e¤ectiveness and
its independence of scaling of coordinates of the input. But there is nothing
preventing its use with, say, a linear prediction or smoothing methodology.
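The following R sketch (assumptions mine: rpart trees, squared error loss, a data frame with response column y) illustrates OOB permutation importance of the kind described above, averaging increases in OOB error across bootstrap samples as in (113).

```r
# Sketch: for each bootstrap sample b, permute one input at a time among the
# OOB cases and record the resulting increase in OOB squared-error loss.
library(rpart)

perm_importance <- function(dat, B = 100) {
  xcols <- setdiff(names(dat), "y")
  imp <- matrix(0, B, length(xcols), dimnames = list(NULL, xcols))
  for (b in 1:B) {
    idx <- sample(nrow(dat), nrow(dat), replace = TRUE)
    oob <- setdiff(seq_len(nrow(dat)), idx)
    if (length(oob) < 2) next
    fit  <- rpart(y ~ ., data = dat[idx, ])
    base <- mean((dat$y[oob] - predict(fit, dat[oob, ]))^2)
    for (v in xcols) {
      shuffled <- dat[oob, ]
      shuffled[[v]] <- sample(shuffled[[v]])   # permute the jth coordinate in the OOB cases
      imp[b, v] <- mean((dat$y[oob] - predict(fit, shuffled))^2) - base
    }
  }
  colMeans(imp)   # average across bootstrap samples, as in (113)
}
```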
and take
$$\hat{f}_{\mathrm{bump}}(x)=\hat{f}^{\,b^*}(x)$$
The idea here is that if a few cases in the training data are responsible for making
a basically good method of predictor construction perform poorly, eventually a
bootstrap sample will miss those cases and produce an effective predictor.
Rick (Wen) Zhou in his ISU PhD dissertation made another use of bootstrap-
ping, motivated by a real 2-class classi…cation problem with "covariate shift."
x values in an important test set were mostly unlike input vectors xi available
in a fairly small training set. With relatively little information available in the
training set, highly ‡exible methods like nearest neighbor classi…cation seemed
unlikely to be e¤ective. But a single simple application of a less ‡exible method-
ology (like one based on logistic regression) also seemed unlikely to be e¤ective,
because most test case input vectors were "near" at most "a few" training case
input vectors and extrapolation of some kind was unavoidable.
What Zhou settled on and ultimately found to be relatively effective was to
use (locally defined) bootstrap classifiers based on weighted bootstrap samples,
with weights chosen to depend upon the $x$ at which one is classifying. For a test
input vector $x\in\Re^p$ define weights for training case inputs $x_i$ by
$$w_i(x)=\exp\left(-\lambda\left\|x-x_i\right\|^2\right)$$
for some appropriate $\lambda>0$. For $w(x)=\sum_{i=1}^{N} w_i(x)$, a single "weighted
bootstrap" sample tailored to the input $x$ can be made by sampling $N$ training
cases iid according to the distribution over $i=1,2,\ldots,N$ with probabilities
$p_i(x)=w_i(x)/w(x)$. Upon fitting a simple form of classifier to $B$ such tailored
samples and using majority voting of those classifiers, one has a classification
decision for input $x$. It is one that respects both the likelihood that training
cases close to the input are most relevant to decisions about its likely response
and the need to enforce simplicity on the prediction.
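A hedged R sketch of the locally weighted bootstrap idea follows; the choices of lambda, B, and logistic regression via glm() are illustrative, not Zhou's actual implementation.

```r
# Sketch: for a test point x0, resample training cases with probabilities
# proportional to exp(-lambda * ||x0 - x_i||^2), fit a simple logistic
# regression to each resample, and classify by majority vote.
local_boot_classify <- function(X, y01, x0, lambda = 1, B = 50) {
  d2 <- colSums((t(X) - x0)^2)            # squared distances ||x0 - x_i||^2
  p  <- exp(-lambda * d2); p <- p / sum(p)
  votes <- replicate(B, {
    idx <- sample(nrow(X), nrow(X), replace = TRUE, prob = p)
    dat <- data.frame(y = y01[idx], X[idx, , drop = FALSE])
    fit <- glm(y ~ ., data = dat, family = binomial)
    as.numeric(predict(fit, data.frame(t(x0)), type = "response") > 0.5)
  })
  as.numeric(mean(votes) > 0.5)           # majority vote of the B tailored classifiers
}
```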
11 "Ensembles" of Predictors
Bagging combines an "ensemble" of predictors consisting of versions of a single
predictor computed from different bootstrap samples. An alternative might be
to somehow weight together (or otherwise combine) different predictors (potentially
even based on different models or methods). Here we consider 3 versions
of this basic idea of somehow combining an ensemble of predictors to produce
one better than any element of the ensemble.
We'll suppose here that $\theta_m$ is not known and that it has prior density $g_m(\theta_m)$
(for the $m$th model) and that a prior probability for model $m$ is $\pi(m)$, so that
jointly $(x,y)$, $T$, $\theta_m$, and $m$ have density
$$p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,\pi(m)$$
Given $m$ (the identity of the "correct" model) the variables $T$, $\theta_m$, and $(x,y)$
have joint density
$$p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)$$
for which the conditional mean of $y$ given $x$, $T$, $m$ is, say,
$$E[y|x,T,m]=\frac{\int\!\!\int y\,p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy}{\int\!\!\int p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy}$$
so that
$$\int\!\!\int y\,p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy = E[y|x,T,m]\int\!\!\int p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy$$
from whence
$$E[y|x,T]=\frac{\sum_{m=1}^{M} E[y|x,T,m]\,\pi(m)\int\!\!\int p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy}{\sum_{m=1}^{M}\pi(m)\int\!\!\int p_m(x,y|\theta_m)\,p_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m\,dy}$$
or
$$p_m(y|x)=\frac{\int p_m(x,y|\theta_m)\,g_m(\theta_m)\,d\theta_m}{\sum_{y=1}^{K}\int p_m(x,y|\theta_m)\,g_m(\theta_m)\,d\theta_m}$$
These differ from the previous "Bayes model averages," but they also represent
sensible ensembles of predictors appropriate in the constituent models.
is effective. Why this can improve on any single one of the $\hat{f}_m$s is in some
sense "obvious": the set of possible $w$ (over which one searches for good
weights) includes vectors with one entry 1 and all others 0. But to indicate
in a concrete setting why this might work, consider a case where $M=2$ and
according to the $P^N\times P$ joint distribution of $(T,(x,y))$
$$E\left(y-\hat{f}_1(x)\right)=0$$
and
$$E\left(y-\hat{f}_2(x)\right)=0$$
Define
$$\hat{f}=\alpha\hat{f}_1+(1-\alpha)\hat{f}_2$$
Then
$$E\left(y-\hat{f}(x)\right)^2=E\left(\alpha\left(y-\hat{f}_1(x)\right)+(1-\alpha)\left(y-\hat{f}_2(x)\right)\right)^2=(\alpha,1-\alpha)\,\mathrm{Cov}\begin{pmatrix} y-\hat{f}_1(x)\\ y-\hat{f}_2(x)\end{pmatrix}\begin{pmatrix}\alpha\\ 1-\alpha\end{pmatrix}$$
Of course, this isn't usable in practice, as the mean vector and expected cross
product matrix are unknown.
One practical possibility is to pick a "winning" $w$ on the basis of LOO cross-validation.
That is, for $\hat{f}_m^{-i}$ the $m$th predictor fit to the training set with the
$i$th case removed,
$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\left(w_0+\sum_{m=1}^{M} w_m\hat{f}_m^{-i}(x_i)\right)\right)^2$$
could be optimized as a function of $w=(w_0,w_1,\ldots,w_M)$ to produce
$w^{\mathrm{stack}}$ and the "stacked" predictor
$$\hat{f}(x)=w_0^{\mathrm{stack}}+\sum_{m=1}^{M} w_m^{\mathrm{stack}}\hat{f}_m(x)$$
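As a small illustration (not the notes' prescription), the R sketch below computes LOO predictions for two simple constituent predictors and then finds unconstrained stacking weights by least squares.

```r
# Sketch of SEL stacking for two constituent predictors with LOO-based weights.
# The constituent fits (a linear model and a mean-only model) are stand-ins.
stack_weights <- function(dat) {
  N <- nrow(dat)
  P <- matrix(0, N, 2)
  for (i in 1:N) {
    train <- dat[-i, ]
    f1 <- lm(y ~ ., data = train)   # predictor 1 fit without case i
    f2 <- lm(y ~ 1, data = train)   # predictor 2 fit without case i
    P[i, 1] <- predict(f1, dat[i, , drop = FALSE])
    P[i, 2] <- predict(f2, dat[i, , drop = FALSE])
  }
  # (w0, w1, w2) minimizing sum_i (y_i - (w0 + w1 f1^{-i}(x_i) + w2 f2^{-i}(x_i)))^2
  coef(lm(dat$y ~ P[, 1] + P[, 2]))
}
```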
through a tree-based function of their voting functions seems likely to be generally
practically effective.³⁴ Here we consider the general problem of "predictor
combination." The primary contribution it potentially offers is reduction of
model bias by adding flexibility not provided by any individual $\hat{f}_m$.
One important way to view the stacked SEL predictor
$$w_0+\sum_{m=1}^{M} w_m\hat{f}_m(x) \qquad (114)$$
is as a linear predictor based on $M$ new "features" that are the values of the
ensemble. That suggests applying some standard predictor methodology to a
"training set" consisting of $M$ vectors of predictions ... with or without some or
all of the original input variables also reused as inputs. The generalization of
ordinary stacking is then of the form
$$\hat{f}\left(\hat{f}_1(x),\hat{f}_2(x),\ldots,\hat{f}_M(x)\right) \qquad (115)$$
for some appropriate prediction algorithm $\hat{f}$. As this is more general than
ordinary stacking, it has the potential to be even more effective than a linear
combination of the $M$ predictors could be in SEL problems and is applicable to
other prediction problems.
(Generalized) Stacking is a big deal. From the earliest of the public
predictive analytics contests (the Net‡ix Prize contest run 2006-2009) it has
been common for winning predictions to be made by "end-of-game" merging of
e¤ort by two or more separate teams that in some way combine their separate
predictions. More and more references are made on contest forums to various
strategies for combining basic predictors. Multiple-level versions of the stacking
structure are even discussed.35
While the success of some (?luckiest among a number of?) ad hoc choices
of generalized stacking forms in particular situations is undeniable, principled
choices of forms and parameters for f^ (and indeed f^1 ; f^2 ; : : : ; f^M ) in display
(115) involve both logical subtleties and huge computational demands. As
always, cross-validation (or perhaps its OOB relative in the event that bagging
is involved) is the only sound basis of these choices (and subsequent assessment
of the implications of the choices).
Consider …rst a version of this problem where associated with each f^m and
with the top-level form f^ are grids of possible values of parameters and a (po-
tentially huge) product grid is searched for a best cross-validation error (and
ultimately the optimizing parameter vector is applied to make the pick-the-
winner meta-predictor (115)). For each (vector) element of the product grid,
a cross-validation error is created by holding out folds and …tting f^m s and f^
with the prescribed parameter values on the remainders and testing on the cor-
responding folds. This is a perfectly defensible strategy for choosing a version
3 4 There is, unfortunately, a large and very confused "theoretical" literature on "classi…er
fusion" mostly built around the ad hoc notion of combination via majority voting.
3 5 In truth, they are but structured versions of the general form (115)
of predictor form (115). But notice that exactly as discussed in Section 1.3.7,
the "winning" cross-validation error is not an honest indicator of the likely per-
formance of the grid point/predictor ultimately chosen. In order to honestly
estimate Err for the prediction methodology employed, one must cross-validate
the whole process. In each of K remainders one would need to make grids and
cross-validation errors for each grid point and pick a winner to predict on the
corresponding fold in order to produce a cross-validation error for the pick-the-
winner strategy. This implies a large computational load (especially if repeated
cross-validation is done) in order to choose a …nal version of super-learner for
application and assess the e¤ectiveness of the process that produced it.
A second version of this scenario might pertain where ultimately individually-
"optimized" (perhaps by cross-validation across some grid of parameter values
for each m) versions of the f^m s will be combined into a form (115) and choice of
complexity parameters for f^ then made by applying another subsequent "cross-
validation," treating the chosen forms for the f^m as …xed. The only way to
assess the potential performance of this way of predicting is to do it (K times)
on K folds and remainders. That is, within each of K remainders the whole
sequence of choosing parameters for the f^m s and subsequently for the f^ must
be repeated (by making K folds and remainders within each remainder ...
surely leading to di¤erent "best" vectors of parameters for each fold) and applied
to the corresponding fold to …nally get a cross-validation error.
In both of these scenarios, it is clear that computation grows rapidly with
the complexity of constituent predictor forms, the breadth of the optimization
desired, and the extent to which repetition of cross-validation is used.
What kind of top-level f^ should be used in predictor form (115) could be
investigated by comparison of cross-validation errors. The linear form (114) is
most common and (at least in its ad hoc application) famously successful. But
there is a very good case to be made that a random forest form has potential to
be at least as e¤ective in this role. Its invariance to scale of its inputs (inherited
from its tree-based heritage) and wide success and reputation as an all-purpose
tool make it a natural candidate.
Neural networks have the kind of "(potentially repeated) composition of mul-
tiple functions of the input vector" character evident in the form (115). That
realization perhaps motivates consideration of versions of generalized stacking
where the ensemble of predictors f^1 ; f^2 ; : : : ; f^M itself has some speci…c kind of
"neural-network-like" structure behind it. Figure 30 is a graphical representa-
tion of what is possible.
It is not at all obvious whether a neural-network-like structure for an ensem-
ble of predictors in generalized stacking is necessarily helpful in practical pre-
diction problems. The folklore in predictive analytics is that ordinary stacking
is most helpful where elements of an ensemble have small correlations. (Obvi-
ously, if they are perfectly correlated no advantage can be gained by "combin-
ing" them.) How that folklore interacts with the current popularity of "deep
learning" methods is unclear. One thing that is clear is that unthinking prolif-
eration of "layers" in development of a predictor where they really add nothing
Figure 30: An L-layer structure for prediction based on x.
plexity or regularizing parameter, as does $M$. Small $\nu$ and large $M$ correspond
to large complexity. The boosting notion is different in spirit from stacking or
model averaging, but like them ends with a linear combination of fitted forms
as a final predictor/approximator for $\mathrm{E}[y|x]$.
This kind of sequential modi…cation of a predictor is not discussed in or-
dinary regression/linear models courses because if a base predictor is an OLS
predictor for a …xed linear model, corrections to an initial …t based on this same
model …t to residuals will predict that all residuals are 0. In this circumstance
boosting does nothing to change or improve an initial OLS …t.
$$\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}L(\hat{y},y_i)\right|_{\hat{y}=\hat{f}_{m-1}(x_i)} \qquad (117)$$
These values are the elements of the negative gradient of total loss with respect
to the current predictions for the training set. Ideally, one would like to correct
$\hat{f}_{m-1}(x)$ in a way that moves each prediction of a training output $\hat{f}_{m-1}(x_i)$
by more or less a common multiple of $\tilde{y}_{im}$. To that end, one fits some SEL
predictor, say $\hat{e}_m(x)$, to "data pairs" $(x_i,\tilde{y}_{im})$. (As in the special case of
SEL boosting, typically some very simple/crude/non-complex form of "base
predictor" is used for $\hat{e}_m$.) Let $\rho_m>0$ (controlling the "step-size" in modifying
$\hat{f}_{m-1}(x)$) stand for a multiplier for $\hat{e}_m(x)$ such that
$$\sum_{i=1}^{N} L\left(\hat{f}_{m-1}(x_i)+\rho_m\hat{e}_m(x_i),y_i\right)$$
as an approximate "steepest descent" correction. Of course, other criteria
besides SEL (like AEL) could be used in fitting $\hat{e}_m(x)$ and could be allowed
to change with $m$.
The development here allows for arbitrary base predictors. But for good reasons
(especially the fact that trees are invariant to monotone transformations of
coordinates of $x$) the functions $\hat{e}_m$ are often rectangle-based (and even restricted
to single-split trees in the case of AdaBoost.M1). If a tree-building algorithm
for approximating the values (117) produces a set of non-overlapping rectangles
$R_1,R_2,\ldots,R_L$ that cover the input space, rather than using for $\hat{e}_m(x)$ in
rectangle $R_l$ some average of the values $\tilde{y}_{im}$ for training cases with $x_i\in R_l$, it
makes more sense to use
$$\hat{e}_m(x)=\arg\min_c\sum_{i \text{ s.t. } x_i\in R_l} L\left(\hat{f}_{m-1}(x_i)+c,y_i\right) \quad\text{for } x\in R_l \qquad (119)$$
and $\rho_m=1$, and this is the form typically used in gradient boosting with trees.
Update form (119) relies upon 1-dimensional optimizations of a sum of losses
for training inputs in $L$ tree-generated rectangles. Another way this idea can
be used is with rectangles formed based on values of sub-vectors of $x$ with finite
numbers of possible values. That is, consider again the context of Section 1.4.2.
For a given choice of $D$ categorical, ordinal, or finite-discrete coordinates of $x$
defining a sub-vector $\tilde{x}$, consider using
$$\hat{e}_m(x)=\arg\min_c\sum_{i \text{ s.t. } \tilde{x}_i=\tilde{x}} L\left(\hat{f}_{m-1}(x_i)+c,y_i\right) \quad\text{for } x \text{ with sub-vector } \tilde{x}$$
and $\rho_m=1$. This $\hat{e}_m(x)$ has only a finite number of possible values, one
corresponding to each of the sets $\{i\,|\,\tilde{x}_i=\tilde{x}\}$. Further, in contexts where there
are a number of potential choices of such sets of discrete coordinates of $x$, the
total losses after update (118) can be compared to choose a good sub-vector $\tilde{x}$
to use to produce $\hat{f}_m$.
SEL  We had a first look at SEL boosting in Section 11.4.1. To establish that it
is a version of gradient boosting, simply suppose now that $L(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$.
Then
$$\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}\frac{1}{2}(\hat{y}-y_i)^2\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=y_i-\hat{f}_{m-1}(x_i)$$
and for SEL the general gradient boosting corrections are indeed based on the
prediction of ordinary residuals.
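A minimal R sketch of SEL gradient boosting with single-split ("stump") base predictors follows; the function and parameter names (including the shrinkage multiplier nu) are mine.

```r
# Sketch: at each iteration fit a stump to the current residuals (the negative
# gradient under SEL) and add a shrunken version of it to the running fit.
boost_sel <- function(X, y, M = 100, nu = 0.1) {
  f <- rep(mean(y), length(y))     # initial fit
  stumps <- vector("list", M)
  for (m in 1:M) {
    r <- y - f                     # negative gradient = ordinary residuals under SEL
    best <- list(sse = Inf)
    for (j in 1:ncol(X)) for (s in unique(X[, j])) {
      left <- X[, j] <= s
      if (all(left) || all(!left)) next
      sse <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s,
                                       cl = mean(r[left]), cr = mean(r[!left]))
    }
    e <- ifelse(X[, best$j] <= best$s, best$cl, best$cr)   # stump correction e_m(x_i)
    f <- f + nu * e                                        # shrunken update of the fit
    stumps[[m]] <- best
  }
  list(fitted = f, stumps = stumps)
}
```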
AEL  If instead $L(\hat{y},y)=|\hat{y}-y|$, then
$$\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}\left|\hat{y}-y_i\right|\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=\mathrm{sign}\left(y_i-\hat{f}_{m-1}(x_i)\right)$$
So the gradient boosting update step is "fit a SEL predictor for $\pm 1$s coding the
signs of the residuals from the previous iteration." In the event that the base
predictors are regression trees, the $\hat{e}_m(x)$ in a rectangle will be a median of $\pm 1$s
coming from signs of residuals for cases with $x_i$ in the rectangle (and thus have
value either $-1$ or $1$, constant on the rectangle).
For the exponential loss $L(\hat{y},y)=\exp(-y\hat{y})$,
$$\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}\exp(-y_i\hat{y})\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=y_i\exp\left(-y_i\hat{f}_{m-1}(x_i)\right)$$
For the hinge loss $L(\hat{y},y)=(1-y\hat{y})_+$,
$$\tilde{y}_{im}=-\left.\frac{\partial}{\partial\hat{y}}(1-y_i\hat{y})_+\right|_{\hat{y}=\hat{f}_{m-1}(x_i)}=y_i\,I\left[y_i\hat{f}_{m-1}(x_i)<1\right]$$
In a $K$-class problem, the functions
$$f_k(x)=P[y=k|x]$$
are optimal and can be used to produce optimal 0-1 loss classifiers. Consider
boosting to produce approximations to $f_1(x),f_2(x),\ldots,f_{K-1}(x)$. Begin with
$K-1$ positive predictors $\hat{f}_{10}(x),\hat{f}_{20}(x),\ldots,\hat{f}_{(K-1)0}(x)$ with $\sum_{k=1}^{K-1}\hat{f}_{k0}(x)<1$.
(For example, $\hat{f}_{k0}(x)\equiv 1/K$ will serve.) Then for $\hat{y}_1,\hat{y}_2,\ldots,\hat{y}_{K-1}$ positive
with sum less than 1, with
$$L(\hat{y},y)=-\sum_{k=1}^{K-1} I[y=k]\ln(\hat{y}_k)-I[y=K]\ln\left(1-\sum_{k=1}^{K-1}\hat{y}_k\right)$$
let (for $k=1,2,\ldots,K-1$)
$$\tilde{y}_{ikm}=-\left.\frac{\partial}{\partial\hat{y}_k}L(\hat{y},y_i)\right|_{\hat{y}_k=\hat{f}_{k(m-1)}(x_i)}=I[y_i=k]\frac{1}{\hat{f}_{k(m-1)}(x_i)}-I[y_i=K]\frac{1}{1-\sum_{k=1}^{K-1}\hat{f}_{k(m-1)}(x_i)}$$
For each $k$ fit some SEL predictor, say $\hat{e}_{km}(x)$, to pairs $(x_i,\tilde{y}_{ikm})$ and for an
appropriate $\rho_m>0$ set
$$\hat{f}_{km}(x)=\hat{f}_{k(m-1)}(x)+\rho_m\hat{e}_{km}(x)$$
($\rho_m$ will need to be chosen to be small enough that all $\hat{f}_{1m}(x),\hat{f}_{2m}(x),\ldots,\hat{f}_{(K-1)m}(x)$
remain positive with sum less than 1.)
"Subsampling" or "stochastic boosting" is the practice of at each iteration
of boosting, instead of choosing an update based on the whole training set,
choosing a fraction of the training set at random and fitting to it (using a
new random selection at each update). This reduces computation time per
iteration and can also improve predictor performance (primarily by reducing
overfit?). Once more, cross-validation can inform the choice of the fraction used.
A very popular implementation of gradient boosting goes by the name "XGBoost"
(for "eXtreme Gradient Boosting"). This is an R package (with similar implementations
in other systems) that provides a lot of flexibility and code that is
very fast to run (even providing parallelization where hardware supports it).
The caret package can be used to do cross-validation based on XGBoost, allowing
one to tune on a number of algorithm complexity parameters.
11.4.4 AdaBoost.M1
Consider a 2-class 0-1 loss classification problem with $\pm 1$ coding of output $y$
($y$ takes values in $G=\{-1,1\}$). The AdaBoost.M1 algorithm is an exact variant
of the (approximate) gradient boosting algorithm, but is usually described in
other terms. We describe those terms next, and then make the connection to
general boosting.
The standard/original description of the AdaBoost.M1 algorithm is as follows.
let
$$\mathrm{err}_1=\frac{1}{N}\sum_{i=1}^{N} I[y_i\ne g_1(x_i)]$$
and define
$$\alpha_1=\ln\left(\frac{1-\mathrm{err}_1}{\mathrm{err}_1}\right)$$
4. For $m=2,3,\ldots,M$
(b) Let
$$\mathrm{err}_m=\frac{\sum_{i=1}^{N} w_{im} I[y_i\ne g_m(x_i)]}{\sum_{i=1}^{N} w_{im}}$$
(c) Set
$$\alpha_m=\ln\left(\frac{1-\mathrm{err}_m}{\mathrm{err}_m}\right)$$
(d) Update weights as
$$w_{i(m+1)}=w_{im}\exp\left(\alpha_m I[y_i\ne g_m(x_i)]\right)$$
(Classifiers $g_m$ with small $\mathrm{err}_m$ get big positive weights $\alpha_m$ in the final "voting.")
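The following R sketch (names mine) implements the standard AdaBoost.M1 description above with single-cut stump classifiers and ±1-coded y.

```r
# Sketch of AdaBoost.M1 with stumps: reweight cases, refit a single-cut
# classifier, and accumulate the alpha_m for the final weighted vote.
adaboost_m1 <- function(X, y, M = 50) {
  N <- nrow(X); w <- rep(1 / N, N)
  alphas <- numeric(M); stumps <- vector("list", M)
  for (m in 1:M) {
    best <- list(err = Inf)
    for (j in 1:ncol(X)) for (s in unique(X[, j])) for (sgn in c(-1, 1)) {
      g <- ifelse(X[, j] <= s, sgn, -sgn)        # a single-cut classifier
      err <- sum(w * (g != y)) / sum(w)
      if (err < best$err) best <- list(err = err, j = j, s = s, sgn = sgn)
    }
    alpha <- log((1 - best$err) / best$err)
    g <- ifelse(X[, best$j] <= best$s, best$sgn, -best$sgn)
    w <- w * exp(alpha * (g != y))               # step 4(d): upweight misclassified cases
    alphas[m] <- alpha; stumps[[m]] <- best
  }
  list(alphas = alphas, stumps = stumps)         # classify by the sign of the weighted vote
}
```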
Figure 31: M = 7 consecutive AdaBoost.M1 cuts for a small fake data set.
$$g_m(x)=g_{m-1}(x)+\beta_m\hat{e}_m(x) \qquad (120)$$
where $\hat{e}_m$ is an appropriate stump classifier ("a single split tree" classifier) and
$\beta_m$ is (without loss of generality) a positive constant.
Figure 32: Classi…ers corresponding to the voting functions from the cuts indi-
cated in Figure 31.
So, whatever be the positive value of $\beta_m$, $\hat{e}_m(x)$ should be chosen to minimize
the 0-1 loss error rate for a single cut classifier for cases weighted proportional
to values $\exp\left(-y_i g_{m-1}(x_i)\right)$.
Consider then choice of $\beta_m$. The derivative of the total training loss with
respect to $\beta_m$ is
$$\sum_{i \text{ s.t. } y_i\ne\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)\exp(\beta_m)-\sum_{i \text{ s.t. } y_i=\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)\exp(-\beta_m)$$
This is 0 when
$$\exp(2\beta_m)=\frac{\sum_{i \text{ s.t. } y_i=\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)}{\sum_{i \text{ s.t. } y_i\ne\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)}$$
That is, an optimal $\beta_m$ is
$$\beta_m=\frac{1}{2}\ln\left(\frac{\sum_{i \text{ s.t. } y_i=\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)}{\sum_{i \text{ s.t. } y_i\ne\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)}\right)=\frac{1}{2}\ln\left(\frac{1-r_m}{r_m}\right)$$
for
$$r_m=\frac{\sum_{i \text{ s.t. } y_i\ne\hat{e}_m(x_i)}\exp\left(-y_i g_{m-1}(x_i)\right)}{\sum_{i=1}^{N}\exp\left(-y_i g_{m-1}(x_i)\right)}$$
which is the 0-1 loss error rate for the classifier $\hat{e}_m$ where weights on points in
a training set are proportional to $\exp\left(-y_i g_{m-1}(x_i)\right)$.
Notice then that the ratios of the weights at stages $m-1$ and $m$ satisfy
$$\frac{\exp\left(-y_i g_m(x_i)\right)}{\exp\left(-y_i g_{m-1}(x_i)\right)}=\frac{\exp\left(-y_i\left(g_{m-1}(x_i)+\beta_m\hat{e}_m(x_i)\right)\right)}{\exp\left(-y_i g_{m-1}(x_i)\right)}=\exp\left(-y_i\beta_m\hat{e}_m(x_i)\right)$$
$$=\exp\left(-\frac{1}{2}\ln\left(\frac{1-r_m}{r_m}\right)\right)I[\hat{e}_m(x_i)=y_i]+\exp\left(\frac{1}{2}\ln\left(\frac{1-r_m}{r_m}\right)\right)I[\hat{e}_m(x_i)\ne y_i]$$
$$=\left(\frac{1-r_m}{r_m}\right)^{-1/2}I[\hat{e}_m(x_i)=y_i]+\left(\frac{1-r_m}{r_m}\right)^{1/2}I[\hat{e}_m(x_i)\ne y_i]$$
$$=\left(\frac{1-r_m}{r_m}\right)^{-1/2}\left(I[\hat{e}_m(x_i)=y_i]+\left(\frac{1-r_m}{r_m}\right)I[\hat{e}_m(x_i)\ne y_i]\right)$$
Since $r_m$ doesn't depend upon $i$, looking across $i$ this is proportional to a ratio
of 1 for cases with $\hat{e}_m(x_i)=y_i$ and a ratio of $(1-r_m)/r_m$ for cases with
$\hat{e}_m(x_i)\ne y_i$. That is (recalling the meaning of $r_m$) the ratios of weights for
a given case in this development are completely equivalent to those produced by
the updating prescribed in 4(d) of the standard description of AdaBoost.M1.
Ultimately then, all of this taken together establishes that this ("exact"
as opposed to "gradient") boosting development produces an $m$th iterate of
a voting function exactly half of that produced through $m$ iterations of the
standard development of AdaBoost.M1. Since the factor of $\frac{1}{2}$ is irrelevant
to the sign of the voting function, the corresponding classifier is exactly the
AdaBoost.M1 classifier.
classi…er). His company web site is https://www.rulequest.com/index.html.
His algorithms are very complicated, and complete descriptions do not seem to
be publicly available. (Though there are open source versions of some of his
algorithms, much of his work seems to be proprietary and commercial versions
of his software are no doubt more reliable than the open source versions.) Text-
book descriptions of his methods are generally vague. Probably the best ones
I know of are in the KJ book.
The basic notion of Cubist seems to be to cut up an input space, <p , into
rectangles and …t a (di¤erent) linear predictor for y in each rectangle. Consider
the rectangle
2. Exactly what inputs xj are used in each rectangle and how they are chosen
is not clear. Output for an R implementation of Cubist lists di¤erent sets
for the various rectangles.
3. Exactly how one goes from tree building to the …nal set of rules/rectangles
is not clear. Software seems to not allow control of this. Perhaps there
is some kind of combining of …nal rectangles from a tree.
4. Some sort of "smoothing" is involved. This seems to be some kind of
averaging of regressions for bigger (containing) rectangles "up the tree
branch" from a final rectangle. What this should mean is not absolutely
clear if all one has is a set of "rules," particularly if there are cuts less
extreme than a final pair defining $R_l$ that have been eliminated from
description of $R_l$. For example
$$3<x_1<5$$
is the same as
$$3<x_1<10 \text{ and } 3<x_1<5$$
Further, the form of weights used in the averaging seems completely ad
hoc.
There are two serious modi…cations of the basic "tree of regressions" notion
that are included in the R implementation of Cubist:
Part III
Intermission: Perspective and
Prediction in Practice
There is more to say about theory and speci…c methodology for statistical ma-
chine learning, but this is a sensible point at which to pause and re‡ect on the
practice of prediction in "big data" contexts. Most of the best-known prediction
methods have been discussed (the notable exception being linear classi…cation
methods and especially so-called support vector machines covered in the next
chapter) and the basic concerns to be faced have been raised. The careful
reader has what is needed in terms of statistical background to begin work
on a large prediction problem. So here we provide a bit of summary discus-
sion/perspective on beginning practice. (The material in the balance of these
notes can be studied in parallel with practice on a large real problem. It is my
belief that such practice grappling with the realities of prediction is essential to
genuine understanding of modern statistical machine learning.)
The graphic in Figure 33 is intended to provide some conceptualization of
what must be done to make predictions and honest judgments of how well they
are likely to work. The graphic is meant to indicate that a project proceeds
more or less left to right through it, but that actual practice is far too iterative
and ‡exible to be adequately represented by a ‡owchart.
One must …rst assemble a training set from whatever sources are appropriate.
Consistent with the "divide and conquer" discussion at the end of Section 11.5,
this training set could represent only a well-de…ned part of a large input space
and multiple graphics like Figure 33 in parallel would then be in order. Note
that if a breakup of the input space depends upon the data cases available
(as in Quinlan’s methodologies, where rectangles used depend upon the set
of input vectors considered) that activity is best conceptualized as happening
inside the big cross-validation box, perhaps before several parallel versions of
what is presently Figure 33. The point is that the initial development of the
training set is the conceptual base upon which all else is built and (at least if
one is hoping to have reliable cross-validation results) a "random draws from a
…xed universe" model must be a plausible description of both the elements of
the training set and additional "test" cases that are to be predicted.
Figure 33 puts "feature engineering" and "predictor …tting" activities inside
a large single activity box. These are typically spoken of as if they were distinct,
but they are largely indistinguishable/inseparable. (This is usually emphasized
quite strongly by fans of neural network prediction, where the process of devel-
oping weights for linear combinations deep in the compositional structure of the
predictor is often spoken of in terms of "learning good features" for prediction.)
The cross-validation box in Figure 33 encloses all but the assembling of the
training set. This is a reminder of the basic principle that all that will ultimately
be done to make a predictor must be done K times (on the K remainders) in
order to create a reliable assessment of the likely e¤ectiveness of a prediction
methodology. Various "tuning" or "optimizing" steps based on some "cross-
validation error" or "OOB error" measures may be employed in the …tting of
a single one of multiple predictors in an ensemble, but only the kind of com-
prehensive "complete redoing" suggested by placing everything except training
set assembly inside the largest activity box will be adequate as an indication of
likely performance on new test cases.
Ultimately, producing good predictors in big real-world problems is a highly
creative and interesting pursuit. What is presented in these notes amounts to
a set of principles and building blocks that can be assembled in myriad ways.
The fun is in …nding clever problem-speci…c ways to do the assembly that prove
to be practically e¤ective.
Part IV
Supervised Learning II: More on Classification and Additional Theory
12 Basic Linear (and a Bit on Quadratic) Methods of Classification
Consider now methods of producing prediction/classification rules $\hat{f}(x)$ taking
values in $G=\{1,2,\ldots,K\}$ that have sets $\{x\in\Re^p\,|\,\hat{f}(x)=k\}$ with boundaries
that are (mostly) defined by linear equalities
$$x'\beta=c \qquad (121)$$
coefficients $\hat{\beta}_k$) and then to employ them.
But this often fails miserably because of the possibility of "masking" if $K>2$.
One must be smarter than this. Three kinds of smarter alternatives are Linear
(and Quadratic) Discriminant Analysis, Logistic Regression, and direct searches
for separating hyperplanes. The first two of these are "statistical" in origin with
long histories in the field.
$$p(x|k)=(2\pi)^{-p/2}\left(\det\Sigma\right)^{-1/2}\exp\left(-\frac{1}{2}(x-\mu_k)'\Sigma^{-1}(x-\mu_k)\right)$$
Then it follows that
$$\ln\left(\frac{P[y=k|x]}{P[y=l|x]}\right)=\ln\left(\frac{\pi_k}{\pi_l}\right)-\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k+\frac{1}{2}\mu_l'\Sigma^{-1}\mu_l+x'\Sigma^{-1}(\mu_k-\mu_l) \qquad (122)$$
so that a theoretically optimal classifier/decision rule is
$$f(x)=\arg\max_k\left(\ln(\pi_k)-\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k+x'\Sigma^{-1}\mu_k\right)$$
and boundaries between regions in $\Re^p$ where $f(x)=k$ and $f(x)=l$ are subsets
of the sets
$$\left\{x\in\Re^p\,\Big|\,x'\Sigma^{-1}(\mu_k-\mu_l)=\ln\left(\frac{\pi_l}{\pi_k}\right)+\frac{1}{2}\mu_k'\Sigma^{-1}\mu_k-\frac{1}{2}\mu_l'\Sigma^{-1}\mu_l\right\}$$
i.e. are defined by equalities of the form (121). Figure 34 illustrates this in a
simple $K=3$ and $p=2$ context where all $\pi_k$ are the same.
simple K = 3 and p = 2 context where all k s are the same.
This is dependent upon all K conditional normal distributions having the
same covariance matrix, . In the event these are allowed to vary, condi-
tional distribution k with covariance matrix k , a theoretically optimal predic-
tor/decision rule is
1 1 0 1
f (x) = arg max ln ( k) ln (det k) (x k) k (x k)
k 2 2
and boundaries between regions in <p where f (x) = k and f (x) = l are subsets
of the sets
0 1 0 1
fx 2 <p j 12 (x k) k (x k)
1
2 (x l) l (x l) =
1 1
ln k
l 2 ln (det k) + 2 ln (det l )g
Figure 34: Contours of K = 3 bivariate normal pdfs and corresponding linear
(equal class probability) classi…cation boundaries.
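A hedged R sketch of the LDA rule above follows: it estimates the class probabilities, class means, and a pooled covariance matrix from training data and classifies by the largest linear discriminant score; all names are mine.

```r
# Sketch of LDA: estimate pi_k, mu_k, and pooled Sigma, then take
# argmax_k of ln(pi_k) - (1/2) mu_k' Sigma^{-1} mu_k + x' Sigma^{-1} mu_k.
lda_fit <- function(X, y) {
  classes <- sort(unique(y))
  mus <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
  pis <- as.numeric(table(factor(y, levels = classes))) / nrow(X)
  centered <- X - mus[match(y, classes), , drop = FALSE]
  Sigma <- crossprod(centered) / (nrow(X) - length(classes))   # pooled covariance
  list(classes = classes, mus = mus, pis = pis, Sinv = solve(Sigma))
}

lda_classify <- function(fit, x) {
  scores <- sapply(seq_along(fit$classes), function(k) {
    mu <- fit$mus[k, ]
    log(fit$pis[k]) - 0.5 * t(mu) %*% fit$Sinv %*% mu + t(x) %*% fit$Sinv %*% mu
  })
  fit$classes[which.max(scores)]   # class with the largest discriminant score
}
```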
$$\hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}_k+(1-\alpha)\hat{\Sigma}_{\mathrm{pooled}}$$
for $\alpha\in(0,1)$ and $\hat{\sigma}^2$ an estimate of variance pooled across groups $k$ and then
across coordinates $x_j$ of $x$ in LDA. Combining these two ideas, one might even
invent a two-parameter set of fitted covariance matrices
$$\hat{\Sigma}_k(\alpha,\gamma)=\alpha\hat{\Sigma}_k+(1-\alpha)\left(\gamma\hat{\Sigma}_{\mathrm{pooled}}+(1-\gamma)\hat{\sigma}^2 I\right)$$
for use in QDA. Employing these in LDA or QDA provides the flexibility
of choosing a complexity parameter or parameters and potentially improving
prediction performance.
The form $x'\beta$ is (of course and by design) linear in the coordinates of $x$. An
obvious natural generalization of this discussion is to consider discriminants that
are linear in some (non-linear) functions of the coordinates of $x$. This is simply
choosing some $M$ basis functions/transforms/features $h_m(x)$ and replacing the
$p$ coordinates of $x$ with the $M$ coordinates of $(h_1(x),h_2(x),\ldots,h_M(x))$ in the
development of LDA.
Of course, upon choosing basis functions that are all coordinates, squares
of coordinates, and products of coordinates of $x$, one produces linear (in the
basis functions) discriminants that are general quadratic functions of $x$. The
possibilities opened here are myriad and (as always) "the devil is in the details."
and note that one is free to replace $x$ and all $K$ means $\mu_k$ with respectively
$$x^*=\Sigma^{-1/2}x \quad\text{and}\quad \mu_k^*=\Sigma^{-1/2}\mu_k$$
This produces
$$\ln\left(\frac{P[y=k|x^*]}{P[y=l|x^*]}\right)=\ln\left(\frac{\pi_k}{\pi_l}\right)-\frac{1}{2}\left\|x^*-\mu_k^*\right\|^2+\frac{1}{2}\left\|x^*-\mu_l^*\right\|^2$$
and (in "sphered" form) the theoretically optimal classifier can be described as
$$f(x)=\arg\max_k\left(\ln(\pi_k)-\frac{1}{2}\left\|x^*-\mu_k^*\right\|^2\right)$$
That is, in terms of $x^*$, optimal decisions are based on ordinary Euclidean
distances to the transformed means $\mu_k^*$. Further, this form can often be made
even simpler/be seen to depend upon a lower-dimensional (than $p$) distance.
The $\mu_k^*$ typically span a subspace of $\Re^p$ of dimension $\min(p,K-1)$. For
$$M=\left(\mu_1^*,\mu_2^*,\ldots,\mu_K^*\right)_{p\times K}$$
the last equality coming because $\left(P_M x^*-\mu_k^*\right)\in C(M)$ and $(I-P_M)x^*\in
C(M)^\perp$. Since $\left\|(I-P_M)x^*\right\|^2$ doesn't depend upon $k$, the theoretically
optimal predictor/decision rule can be described as
$$f(x)=\arg\max_k\left(\ln(\pi_k)-\frac{1}{2}\left\|P_M x^*-\mu_k^*\right\|^2\right)$$
and theoretically optimal decision rules can be described in terms of the projection
of $x^*$ onto $C(M)$ and its distances to the $\mu_k^*$.
Now,
$$\frac{1}{K}MM'$$
is the (typically rank $\min(p,K-1)$) sample covariance matrix of the $\mu_k^*$ and has an eigen decomposition as
$$\frac{1}{K}MM' = VDV'$$
for
$$D = \mathrm{diag}(d_1, d_2, \ldots, d_p)$$
where
$$d_1 \ge d_2 \ge \cdots \ge d_p$$
are the eigenvalues and the columns of $V$ are orthonormal eigenvectors corresponding in order to the successively smaller eigenvalues of $K^{-1}MM'$. These $v_k$ with $d_k > 0$ specify linear combinations of the coordinates of the $\mu_l^*$, namely $\langle v_k,\mu_l^*\rangle$, with the largest possible sample variances subject to the constraints that $\|v_k\| = 1$ and $\langle v_l,v_k\rangle = 0$ for all $l < k$. These $v_k$ are perpendicular vectors in successive directions of most important unaccounted-for spread of the $\mu_k^*$.
Then, for $l \le \mathrm{rank}(MM')$ define
$$V_l = (v_1, v_2, \ldots, v_l)$$
and let
$$P_l = V_lV_l'$$
be the matrix projecting onto $C(V_l)$ in $\Re^p$. A possible "reduced rank" approximation to the theoretically optimal LDA classification rule is
$$f_l(x) = \arg\max_k\left(\ln(\pi_k) - \tfrac{1}{2}\|P_lx^*-P_l\mu_k^*\|^2\right)$$
and $l$ becomes a complexity parameter that one might optimize via cross-validation to tune or regularize the method.
Note also that for $w\in\Re^p$
$$P_lw = \sum_{k=1}^{l}\langle v_k, w\rangle v_k$$
For purposes of graphical representation of what is going on in these computations, one might replace the $p$ coordinates of $x^*$ and the means $\mu_k^*$ with the $l$ coordinates of
$$(\langle v_1,x^*\rangle, \langle v_2,x^*\rangle, \ldots, \langle v_l,x^*\rangle)' \qquad (123)$$
and of the
$$(\langle v_1,\mu_k^*\rangle, \langle v_2,\mu_k^*\rangle, \ldots, \langle v_l,\mu_k^*\rangle)' \qquad (124)$$
(that might be called "canonical coordinates"). It seems to be ordered pairs of entries of these vectors that are plotted by HTF in their Figures 4.8 and 4.11. In this regard, we need to point out that since any eigenvector $v_k$ could be replaced by $-v_k$ without any fundamental effect in the above development, the vector (123) and all of the vectors (124) could be altered by multiplication of any particular set of coordinates by $-1$. (Whether a particular algorithm for finding eigenvectors produces $v_k$ or $-v_k$ is not fundamental, and there seems to be no standard convention in this regard.) It appears that the pictures in HTF might have been made using the R function lda and its choice of signs for eigenvectors.
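In R terms (a sketch, assuming a data frame `dat` with class labels `y` and the input coordinates as the other columns), these canonical coordinates are essentially what `predict.lda` returns as discriminant scores:

```r
library(MASS)

fit    <- lda(y ~ ., data = dat)                 # LDA fit
scores <- predict(fit, dat)$x                    # "canonical coordinate" (discriminant) scores
plot(scores[, 1], scores[, 2], col = dat$y)      # the kind of plot referenced above
```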
$$P[y=k\mid x] = p_k(x;\beta) = \frac{\exp(\beta_{k0} + x'\beta_k)}{1 + \sum_{k=1}^{K-1}\exp(\beta_{k0}+x'\beta_k)} \qquad (126)$$
and a theoretically optimal (under 0-1 loss) predictor/classification rule is
$$f(x) = \arg\max_k\, p_k(x;\beta) \qquad (127)$$
As a bit of an aside, it is perhaps useful to see in forms (126) and (127) use of the softmax function with linear combinations of the coordinates of $x$ and be reminded of the neural network discussion of Section 8.2. In that regard, consider an extremely simple neural network for classification having no hidden layers and all coefficients for the last output node set to 0. That is, with no hidden layers, if in the notation of Section 8.3.1 the last column of $A^0$ by assumption contains only 0s ($A^0_K = 0$), the corresponding "neural network" for classification is exactly the $K$-class logistic regression model.
Figure 35 is a plot of three different $p=1$ forms for $p_1(x;\beta_0,\beta_1)$ in a $K=2$ model. The parameter sets are
Red: $\beta_0 = 0$, $\beta_1 = 1$;
Blue: $\beta_0 = 4$, $\beta_1 = 2$; and
Green: $\beta_0 = 2$, $\beta_1 = 2$
Figure 36: A plot of a $p=2$ form for $p_1(x;\beta_0,\beta_1)$ in a $K=2$ model (with 1-2 coding).
This is a mixture model and the complete likelihood is involved, i.e. a joint density for the $N$ pairs $(x_i,y_i)$. On the other hand, standard logistic regression methodology maximizes
$$\prod_{i=1}^{N}p_{y_i}(x_i;\beta) \qquad (128)$$
over choices of $\beta$. This is not a full likelihood, but rather one conditional on the $x_i$ observed.
In a $K=2$ case with $\pm1$ coding for $y$, the logistic regression log-likelihood has a very simple form. With
$$p_{-1}(x;\beta_0,\beta) = \frac{\exp(\beta_0+x'\beta)}{1+\exp(\beta_0+x'\beta)} \quad\text{and}\quad p_1(x;\beta_0,\beta) = \frac{1}{1+\exp(\beta_0+x'\beta)}$$
the $i$th term of the product (128) is
$$\left(\frac{\exp(\beta_0+x_i'\beta)}{1+\exp(\beta_0+x_i'\beta)}\right)^{I[y_i=-1]}\left(\frac{1}{1+\exp(\beta_0+x_i'\beta)}\right)^{I[y_i=1]}$$
(Note that this term is $(\ln 2)\,h_1\!\left(y_i(\beta_0+x_i'\beta)\right)$ for $h_1$ the first of the function "losses" considered in Section 1.5.3 in the discussion of voting functions in 2-class classification.) So ultimately, the $K=2$ log-likelihood (to be optimized in ML fitting) is
$$-\sum_{i=1}^{N}\ln\left(1+\exp\left(y_i(\beta_0+x_i'\beta)\right)\right)$$
(which is $-\ln 2$ times the total loss in the gradient boosting algorithm applied to the voting function $g(x) = \beta_0 + x'\beta$).
A general alternative to maximum likelihood (useful in avoiding overfitting for large $N$) is minimization of a criterion like
$$-\ln\left(\prod_{i=1}^{N}p_{y_i}(x_i;\beta)\right) + \mathrm{penalty}(\beta)$$
For example, in the $K=2$ case (with $\pm1$ coding) a lasso version is (for $\lambda > 0$ and $0\le\alpha\le 1$) minimization of
$$\sum_{i=1}^{N}\ln\left(1+\exp\left(y_i(\beta_0+x_i'\beta)\right)\right) + \lambda\left(\alpha\sum_{j=1}^{p}|\beta_j| + \frac{(1-\alpha)}{2}\sum_{j=1}^{p}\beta_j^2\right)$$
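This is the elastic net penalty used by the R package glmnet; a minimal sketch (assuming a numeric input matrix `X` and a vector `y` of class labels) is:

```r
library(glmnet)

# Penalized (elastic net) logistic regression: alpha = 1 gives the lasso penalty,
# alpha = 0 the ridge penalty; lambda controls the overall penalty weight
fit   <- glmnet(X, y, family = "binomial", alpha = 0.5)

# Choose lambda by cross-validation and extract coefficients at that value
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
coef(fit, s = cvfit$lambda.min)
```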
$$\hat{\beta}_0 = \hat{\beta}_0^{cc} - \ln\frac{N_0}{N_1} + \ln\frac{\hat{\pi}_0}{1-\hat{\pi}_0} \quad\text{and}\quad \hat{\beta} = \hat{\beta}^{cc}$$
³⁷ Notice that this methodology purposely creates a situation like that described in Section 1.5.1, where training set class relative frequencies are much different from actual class probabilities.
Figure 37: An example of a quadratic form ($-0.2x_1^2 - 0.3x_2^2$) used to make logistic regression probabilities that $y=1$ (for 1-2 coding).
so that the implied conditional class probabilities are appropriate for the original context. (This result is a specialization of the general formula (29) for shifting conditional probabilities for $y\mid x$ based on use of a training set with class frequencies different from the $\pi_k$.)

Good logistic regression models are the basis of good classifiers when one classifies according to the largest predicted probability. And just as the usefulness of LDA can be extended by consideration of transforms/features made from an original $p$-dimensional $x$, the same is true for logistic regression. For example, beginning with $x_1$ and $x_2$ and creating additional predictors $x_1^2, x_2^2,$ and $x_1x_2$, one can use logistic regression technology based on the 5-dimensional input $(x_1,x_2,x_1^2,x_2^2,x_1x_2)$ to create classification boundaries that are quadratic in terms of the original $x_1$ and $x_2$. An example of the kind of functional form for the conditional probability that $y=k$ given a bivariate input $x$ that can result is portrayed in Figure 37, where the quadratic form $-0.2x_1^2 - 0.3x_2^2$ is used to make logistic regression probabilities that $y=1$ (for 1-2 coding). Constant-probability contours of such a surface are ellipses in $(x_1,x_2)$-space.
value of 0 for the log-likelihood or 1 for the likelihood, satisfactory $\beta\in\Re^p$ and $\beta_0$ from an iteration of the search algorithm will produce separation.
A famous older algorithm for finding a separating hyperplane is the so-called "perceptron" algorithm. It can be defined as follows. From some starting points $\beta^0$ and $\beta_0^0$, cycle through the training data cases in order (repeatedly as needed). At any iteration $l$, take
$$\beta^l = \beta^{l-1} \text{ and } \beta_0^l = \beta_0^{l-1} \quad\text{if}\quad \begin{cases} y_i = 1 \text{ and } x_i'\beta^{l-1}+\beta_0^{l-1} > 0, \text{ or}\\ y_i = -1 \text{ and } x_i'\beta^{l-1}+\beta_0^{l-1} \le 0 \end{cases}$$
and otherwise
$$\beta^l = \beta^{l-1} + y_ix_i \quad\text{and}\quad \beta_0^l = \beta_0^{l-1} + y_i$$
$$g(x) = x'\beta + \beta_0 \qquad (129)$$
This can be thought of in terms of choosing a unit vector $u$ (or direction) in $\Re^p$ so that upon projecting the training input vectors $x_i$ onto the subspace of multiples of $u$ there is maximum separation between the convex hull of projections of the $x_i$ with $y_i = -1$ and the convex hull of projections of $x_i$ with corresponding $y_i = 1$. (The sign on $u$ is chosen to give the latter larger $x_i'u$ than the former.) If $u$ and $\beta_0$ solve this maximization problem, the (maximum) margin is then
$$M = \frac{1}{2}\left(\min_{x_i\text{ with }y_i=1}x_i'u \;-\; \max_{x_i\text{ with }y_i=-1}x_i'u\right)$$
and the constant that makes the voting function (129) take the value 0 is
$$\beta_0 = -\frac{1}{2}\left(\min_{x_i\text{ with }y_i=1}x_i'u \;+\; \max_{x_i\text{ with }y_i=-1}x_i'u\right)$$
$$\underset{u\text{ with }\|u\|=1,\;\beta_0\in\Re}{\text{maximize}}\; M \quad\text{subject to}\quad y_i\left(x_i'\frac{u}{M} + \frac{\beta_0}{M}\right)\ge 1\;\;\forall i \qquad (132)$$
Then if we let
$$\beta = \frac{u}{M}$$
it's the case that
$$\|\beta\| = \frac{1}{M} \quad\text{or}\quad M = \frac{1}{\|\beta\|}$$
so that problem (132) can be rewritten
$$\underset{\beta\in\Re^p,\;\beta_0\in\Re}{\text{minimize}}\;\frac{1}{2}\|\beta\|^2 \quad\text{subject to}\quad y_i(x_i'\beta+\beta_0)\ge 1\;\;\forall i \qquad (133)$$
For the Lagrangian
$$F_P(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N}\alpha_i\left(y_i(x_i'\beta+\beta_0)-1\right)$$
the corresponding optimality conditions include the gradient conditions
$$\frac{\partial F_P(\beta,\beta_0,\alpha)}{\partial\beta_0} = -\sum_{i=1}^{N}\alpha_iy_i = 0 \qquad (134)$$
and
$$\frac{\partial F_P(\beta,\beta_0,\alpha)}{\partial\beta} = \beta - \sum_{i=1}^{N}\alpha_iy_ix_i = 0 \qquad (135)$$
the feasibility conditions
$$y_i(x_i'\beta+\beta_0) - 1 \ge 0\;\;\forall i \qquad (136)$$
the non-negativity conditions
$$\alpha \ge 0 \qquad (137)$$
and the orthogonality conditions
$$\alpha_i\left(y_i(x_i'\beta+\beta_0)-1\right) = 0\;\;\forall i \qquad (138)$$
Condition (135) suggests the definition $\beta(\alpha) \equiv \sum_{i=1}^{N}\alpha_iy_ix_i$ (139), and plugging these into $F_P(\beta,\beta_0,\alpha)$ gives a function of $\alpha$ only
$$F_D(\alpha) \equiv \frac{1}{2}\|\beta(\alpha)\|^2 - \sum_{i=1}^{N}\alpha_i\left(y_ix_i'\beta(\alpha)-1\right)$$
$$= \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_jx_i'x_j - \sum_i\sum_j\alpha_i\alpha_jy_iy_jx_i'x_j + \sum_i\alpha_i$$
$$= \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_jx_i'x_j = \mathbf{1}'\alpha - \frac{1}{2}\alpha'H\alpha$$
for
$$\underset{N\times N}{H} = (y_iy_jx_i'x_j) \qquad (140)$$
Then the "dual" problem for problem (133) is the N -dimensional optimization
problem
1
maximize 10 0
H subject to 0 and 0
y=0 (141)
2<N 2
and apparently this problem is easily solved.
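For instance, with the quadprog package in R one might solve this dual directly (a sketch, assuming a training matrix `X` with rows $x_i'$ and labels `y` in $\{-1,1\}$; a tiny ridge is added to $H$ only to keep the quadratic program numerically positive definite):

```r
library(quadprog)

N   <- nrow(X)
H   <- (y %*% t(y)) * (X %*% t(X))            # H = (y_i y_j x_i'x_j)
sol <- solve.QP(Dmat = H + 1e-8 * diag(N),    # minimize (1/2) a'Ha - 1'a
                dvec = rep(1, N),
                Amat = cbind(y, diag(N)),     # constraints: y'a = 0 (equality), a >= 0
                bvec = rep(0, N + 1),
                meq  = 1)
alpha <- sol$solution
beta  <- colSums(alpha * y * X)               # beta(alpha) = sum_i alpha_i y_i x_i
```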
Now condition (138) implies that if $\alpha_i^{\mathrm{opt}} > 0$
$$y_i\left(x_i'\beta(\alpha^{\mathrm{opt}}) + \beta_0^{\mathrm{opt}}\right) = 1$$
so that

1. by condition (136) the corresponding $x_i$ has minimum $x_i'\beta(\alpha^{\mathrm{opt}})$ for training vectors with $y_i = 1$ or maximum $x_i'\beta(\alpha^{\mathrm{opt}})$ for training vectors with $y_i = -1$ (so that $x_i$ is a support vector for the "slab" of thickness $2M$ around a separating hyperplane),

2. $\beta_0(\alpha^{\mathrm{opt}})$ may be determined using the corresponding $x_i$ from
$$y_i\beta_0^{\mathrm{opt}} = 1 - y_ix_i'\beta(\alpha^{\mathrm{opt}}) \quad\text{i.e.}\quad \beta_0^{\mathrm{opt}} = y_i - x_i'\beta(\alpha^{\mathrm{opt}})$$

3. and, for any such support vector, $y_ix_i'\beta(\alpha^{\mathrm{opt}}) = 1 - y_i\beta_0^{\mathrm{opt}}$.
The fact (139) that $\beta(\alpha) \equiv \sum_{i=1}^{N}\alpha_iy_ix_i$ implies that only the training cases with $\alpha_i > 0$ (typically corresponding to a relatively few support vectors) determine the nature of the solution to this optimization problem. Further, for $SV$ the set of indices of support vectors in the problem,
$$\left\|\beta(\alpha^{\mathrm{opt}})\right\|^2 = \sum_{i\in SV}\sum_{j\in SV}\alpha_i^{\mathrm{opt}}\alpha_j^{\mathrm{opt}}y_iy_jx_i'x_j = \sum_{i\in SV}\alpha_i^{\mathrm{opt}}y_i\sum_{j\in SV}\alpha_j^{\mathrm{opt}}y_jx_j'x_i = \sum_{i\in SV}\alpha_i^{\mathrm{opt}}\left(1-y_i\beta_0^{\mathrm{opt}}\right) = \sum_{i\in SV}\alpha_i^{\mathrm{opt}}$$
the next to last of these equalities following from 3. above, and the last following from the gradient condition (134). Then the margin for this problem is simply
$$M = \frac{1}{\left\|\beta(\alpha^{\mathrm{opt}})\right\|} = \frac{1}{\sqrt{\sum_{i\in SV}\alpha_i^{\mathrm{opt}}}} \qquad (142)$$
$$y_i(x_i'\beta+\beta_0) + \xi_i \ge 1 \;\;\forall i$$
(the $\xi_i$ are called "slack" variables and provide some "wiggle room" in the search for a hyperplane that "nearly" separates the two classes with a good margin). We might try to control the total amount of slack allowed by setting a bound
$$\sum_{i=1}^{N}\xi_i \le C$$
implies that a budget $C$ allows for at most $C$ misclassifications in the training set. And in a non-separable case, $C$ must be allowed to be large enough so that some choice of $\beta\in\Re^p$ and $\beta_0\in\Re$ produces a classifier with training error rate no larger than $C/N$.

In any event, we consider the optimization problem
$$\underset{\beta\in\Re^p,\;\beta_0\in\Re}{\text{minimize}}\;\frac{1}{2}\|\beta\|^2 \quad\text{subject to}\quad y_i(x_i'\beta+\beta_0)+\xi_i\ge 1\;\forall i,\;\text{for some }\xi_i\ge 0\text{ with }\sum_{i=1}^{N}\xi_i\le C \qquad (143)$$
that can be thought of as generalizing the problem (133). Problem (143) is equivalent to
$$\underset{u\text{ with }\|u\|=1,\;\beta_0\in\Re}{\text{maximize}}\;M \quad\text{subject to}\quad y_i(x_i'u+\beta_0)\ge M(1-\xi_i)\;\forall i,\;\text{for some }\xi_i\ge 0\text{ with }\sum_{i=1}^{N}\xi_i\le C$$
generalizing the original problem (131). In this latter formulation, the $\xi_i$ represent fractions (of the margin) that a corresponding $x_i$ is allowed to be on the "wrong side" of its cushion around the classification boundary. $\xi_i > 1$ indicates that not only does $x_i$ violate its cushion around the surface in $\Re^p$ defined by $x'u+\beta_0 = 0$, but that the classifier misclassifies that case.

The ideas and notation of this development are illustrated in Figure 39 for a small $p=2$ problem.
A nice development on pages 376-378 of Izenman's book provides the following solution to this problem (144), parallel to the development in Section 13.1. Generalizing problem (141) is the dual problem
$$\underset{\alpha\in\Re^N}{\text{maximize}}\;\mathbf{1}'\alpha - \frac{1}{2}\alpha'H\alpha \quad\text{subject to}\quad 0\le\alpha\le C\mathbf{1} \;\text{ and }\; \alpha'y = 0 \qquad (145)$$
for
$$\underset{N\times N}{H} = (y_iy_jx_i'x_j) \qquad (146)$$
An optimizing $\beta$ is again of the form
$$\beta(\alpha^{\mathrm{opt}}) = \sum_{i\in SV}\alpha_i^{\mathrm{opt}}y_ix_i \qquad (147)$$
for $SV$ the set of indices of support vectors $x_i$, which have $\alpha_i^{\mathrm{opt}} > 0$. The points with $0 < \alpha_i^{\mathrm{opt}} < C$ will lie on the edge of the margin (have $\xi_i = 0$) and the ones with $\alpha_i^{\mathrm{opt}} = C$ have $\xi_i > 0$. Any of the support vectors on the edge of the margin (with $0 < \alpha_i^{\mathrm{opt}} < C$) may be used to solve for $\beta_0^{\mathrm{opt}}\in\Re$ as
$$\beta_0^{\mathrm{opt}} = y_i - x_i'\beta(\alpha^{\mathrm{opt}}) \qquad (148)$$
Figure 40: Two support vector classifiers for a small $p=2$ problem.
13.3.1 Heuristics

Let $K$ be a non-negative definite kernel and consider the possibility of using functions $K(x,x_1), K(x,x_2),\ldots,K(x,x_N)$ to build new ($N$-dimensional, data-dependent) feature vectors
$$k(x) = \begin{pmatrix}K(x,x_1)\\ K(x,x_2)\\ \vdots\\ K(x,x_N)\end{pmatrix}$$
for any input vector $x$ (including the $x_i$ in the training set), and rather than defining inner products for new feature vectors (for input vectors $x$ and $z$) in terms of $\Re^N$ inner products
$$\langle k(x),k(z)\rangle = k(x)'k(z) = \sum_{k=1}^{N}K(x,x_k)K(z,x_k)$$
instead consider using the abstract space inner products of corresponding functions
$$\langle K(x,\cdot),K(z,\cdot)\rangle_{\mathcal{A}} = K(x,z)$$
Then, in place of definition (140) or (146), define
$$\underset{N\times N}{H} = (y_iy_jK(x_i,x_j))$$
and let $\alpha^{\mathrm{opt}}$ solve either problem (141) or (145). With
$$\beta(\alpha^{\mathrm{opt}}) = \sum_{i=1}^{N}\alpha_i^{\mathrm{opt}}y_ik(x_i)$$
i=1
as in the developments of the previous sections, we replace the <N inner product
of ( opt ) and a feature vector k (x) with
*N + N
X opt X opt
i yi K (xi ; ) ; K (x; ) = i yi hK (xi ; ) ; K (x; )iA
i=1 i=1
A
N
X opt
= i yi K (x; xi )
i=1
Then for any $i$ for which $\alpha_i^{\mathrm{opt}} > 0$ (an index corresponding to a support feature vector in this context) we set
$$\beta_0^{\mathrm{opt}} = y_i - \sum_{j=1}^{N}\alpha_j^{\mathrm{opt}}y_jK(x_i,x_j)$$
and have an empirical analogue of voting function (129) (for the kernel case)
$$\hat{g}(x) = \sum_{i=1}^{N}\alpha_i^{\mathrm{opt}}y_iK(x,x_i) + \beta_0^{\mathrm{opt}} \qquad (150)$$
and corresponding classifier
$$\hat{f}(x) = \mathrm{sign}(\hat{g}(x)) \qquad (151)$$
13.3.2 A Penalized-Fitting Function-Space Optimization Argument

The heuristic argument for the use of kernels in the SVM context to produce form (150) and classifier (151) is clever enough that some authors simply let it stand on its own as "justification" for using "the kernel trick" of replacing $\Re^N$ inner products of feature vectors with $\mathcal{A}$ inner products of basis functions. Far more satisfying arguments can be made. One is based on an appeal to optimality/regularization considerations provided in a 2002 Machine Learning paper of Lin, Wahba, Zhang, and Lee.

Consider $\mathcal{A}$, an abstract function space³⁸ associated with the non-negative definite kernel $K$, and the penalized fitting optimization problem involving the "hinge loss" from Section 1.5.3,
$$\underset{g\in\mathcal{A},\;\beta_0\in\Re}{\text{minimize}}\;\sum_{i=1}^{N}\left(1-y_i(\beta_0+g(x_i))\right)_+ + \frac{\lambda}{2}\|g\|_{\mathcal{A}}^2 \qquad (152)$$
Dividing the whole optimization criterion in display (152) (hinge loss plus constant times squared $\mathcal{A}$ norm) by $N$, we see that an empirical version of the expected hinge loss is involved, and we can on the basis of the exposition in Section 1.5.3 hope that an element $g$ of $\mathcal{A}$ and value $\beta_0$ will be identified in the minimization so that $\beta_0 + g(x)$ is close to the voting function for the optimal 0-1 loss classifier and controls 0-1 loss error rate.

Further, recalling the form (143), the quantity $(1-y_i(x_i'\beta+\beta_0))_+$ is the fraction of the margin ($M$) by which input $x_i$ violates its cushion around the classification boundary hyperplane. (Points on the "right" side of their cushion don't get penalized at all. Ones with $(1-y_i(x_i'\beta+\beta_0))_+ = 1$ are on the classification boundary. Ones with $(1-y_i(x_i'\beta+\beta_0))_+ > 1$ are points misclassified by the voting function.) The average of such terms is an average fraction (of the margin) violation of the cushion, and the optimization seeks to control this, and so the loss really is related to the SV classification ideas.
Then, exactly as will be noted in Section 15, an optimizing $g\in\mathcal{A}$ above must be of the form
$$g(x) = \sum_{j=1}^{N}\beta_jK(x,x_j) = k(x)'\beta$$
so the minimization problem is
$$\underset{\beta\in\Re^N,\;\beta_0\in\Re}{\text{minimize}}\;\sum_{i=1}^{N}\left(1-y_i\left(\beta_0+k(x_i)'\beta\right)\right)_+ + \frac{\lambda}{2}\left\|\sum_{j=1}^{N}\beta_jK(\cdot,x_j)\right\|_{\mathcal{A}}^2$$
³⁸ To be technically precise, we are talking here about the "Reproducing Kernel Hilbert Space" (RKHS) related to $K$. This is an abstract function space $\mathcal{A}$ consisting of all linear combinations of slices of the kernel, $K(x,\cdot)$, and limits of such linear combinations.
that is,
$$\underset{\beta\in\Re^N,\;\beta_0\in\Re}{\text{minimize}}\;\sum_{i=1}^{N}\left(1-y_i\left(\beta_0+k(x_i)'\beta\right)\right)_+ + \frac{\lambda}{2}\beta'K\beta \qquad (153)$$
for
$$\underset{N\times N}{K} = (K(x_i,x_j))$$
The dual of this problem can be written as
$$\underset{\alpha\in\Re^N}{\text{maximize}}\;\mathbf{1}'\alpha - \frac{1}{2}\alpha'H\alpha \quad\text{subject to}\quad 0\le\alpha\le\mathbf{1} \;\text{ and }\; \alpha'y = 0 \qquad (154)$$
or
$$\underset{\alpha\in\Re^N}{\text{maximize}}\;\mathbf{1}'\alpha - \frac{1}{2}\cdot\frac{1}{2\lambda}\,\alpha'H\alpha \quad\text{subject to}\quad 0\le\alpha\le\mathbf{1} \;\text{ and }\; \alpha'y = 0 \qquad (155)$$
That is, the function space optimization problem (152) has a dual that is the same as that for problem (145), for an appropriate choice of $C$ and the kernel $\frac{1}{2\lambda}K(x,z)$ produced by the heuristic argument in Section 13.3.1. Then, if $\alpha^{\mathrm{opt}}$ is a solution to (154), Lin et al. say that an optimal $\beta\in\Re^N$ is
$$\beta = \frac{1}{2\lambda}\,\mathrm{diag}(y_1,\ldots,y_N)\,\alpha^{\mathrm{opt}}$$
whose entries get applied to the functions $\frac{1}{2\lambda}K(x_i,\cdot)$. Upon recognizing that these are exactly the coefficients produced by the heuristic with that kernel, it becomes evident that for the appropriate choice of $C$ and kernel $\frac{1}{2\lambda}K$, the heuristic in Section 13.3.1 produces a solution to the optimization problem (152).³⁹

³⁹ Differently put, the "kernel trick" of Section 13.3.1 applied to kernel $K$ with cost parameter $C$ solves the present optimization problem applied to a rescaled version of $K$ (determined by $C$), with a weighting $\lambda$ in problem (152) also determined by $C$.
13.3.3 A Function-Space-Support-Vector-Classifier Geometry Argument

A different line of argument produces a SVM in a way that connects it to the geometry of support vector classification in $\Re^p$. The basic idea is to recognize that one is mapping input feature vectors to an abstract function space $\mathcal{A}$ via the mapping
$$T(x)(\cdot) = K(x,\cdot)$$
and that everything subsequent to this mapping can be done fully honoring the linear space structure. That is, the translation of the support vector classifier argument should be in reference to the geometry of $\mathcal{A}$. What one is really defining is a classifier with inputs in $\mathcal{A}$. "Linear classification" in $\mathcal{A}$ is the analogue of support vector classification in $\Re^p$ if one starts from a geometric motivation like that of the support vector classifier development. One seeks a unit vector (now in $\mathcal{A}$) and a constant so that inner products of the (transformed) data case inputs with the unit vector, plus the constant, when multiplied by the $y_i$, maximize a margin subject to some relaxed constraints.

All this is writable in terms of $\mathcal{A}$. That is, one wishes to
$$\underset{U\in\mathcal{A}\text{ with }\|U\|_{\mathcal{A}}=1,\;\beta_0\in\Re}{\text{maximize}}\;M \quad\text{subject to}\quad y_i\left(\langle T(x_i),U\rangle_{\mathcal{A}}+\beta_0\right)\ge M(1-\xi_i)\;\forall i,\;\text{for some }\xi_i\ge 0\text{ with }\sum_{i=1}^{N}\xi_i\le C$$
This is equivalent to the problem
$$\underset{V\in\mathcal{A},\;\beta_0\in\Re}{\text{minimize}}\;\frac{1}{2}\|V\|_{\mathcal{A}}^2 \quad\text{subject to}\quad y_i\left(\langle T(x_i),V\rangle_{\mathcal{A}}+\beta_0\right)\ge(1-\xi_i)\;\forall i,\;\text{for some }\xi_i\ge 0\text{ with }\sum_{i=1}^{N}\xi_i\le C$$
Then either because optimization over all of $\mathcal{A}$ looks too hard, or because some "Representer Theorem" says that it is enough to do so, one might back off from optimization over $\mathcal{A}$ to optimization over the subspace spanned by the set of $N$ elements $T(x_i)$. Then writing
$$V = \sum_{i=1}^{N}\alpha_iT(x_i)$$
so that
$$\frac{1}{2}\|V\|_{\mathcal{A}}^2 = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\langle T(x_i),T(x_j)\rangle_{\mathcal{A}} = \frac{1}{2}\alpha'K\alpha$$
(again, $K$ is the Gram matrix), the optimization problem becomes
$$\underset{\alpha\in\Re^N,\;\beta_0\in\Re}{\text{minimize}}\;\frac{1}{2}\alpha'K\alpha \quad\text{subject to}\quad y_i\left(\alpha'K_i+\beta_0\right)\ge(1-\xi_i)\;\forall i,\;\text{for some }\xi_i\ge 0\text{ with }\sum_{i=1}^{N}\xi_i\le C$$
where $K_i$ is the $i$th column of the Gram matrix. For $\alpha^{\mathrm{opt}}$ and $\beta_0^{\mathrm{opt}}$ solutions to the optimization problem, and
$$V^{\mathrm{opt}} = \sum_{i=1}^{N}\alpha_i^{\mathrm{opt}}T(x_i)$$
the corresponding voting function for the derived non-linear classifier on $\Re^p$ is
$$\langle T(x),V^{\mathrm{opt}}\rangle_{\mathcal{A}} + \beta_0^{\mathrm{opt}} = \sum_{i=1}^{N}\alpha_i^{\mathrm{opt}}K(x,x_i) + \beta_0^{\mathrm{opt}}$$
and one has something very similar to the heuristic application of the "kernel trick." The question is whether it is exactly equivalent to the use of "the trick." The problem solved by $\alpha^{\mathrm{opt}}$ and $\beta_0^{\mathrm{opt}}$ is equivalent for some $\lambda > 0$ to
$$\underset{\alpha\in\Re^N,\;\beta_0\in\Re}{\text{minimize}}\;\frac{1}{2}\alpha'K\alpha + \lambda\sum_{i=1}^{N}\xi_i \quad\text{subject to}\quad y_i\left(\alpha'K_i+\beta_0\right)\ge(1-\xi_i)\;\forall i,\;\text{for some }\xi_i\ge 0 \qquad (156)$$
Comparison of display (156) to display (153) and consideration of the argument following statement (153) then shows that there is a choice of $C$ for which, when using a suitably rescaled version of the kernel $K$ (the rescaling determined by $C$), the heuristic/"kernel trick" method produces a solution to the present function-space-support-vector-classifier problem. This is the same circumstance as in the penalized fitting function space optimization argument.⁴⁰

⁴⁰ That is, the "kernel trick" applied to kernel $K$ with cost parameter $C$ solves the present geometric optimization problem applied to a rescaled version of $K$ determined by $C$, with a correspondingly adjusted cost parameter.
Figure 41: 4 SVM voting functions for a small $p=1$ example with $N=20$ cases. Red bars on the rug correspond to $y=-1$ cases and blue bars correspond to $y=1$ cases. Shown are voting functions based on kernels $K(x,z) = \exp\left(-c(x-z)^2\right)$. The black bars pointing down indicate support "vectors."

functions. The $C=1000$ pictures are closer to being the "hard margin" situation and have fewer training case errors in evidence.
Remember in all this, that SVMs built on a kernel $K$ will choose voting functions that are linear combinations of the functions $K(x_i,\cdot)$, slices of the kernel at training case inputs. That fact controls what "shapes" are possible for those voting functions. (In this regard, note that the kernel defined by the ordinary Euclidean inner product, $K(x,z) = \langle x,z\rangle$, produces linear voting functions and thus linear decision boundaries in $\Re^p$ and the special case of ordinary support vector classifiers. It is sometimes called the "linear kernel.") Finally, it is important to keep in mind that to the extent that SVMs produce good voting functions, those must be equivalent to approximate likelihood ratios. The discussion of Section 1.5.1 still stands.
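In practice one rarely codes the quadratic programming directly; a sketch using the R package e1071 (assuming a numeric input matrix `X` and labels `y` in $\{-1,1\}$) is:

```r
library(e1071)

# Radial ("Gaussian") kernel SVM; cost plays the role of the complexity
# parameter C above and gamma scales the kernel K(x, z) = exp(-gamma * ||x - z||^2)
fit  <- svm(x = X, y = factor(y), kernel = "radial", cost = 1000, gamma = 1)
yhat <- predict(fit, X)                                        # fitted classes
g    <- attributes(predict(fit, X, decision.values = TRUE))$decision.values  # voting function values
```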
A heuristic "one-versus-all" (OVA) strategy might be the following. Invent
2-class problems (K of them), the kth based on
1 if yi = k
yki =
1 otherwise
or, equivalently
0 1
X h i
f^ (x) = arg max @ I f^km (x) = 1 A
k2G
m6=k
$$L_1(\hat{y},y) = \max\left(0, |y-\hat{y}|-\epsilon\right)$$
or
$$L_2(\hat{y},y) = \max\left(0, (y-\hat{y})^2-\epsilon\right)$$
and be led to the kind of optimization methods employed in the SVM classification context. See Izenman pages 398-401 in this regard.
14 Prototype and (More on) Nearest Neighbor Methods of Classification

We saw when looking at "linear" methods of classification in Section 12 that these can reduce to classification to the class with fitted mean "closest" in some appropriate sense to an input vector $x$. A related notion is to represent classes each by several "prototype" vectors of inputs, and to classify to the class with closest prototype. In this section we have these and related nearest neighbor classifiers in view.

So consider a $K$-class classification problem (where $y$ takes values in $G = \{1,2,\ldots,K\}$) and suppose that the coordinates of input $x$ have been standardized according to training means and standard deviations. For each class $k = 1,2,\ldots,K$, represent the class by prototypes
$$z_{k1}, z_{k2}, \ldots, z_{kR}$$
and classify via $\hat{f}(x) = \arg\min_k\min_r\|x-z_{kr}\|$ (that is, one classifies to the class that has a prototype closest to $x$).

The most obvious question in using such a rule is "How does one choose the prototypes?" One standard (admittedly ad hoc, but not unreasonable) method is to use the so-called "K-means (clustering) algorithm" (see Section 17.2.1) one class at a time. (The "K" in the name of this algorithm has nothing to do with the number of classes in the present context. In fact, here the "K" naming the clustering algorithm is our present $R$, the number of prototypes used per class. And the point in applying the algorithm is not so much to see exactly how training vectors aggregate into "homogeneous" groups/clusters as it is to find a few vectors to represent them.)
For $T_k = \{x_i \text{ with corresponding } y_i = k\}$, an "$R$-means" algorithm might proceed by applying the iterations of Section 17.2.1 to $T_k$, one class at a time, as in the sketch below.
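A compact R sketch of the resulting prototype classifier (assuming a numeric matrix `X`, labels `y`, and `R` prototypes per class; `kmeans` supplies the per-class prototypes) might be:

```r
# R prototypes per class via K-means applied one class at a time
protos      <- lapply(split(as.data.frame(X), y), function(Tk) kmeans(Tk, centers = R)$centers)
proto_class <- rep(names(protos), each = R)
proto_mat   <- do.call(rbind, protos)

# classify a (standardized) input x to the class of the nearest prototype
classify <- function(x) proto_class[which.min(colSums((t(proto_mat) - x)^2))]
```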
reasonable) that moves prototypes in the direction of training input vectors in their own class and away from training input vectors from other classes. One such method is known by the name "LVQ"/"learning vector quantization." This proceeds as follows.

With a set of prototypes (chosen randomly or from an $R$-means algorithm or some other way)
$$z_{kl} \quad k=1,2,\ldots,K \text{ and } l=1,2,\ldots,R$$
one cycles through the training cases and, for a case $x_i$ whose closest prototype is $z_{kl}$, updates that prototype by
$$z_{kl}^{\mathrm{new}} = z_{kl} + \epsilon_m(x_i-z_{kl})$$
if $x_i$ belongs to class $k$, and by
$$z_{kl}^{\mathrm{new}} = z_{kl} - \epsilon_m(x_i-z_{kl})$$
otherwise (for a decreasing sequence of learning rates $\epsilon_m$).

A nearest neighbor method is to classify $x$ to the class with the largest representation in $n_l(x)$ (possibly breaking ties at random). That is, define
$$\hat{f}(x) = \arg\max_k\sum_{x_i\in n_l(x)}I[y_i=k] \qquad (157)$$
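A one-line R version of rule (157) (a sketch, assuming standardized training inputs `Xtrain` with labels `ytrain` and test inputs `Xtest`) uses the class package:

```r
library(class)

l    <- 5  # neighborhood size, a complexity parameter to be chosen (e.g. by cross-validation)
yhat <- knn(train = Xtrain, test = Xtest, cl = factor(ytrain), k = l)  # majority vote among the l nearest neighbors
```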
for
$$Q = W^{-1/2}\left(W^{-1/2}BW^{-1/2} + \epsilon I\right)W^{-1/2} = W^{-1/2}\left(B^* + \epsilon I\right)W^{-1/2} \qquad (158)$$
($\bar{x}_k$ is the average $x_i$ from class $k$ in the 50 used to create the local metric), $B$ is a weighted between-class covariance matrix of sample means
$$B = \sum_{k=1}^{K}\hat{\pi}_k(\bar{x}_k-\bar{x})(\bar{x}_k-\bar{x})'$$
Notice that in form (158), the "outside" $W^{-1/2}$ factors "sphere" $(z-x)$ differences relative to the within-class covariance structure. $B^*$ is then the between-class covariance matrix of sphered sample means. Without the $\epsilon I$, the distance would then discount differences in the directions of the eigenvectors corresponding to large eigenvalues of $B^*$ (allowing the neighborhood defined in terms of $D$ to be severely elongated in those directions). The effect of adding the $\epsilon I$ term is to limit this elongation to some degree, preventing $x_i$ "too far in terms of Euclidean distance from $x$" from being included in $n_l(x)$.
A global use of the DANN kind of thinking might be to do the following. At each training input vector $x_i\in\Re^p$, one might again use regular Euclidean distance to find, say, 50 neighbors and compute a weighted between-class-mean covariance matrix $B_i$ as above (for that $x_i$). These might be averaged to produce
$$\bar{B} = \frac{1}{N}\sum_{i=1}^{N}B_i$$
15 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction

A general framework that unifies many interesting regularized fitting methods is that of reproducing kernel Hilbert spaces (RKHSs). There is a very nice 2012 Statistics Surveys paper by Nancy Heckman (related to an older UBC Statistics Department technical report (#216)) entitled "The Theory and Application of Penalized Least Squares Methods or Reproducing Kernel Hilbert Spaces Made Easy," that is the nicest exposition I know about of the connection of this material to splines. Parts of what follows are borrowed shamelessly from her paper. There is also some very helpful stuff in CFZ Section 3.5 and scattered through Izenman about RKHSs.⁴¹
One can treat an appropriate set of functions on $[0,1]$ as a Hilbert space (a linear space with inner product where Cauchy sequences have limits) with inner product
$$\langle f,g\rangle_{\mathcal{A}} \equiv f(0)g(0) + f'(0)g'(0) + \int_0^1 f''(x)g''(x)\,dx$$
(and corresponding norm $\|h\|_{\mathcal{A}} = \langle h,h\rangle_{\mathcal{A}}^{1/2}$). With this definition of inner product and norm, for $x\in[0,1]$ the (linear) functional (a mapping $\mathcal{A}\to\Re$)
$$F_x[f] \equiv f(x)$$
has a representer of evaluation $R_x\in\mathcal{A}$ with
$$F_x[f] = \langle R_x,f\rangle_{\mathcal{A}} = f(x) \;\;\forall f\in\mathcal{A}$$

⁴¹ ... The American Statistician (2006) and "Kernel Methods in Machine Learning" by Hofmann, Schölkopf, and Smola in The Annals of Statistics (2008).
Then the function of two variables defined by
$$R(x,z) \equiv R_x(z)$$
is the reproducing kernel for $\mathcal{A}$. For the linear differential operator
$$L[f](x) = f''(x)$$
the optimization problem solved by the cubic smoothing spline is minimization of
$$\sum_{i=1}^{N}\left(y_i-F_{x_i}[h]\right)^2 + \lambda\int_0^1\left(L[h](x)\right)^2dx \qquad (159)$$
over choices of $h\in\mathcal{A}$. It is possible to show that the minimizer of the quantity (159) is necessarily of the form
$$h(x) = \beta_0 + \beta_1x + \sum_{i=1}^{N}\alpha_iR_{1x_i}(x)$$
for
$$\underset{N\times 2}{T} = (\mathbf{1},X),\quad \beta = \begin{pmatrix}\beta_0\\\beta_1\end{pmatrix},\quad\text{and}\quad K = (R_{1x_i}(x_j))$$
One may adopt an inner product for $\mathcal{A}$ of the form
$$\langle f,g\rangle_{\mathcal{A}} \equiv \sum_{k=0}^{m-1}f^{(k)}(a)g^{(k)}(a) + \int_a^bL[f](x)L[g](x)\,dx$$
and have a RKHS. The assumption is made that the functionals $F_i$ are continuous and linear, and thus that they are representable as $F_i[h] = \langle f_i,h\rangle_{\mathcal{A}}$ for some $f_i\in\mathcal{A}$. An important special case is that where $F_i[h] = h(x_i)$, but other linear functionals have been used, for example $F_i[h] = \int_a^bH_i(x)h(x)\,dx$ for known $H_i$.

The form of the reproducing kernel implied by the choice of this inner product is derivable as follows. First, there is a linearly independent set of functions $\{u_1,\ldots,u_m\}$ that is a basis for the subspace of $\mathcal{A}$ consisting of those elements $h$ for which $L[h] = 0$ (the zero function). Call this subspace $\mathcal{A}_0$. The so-called Wronskian matrix associated with these functions is then
$$\underset{m\times m}{W(x)} = \left(u_i^{(j-1)}(x)\right)$$
With
$$C = \left(W(a)W(a)'\right)^{-1}$$
let
$$R_0(x,z) = \sum_{i,j}C_{ij}u_i(x)u_j(z)$$
These facts can be used to show that a minimizer of quantity (160) exists and is of the form
$$h(x) = \sum_{k=1}^{m}\beta_ku_k(x) + \sum_{i=1}^{N}\alpha_iR_1(x_i,x)$$
For $h(x)$ of this form, loss plus penalty (160) is of the form
$$(Y-T\beta-K\alpha)'D(Y-T\beta-K\alpha) + \lambda\alpha'K\alpha$$
for
$$T = (F_i[u_j]),\quad \beta = \begin{pmatrix}\beta_1\\\beta_2\\\vdots\\\beta_m\end{pmatrix},\quad D = \mathrm{diag}(d_1,\ldots,d_m),\quad\text{and}\quad K = (F_i[R_1(\cdot,x_j)])$$
Every $f\in L_2(\mathcal{C})$ has an expansion in terms of the $\phi_i$ as $\sum_{i=1}^{\infty}\langle f,\phi_i\rangle_2\phi_i$ (for $\langle f,h\rangle_2 \equiv \int_{\mathcal{C}}f(x)h(x)\,dx$), and the ones corresponding to positive $\gamma_i$ may be taken to be continuous on $\mathcal{C}$. Further, the $\phi_i$ are "eigenfunctions" of the kernel $K$ corresponding to the "eigenvalues" $\gamma_i$ in the sense that in $L_2(\mathcal{C})$
$$\int\phi_i(z)K(z,\cdot)\,dz = \gamma_i\phi_i(\cdot)$$
supposing that the series converges appropriately (called the "dual form" of functions in the space). The former is most useful for producing simple proofs, while the second is most natural for application, since how to obtain the $\gamma_i$ and corresponding $\phi_i$ for a given $K$ is not so obvious. Notice that
$$K(z,x) = \sum_{i=1}^{\infty}\gamma_i\phi_i(z)\phi_i(x) = \sum_{i=1}^{\infty}\left(\gamma_i\phi_i(x)\right)\phi_i(z)$$
and letting $\gamma_i\phi_i(x) = c_i(x)$, since $\sum_{i=1}^{\infty}c_i^2(x)/\gamma_i = \sum_{i=1}^{\infty}\gamma_i\phi_i^2(x) = K(x,x) < \infty$, the function $K(\cdot,x)$ is of the form (162), so that we can expect functions of the form (163) with absolutely convergent $\sum_{i=1}^{\infty}b_i$ to be of form (162).
In the space of functions (162), we define an inner product (for our Hilbert space)
$$\left\langle\sum_{i=1}^{\infty}c_i\phi_i,\;\sum_{i=1}^{\infty}d_i\phi_i\right\rangle_{\mathcal{A}} \equiv \sum_{i=1}^{\infty}\frac{c_id_i}{\gamma_i}$$
so that
$$\left\|\sum_{i=1}^{\infty}c_i\phi_i\right\|_{\mathcal{A}}^2 = \sum_{i=1}^{\infty}\frac{c_i^2}{\gamma_i}$$
Note then that for $f = \sum_{i=1}^{\infty}c_i\phi_i$ belonging to the Hilbert space $\mathcal{A}$,
$$\langle f,K(\cdot,x)\rangle_{\mathcal{A}} = \sum_{i=1}^{\infty}\frac{c_i\gamma_i\phi_i(x)}{\gamma_i} = \sum_{i=1}^{\infty}c_i\phi_i(x) = f(x)$$
and so $K(\cdot,x)$ is the representer of evaluation at $x$. Further,
$$\langle K(\cdot,z),K(\cdot,x)\rangle_{\mathcal{A}} = K(z,x)$$
which is the reproducing property of the RKHS.

Notice also that for two linear combinations of slices of the kernel function at some set of (say, $M$) inputs $\{z_i\}$ (two functions in $\mathcal{A}$ represented in dual form)
$$f(\cdot) = \sum_{i=1}^{M}c_iK(\cdot,z_i) \quad\text{and}\quad g(\cdot) = \sum_{i=1}^{M}d_iK(\cdot,z_i)$$
the corresponding $\mathcal{A}$ inner product is
$$\langle f,g\rangle_{\mathcal{A}} = \left\langle\sum_{i=1}^{M}c_iK(\cdot,z_i),\,\sum_{j=1}^{M}d_jK(\cdot,z_j)\right\rangle_{\mathcal{A}} = \sum_{i=1}^{M}\sum_{j=1}^{M}c_id_jK(z_i,z_j) = (c_1,\ldots,c_M)\left(K(z_i,z_j)\right)\begin{pmatrix}d_1\\\vdots\\d_M\end{pmatrix}$$
This is a kind of $\Re^M$ inner product of the coefficient vectors $c' = (c_1,\ldots,c_M)$ and $d' = (d_1,\ldots,d_M)$ defined by the nonnegative definite matrix $(K(z_i,z_j))$. Further, if a random $M$-vector $Y$ has covariance matrix $(K(z_i,z_j))$, this is $\mathrm{Cov}(c'Y,d'Y)$. So, in particular, for $f$ of this form $\|f\|_{\mathcal{A}}^2 = \langle f,f\rangle_{\mathcal{A}} = \mathrm{Var}(c'Y)$.
For applying this material to the fitting of training data, for $\lambda > 0$ and a loss function $L(\hat{y},y)\ge 0$, define an optimization criterion
$$\underset{f\in\mathcal{A}}{\text{minimize}}\left(\sum_{i=1}^{N}L(f(x_i),y_i) + \lambda\|f\|_{\mathcal{A}}^2\right) \qquad (164)$$
As it turns out, an optimizer of this criterion must, for the training vectors $\{x_i\}$, be of the form
$$\hat{f}(x) = \sum_{i=1}^{N}b_iK(x,x_i) \qquad (165)$$
and the corresponding $\|\hat{f}\|_{\mathcal{A}}^2$ is then
$$\langle\hat{f},\hat{f}\rangle_{\mathcal{A}} = \sum_{i=1}^{N}\sum_{j=1}^{N}b_ib_jK(x_i,x_j)$$
With the Gram matrix $K = (K(x_i,x_j))$ and $P = K^-$ a symmetric generalized inverse of $K$, and defining
$$L_N(Kb,Y) \equiv \sum_{i=1}^{N}L\left(\sum_{j=1}^{N}b_jK(x_i,x_j),\,y_i\right)$$
the problem becomes
$$\underset{b\in\Re^N}{\text{minimize}}\;L_N(Kb,Y) + \lambda b'Kb$$
i.e.
$$\underset{b\in\Re^N}{\text{minimize}}\;L_N(Kb,Y) + \lambda b'K'PKb$$
i.e.
$$\underset{v\in C(K)}{\text{minimize}}\;\left(L_N(v,Y) + \lambda v'Pv\right) \qquad (167)$$
That is, the function space optimization problem (164) reduces to the $N$-dimensional optimization problem (167). A $v\in C(K)$ (the column space of $K$) minimizing $L_N(v,Y)+\lambda v'Pv$ corresponds to $b$ minimizing $L_N(Kb,Y)+\lambda b'Kb$ via
$$Kb = v \qquad (168)$$
For the particular special case of squared error loss, $L(\hat{y},y) = (y-\hat{y})^2$, this development has a very explicit punch line. That is,
$$L_N(Kb,Y) + \lambda b'Kb = (Y-Kb)'(Y-Kb) + \lambda b'Kb$$
is minimized over $b$ by $\hat{b} = (K+\lambda I)^{-1}Y$, so that
$$\hat{Y} = \hat{v} = K(K+\lambda I)^{-1}Y \qquad (169)$$
Then using fact (169), under squared error loss the solution to problem (164) is, from expression (165),
$$\hat{f}(x) = \sum_{i=1}^{N}\hat{b}_iK(x,x_i) \qquad (170)$$
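A base-R sketch of this squared error loss "punch line" (a Gaussian/radial kernel is assumed purely for illustration, along with a training matrix `X`, outputs `y`, and a penalty weight `lambda`) is:

```r
# Kernel ridge fit: b-hat = (K + lambda I)^{-1} Y, f-hat(x) = sum_i b_i K(x, x_i)
gauss_kernel <- function(A, B, gamma = 1) {
  # matrix of K(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-gamma * d2)
}

Kmat  <- gauss_kernel(X, X)                         # Gram matrix on training inputs
bhat  <- solve(Kmat + lambda * diag(nrow(X)), y)    # (K + lambda I)^{-1} Y
f_hat <- function(xnew) gauss_kernel(xnew, X) %*% bhat  # predictions at new inputs
```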
for eigenvalues $\gamma_1\ge\gamma_2\ge\cdots\ge\gamma_N > 0$, where the eigenvector columns of $U$ comprise an orthonormal basis for $\Re^N$. The penalty in form (167) is
$$\lambda v'Pv = \lambda v'U\,\mathrm{diag}(1/\gamma_1,1/\gamma_2,\ldots,1/\gamma_N)\,U'v = \lambda\sum_{j=1}^{N}\frac{1}{\gamma_j}\langle v,u_j\rangle^2$$
The extra generality provided by this theorem for the squared error loss case treated above is that it provides for linear combinations of the functions $\phi_j(x)$ to be unpenalized in fitting. Then for
$$\underset{N\times M}{\Phi} = (\phi_j(x_i))$$
and
$$R = \left(I - \Phi(\Phi'\Phi)^{-1}\Phi'\right)Y$$
an optimizing $\beta$ is $\hat{\beta} = (\Phi'\Phi)^{-1}\Phi'Y$ and $\hat{\alpha}$ optimizes
$$(R-K\alpha)'(R-K\alpha) + \lambda\alpha'K\alpha$$
and the earlier argument implies that $\hat{\alpha} = (K+\lambda I)^{-1}R$.
and the basis functions $K(x,x_i)$ are essentially spherically symmetric normal pdfs with mean vectors $x_i$. (These are "Gaussian radial basis functions" and for $p=2$, functions (165) produce prediction surfaces in 3-space that have smooth symmetric "mountains" or "craters" at each $x_i$, of elevation or depth relative to the rest of the surface governed by $b_i$ and extent governed by the kernel's scale parameter.) Of course, Section 1.4.3 provides a number of insights that enable the creation of a wide variety of kernels beyond the few mentioned here.
Then, for example, the standard development of so-called "support vector classifiers" in a 2-class context with $y$ taking values $\pm1$ uses some kernel $K(z,x)$, voting function
$$g(x) = b_0 + \sum_{i=1}^{N}b_iK(x,x_i)$$
and the hinge loss
$$L(g(x),y) = \left[1-yg(x)\right]_+$$
15.3.2 Addendum Regarding the Structures of the Spaces Related to a Kernel

An amplification of some aspects of the basic description provided above for the RKHS corresponding to a kernel $K:\mathcal{C}\times\mathcal{C}\to\Re$ is as follows. From the representation
$$K(z,x) = \sum_{i=1}^{\infty}\gamma_i\phi_i(z)\phi_i(x) \qquad (171)$$
write
$$\phi_i^* = \sqrt{\gamma_i}\,\phi_i$$
(Note that considered as functions in $L_2(\mathcal{C})$ the $\phi_i^*$ are orthogonal, but not generally orthonormal, since $\langle\phi_i^*,\phi_i^*\rangle_2 \equiv \int_{\mathcal{C}}\phi_i^{*2}(x)\,dx = \gamma_i$, which is typically not 1.) Representation (171) suggests that one think about the inner product for inputs provided by the kernel in terms of a transform of an input vector $x\in\Re^p$ to an infinite-dimensional feature vector
$$\left(\phi_1^*(x),\phi_2^*(x),\ldots\right)'$$
and then "ordinary $\Re^\infty$ inner products" defined on those feature vectors.

The function space $\mathcal{A}$ has members of the (primal) form
$$f(x) = \sum_{i=1}^{\infty}c_i\phi_i(x) \quad\text{for } c_i \text{ with } \sum_{i=1}^{\infty}\frac{c_i^2}{\gamma_i} < \infty$$
So, two elements of $\mathcal{A}$ written in terms of the $\phi_i^*$ (instead of their multiples $\phi_i$), say $\sum_{i=1}^{\infty}c_i\phi_i^*$ and $\sum_{i=1}^{\infty}d_i\phi_i^*$, have $\mathcal{A}$ inner product that is the ordinary $\Re^\infty$ inner product of their vectors of coefficients.

Now consider the function
$$K(\cdot,x) = \sum_{i=1}^{\infty}\phi_i^*(\cdot)\phi_i^*(x)$$
For $f = \sum_{i=1}^{\infty}c_i\phi_i^*\in\mathcal{A}$,
$$\langle f,K(\cdot,x)\rangle_{\mathcal{A}} = \left\langle\sum_{i=1}^{\infty}c_i\phi_i^*,\,K(\cdot,x)\right\rangle_{\mathcal{A}} = \sum_{i=1}^{\infty}c_i\phi_i^*(x) = f(x)$$
and (perhaps more clearly than above) indeed $K(\cdot,x)$ is the representer of evaluation at $x$ in the function space $\mathcal{A}$.
is a valid correlation function for a Gaussian process on $\Re^p$. Standard forms for correlation functions in one dimension are $\rho(d) = \exp(-cd^2)$ and $\rho(d) = \exp(-c|d|)$.⁴² The first produces "smoother" realizations than does the second, and in both cases the constant $c$ governs how fast realizations vary.

One may then consider the joint distribution (conditional on the $x_i$ and assuming that for the training values $y_i$ the $\epsilon_i$ are iid, independent of the process values at the $x_i$) of the training output values and the unobserved function value at an input $x$. From this, one can find the conditional mean of that value given the training data. To that end, let
$$\underset{N\times N}{\Gamma} = \left(\sigma^2\rho(x_i-x_j)\right)_{\substack{i=1,2,\ldots,N\\ j=1,2,\ldots,N}}$$
and
$$\underset{N\times 1}{\gamma(x)} = \begin{pmatrix}\sigma^2\rho(x-x_1)\\ \sigma^2\rho(x-x_2)\\ \vdots\\ \sigma^2\rho(x-x_N)\end{pmatrix}$$
Then standard multivariate normal theory says that the conditional mean given $Y$ is
$$\hat{f}(x) = \mu(x) + \gamma(x)'\left(\Gamma+\tau^2I\right)^{-1}\begin{pmatrix}y_1-\mu(x_1)\\ y_2-\mu(x_2)\\ \vdots\\ y_N-\mu(x_N)\end{pmatrix} \qquad (172)$$
Write
$$\underset{N\times 1}{w} = \left(\Gamma+\tau^2I\right)^{-1}\begin{pmatrix}y_1-\mu(x_1)\\ y_2-\mu(x_2)\\ \vdots\\ y_N-\mu(x_N)\end{pmatrix} \qquad (173)$$
and then note that form (172) implies that
$$\hat{f}(x) = \mu(x) + \sum_{i=1}^{N}w_i\sigma^2\rho(x-x_i) \qquad (174)$$
⁴² See Section 1.4.3 for other $p=1$ bounded non-negative definite functions that can be used in this way.
and we see that this development ultimately produces $\mu(x)$ plus a linear combination of the "basis functions" $\sigma^2\rho(x-x_i)$ as a predictor. Remembering that $\sigma^2\rho(x-z)$ must be positive definite and seeing the ultimate form of the predictor, we are reminded of the RKHS material.

In fact, consider the case where $\mu(x)\equiv 0$. (If one has some non-zero prior mean function, arguably that mean function should be subtracted from the raw training outputs before beginning the development of a predictor. At a minimum, output values should probably be centered before attempting development of a predictor.) Compare displays (173) and (174) to displays (169) and (170) for the $\mu(x)=0$ case. What is then clear is that the present "Bayes" Gaussian process development of a predictor under squared error loss based on a covariance function $\sigma^2\rho(x-z)$ and error variance $\tau^2$ is equivalent to a RKHS regularized fit of a function to training data based on a kernel $K(x,z) = \sigma^2\rho(x-z)$ and penalty weight $\lambda = \tau^2$.
A slightly different and semi-empirical version of this expected prediction error (176) is the "in-sample" test error (7.12) of HTF
$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{Err}(x_i) \qquad (177)$$
where the expectations indicated in form (177) are over $y_i\sim P_{y|x=x_i}$ (the entire training sample used to choose $\hat{f}$, both inputs and outputs, is being held constant in the averaging in display (177)). The difference
$$\mathrm{op} = \mathrm{Err_{in}} - \overline{\mathrm{err}}$$
is called the "optimism of the training error." HTF use the notation
$$\omega \equiv E^Y\mathrm{op}$$
where the averaging indicated by $E^Y$ is over the outputs in the training set (using the conditionally independent $y_i$'s, $y_i\sim P_{y|x=x_i}$). HTF say that for many losses
$$\omega = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}^Y(\hat{y}_i,y_i) \qquad (179)$$
For example, consider the case of squared error loss. There
$$\omega = E^Y\left(\mathrm{Err_{in}} - \overline{\mathrm{err}}\right) = E^Y\left(\frac{1}{N}\sum_{i=1}^{N}E^{Y^*}\left(y_i^*-\hat{f}(x_i)\right)^2 - \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{f}(x_i)\right)^2\right)$$
$$= \frac{2}{N}\sum_{i=1}^{N}E^Y\left(\hat{f}(x_i)\left(y_i-E[y|x=x_i]\right)\right) = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}^Y(\hat{y}_i,y_i)$$
We note that in this context, assuming that given the $x_i$ in the training data the outputs are uncorrelated and have constant variance $\sigma^2$, by relationship (57)
$$\omega = \frac{2\sigma^2}{N}\,\mathrm{df}(\hat{Y})$$
and an estimate of the in-sample test error is then
$$\overline{\mathrm{err}} + \hat{\omega} \qquad (180)$$
16.2.2 BIC

For situations where fitting is done by maximum likelihood, the Bayesian Information Criterion of Schwarz is an alternative to AIC. That is, where the joint distribution $P$ produces density $P(y|\theta,x)$ for the conditional distribution of $y|x$ and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, a (maximized) log-likelihood is
$$\mathrm{loglik} = \sum_{i=1}^{N}\log P\left(y_i|\hat{\theta},x_i\right)$$
and the so-called Bayesian information criterion is
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,\mathrm{df}$$
For $y|x$ normal with variance $\sigma^2$, up to a constant, this is
$$\mathrm{BIC} = \frac{N}{\sigma^2}\left(\overline{\mathrm{err}} + \frac{(\log N)\,\sigma^2}{N}\,\mathrm{df}(\hat{Y})\right)$$
and after switching 2 for $\log N$, BIC is a multiple of AIC. The replacement of 2 with $\log N$ means that when used to guide model/predictor selections, BIC will typically favor simpler models/predictors than will AIC.

The Bayesian origins of BIC can be developed as follows. Suppose (as in Section 11.1) that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$, corresponding density for training data
$$f_m(T|\theta_m)$$
prior density $g_m(\theta_m)$ for its parameter, and prior probability $\pi(m)$. Under 0-1 loss and uniform $\pi(m)$, one wants to choose the model $m$ maximizing
$$\int f_m(T|\theta_m)\,g_m(\theta_m)\,d\theta_m = f_m(T) = \text{the }m\text{th marginal of }T$$
16.3 Cross-Validation Estimation of Err

$K$-fold cross-validation is described in Section 1.3.6. One hopes that
$$\mathrm{CV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}L\left(\hat{f}^{k(i)}(x_i),\,y_i\right)$$
is a useful estimate of Err. Where a complexity parameter $\lambda$ is involved, one may compute
$$\mathrm{CV}(\hat{f}_\lambda)$$
as a function of $\lambda$, try to optimize, and then refit (with that $\lambda$) to the whole training set.
$K$-fold cross-validation can be expected to estimate Err for a training set size of
$$\text{"}N\text{"} = \left(1-\frac{1}{K}\right)N$$
For squared error loss and a linear fitting method with $\hat{Y} = MY$, leave-one-out ($N$-fold) cross-validation has the well-known closed form
$$\mathrm{CV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{f}^{-i}(x_i)\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i-\hat{f}(x_i)}{1-M_{ii}}\right)^2$$
(for $M_{ii}$ the $i$th diagonal element of $M$). The so-called generalized cross-validation approximation to this is the much more easily computed
$$\mathrm{GCV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i-\hat{f}(x_i)}{1-\mathrm{tr}(M)/N}\right)^2 = \frac{\overline{\mathrm{err}}}{\left(1-\mathrm{tr}(M)/N\right)^2}$$
It is worth noting (per HTF Exercise 7.7) that since $1/(1-x)^2 \approx 1+2x$ for $x$ near 0,
$$\mathrm{GCV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i-\hat{f}(x_i)}{1-\mathrm{tr}(M)/N}\right)^2 \approx \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{f}(x_i)\right)^2 + \frac{2\,\mathrm{tr}(M)}{N}\left(\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{f}(x_i)\right)^2\right)$$
which is close to AIC, the difference being that here $\sigma^2$ is being estimated based on the model being fit, as opposed to being estimated based on a low-bias/large model.
predictor $\hat{f}^{*b}$ based on $T^{*b}$.

It's not completely clear what to make of this. For one thing, the $T^{*b}$ rarely have $N$ distinct elements. In fact, the expected number of distinct cases in a bootstrap sample for $N$ of any appreciable size is about $.632N$. So roughly speaking, we might expect $\widehat{\mathrm{Err}}^{(1)}$ to estimate Err at $.632N$, not at $N$. So unless Err as a function of training set size is fairly flat to the right of $.632N$, one might expect substantial positive bias in it as an estimate of Err (at $N$).

HTF argue for
$$\widehat{\mathrm{Err}}^{(.632)} \equiv .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}$$
as a first order correction on the biased bootstrap estimate, but admit that this is not perfect either, and propose a more complicated fix (that they call $\widehat{\mathrm{Err}}^{(.632+)}$) for classification problems.
Part V
Unsupervised Learning Methods

17 Some Methods of Unsupervised Learning

As we said in Section 1.1, "supervised learning" is basically prediction of $y$ belonging to $\Re$ or some finite index set from a $p$-dimensional $x$ with coordinates each individually in $\Re$ or some finite index set, using training data pairs $(x_i,y_i)$ to produce a prediction rule
$$\hat{y} = \hat{f}(x)$$
This is one kind of discovery and exploitation of structure in the training data. As we also said in Section 1.1, "unsupervised learning" is discovery and quantification of structure in
$$\underset{N\times p}{X} = \begin{pmatrix}x_1'\\ x_2'\\ \vdots\\ x_N'\end{pmatrix}$$
alone.

17.1 Association Rules/Market Basket Analysis

Suppose the $p$ coordinates of $x$ indicate the presence or absence of items
$$s_1, s_2, \ldots, s_p$$
so that one could think of $x$ taking values in $\{0,1\}^p$, $x_j = 1$ indicating presence of item $j$ in the transaction. For two disjoint sets of items
In applications of this formalism to "market-basket analysis" it is common to call $S$, $S_1$, and $S_2$ item sets, and the statement

"the transaction includes all of both item set $S_1$ and item set $S_2$"

an association rule (with antecedent $S_1$ and consequent $S_2$). With $I_{i1}$ and $I_{i2}$ indicators of whether training case $i$ includes all items of $S_1$ and all items of $S_2$ respectively:

1. the support of the rule (also the support of the item set $S$) is
$$\frac{1}{N}\sum_{i=1}^{N}I_{i1}I_{i2}$$
(the relative frequency with which the full item set is seen in the database/training cases),

2. the confidence of the rule is
$$\frac{\sum_{i=1}^{N}I_{i1}I_{i2}}{\sum_{i=1}^{N}I_{i1}}$$
(the relative frequency with which the full item set $S$ is seen in the training cases that exhibit the smaller item set $S_1$),

3. the "expected confidence" of the rule is
$$\frac{1}{N}\sum_{i=1}^{N}I_{i2}$$
(the relative frequency with which item set $S_2$ is seen in the training cases), and

4. the lift of the rule is
$$\frac{\text{confidence}}{\text{expected confidence}} = \frac{N\sum_{i=1}^{N}I_{i1}I_{i2}}{\sum_{i=1}^{N}I_{i1}\sum_{i=1}^{N}I_{i2}}$$
(a measure of association).
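In R, the arules package computes exactly these summaries; a sketch (assuming a 0/1 incidence matrix `X` of transactions by items) is:

```r
library(arules)

trans <- as(X == 1, "transactions")   # logical incidence matrix -> transactions object
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.7))
inspect(head(sort(rules, by = "lift")))   # rules reported with support, confidence, and lift
```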
If one thinks of the cases in the training set as a random sample from some distribution on item sets (equivalently, a distribution for $x$), lets $I_1$ stand for the event that all items in $S_1$ are in the set, $I_2$ stand for the event that all items in $S_2$ are in the set, and $I$ stand for the event that all items in $S = S_1\cup S_2$ are in the set, then the support, confidence, and lift above are natural estimates of $P(I)$, $P(I\mid I_1)$, and $P(I)/\left(P(I_1)P(I_2)\right)$ respectively.

The basic thinking about association rules seems to be that usually (but perhaps not always) one wants rules with large support (so that the estimates can be reasonably expected to be reliable). Further, one then wants large confidence or lift, as these indicate that the corresponding rule will be useful in terms of understanding how the coordinates of $x$ (presence or absence of various items) are related in the database/training data. Apparently, standard practice is to identify a large number of promising item sets and association rules, and make a database of association rules that can be queried in searches like:

"Find all rules in which YYY is the consequent that have confidence over 70% and support more than 1%."

Basic questions that we have not addressed to this point are where one gets appropriate item sets $S$ and how one uses them to produce ($S_1$ and $S_2$ and) corresponding association rules. In answer to the second of these questions, one might say "consider all $2^{|S|}-2$ association rules that can be associated with a given item set." But what then are "interesting" item sets $S$, or how does one find a potentially useful set of such? We proceed to briefly consider these issues.
The "Apriori algorithm" takes a prevalence/support threshold $t$ and proceeds roughly as follows.

1. For each single item $s_j$, compute its support/prevalence
$$\frac{1}{N}\#\{i\mid x_{ij}=1\}$$
identify those with support at least $t$, and place them in the set
$$\mathcal{S}_1^t = \{\text{item sets of size 1 with support at least }t\}$$

2. For each pair of items $s_j, s_{j'}$ built from members of $\mathcal{S}_1^t$, compute the support/prevalence
$$\frac{1}{N}\#\{i\mid x_{ij}x_{ij'}=1\}$$
identify those pairs with support at least $t$, and place them in the set
$$\mathcal{S}_2^t = \{\text{item sets of size 2 with support at least }t\}$$
$$\vdots$$

m. For each $s_j, s_{j'},\ldots \in \mathcal{S}_{m-1}^t$, check to see which $m$-element item sets have support at least $t$ and place them in $\mathcal{S}_m^t$.
17.2 Clustering

Typically (but not always) the object in "clustering" is to find natural groups of rows or columns of
$$\underset{N\times p}{X} = \begin{pmatrix}x_1'\\ x_2'\\ \vdots\\ x_N'\end{pmatrix}$$
(in some contexts one may want to somehow find homogeneous "blocks" in a properly rearranged $X$). Sometimes all columns of $X$ represent values of continuous variables (so that ordinary arithmetic applied to all its elements is meaningful). But sometimes some columns correspond to ordinal or even categorical variables. In light of all this, we will let $x_i$, $i=1,2,\ldots,r$, stand for "items" to be clustered (that might be rows or columns of $X$) with entries that need not necessarily be continuous variables.

In developing and describing clustering methods, it is often useful to have a dissimilarity measure $d(x,z)$ that (at least for the items to be clustered and perhaps for other possible items) quantifies how "unalike" items are. This measure is usually chosen to satisfy

1. $d(x,z)\ge 0\;\;\forall x,z$
2. $d(x,x) = 0\;\;\forall x$
3. $d(x,z) = d(z,x)\;\;\forall x,z$
4. $d(x,z)\le d(x,w)+d(w,z)\;\;\forall x,z,w$ (the triangle inequality), and sometimes the stronger condition
4'. $d(x,z)\le\max\left(d(x,w),d(w,z)\right)\;\;\forall x,z,w$

Where 1-4 hold, $d$ is a "metric." Where 1-3 hold and the stronger condition 4' holds, $d$ is an "ultrametric."

In a case where one is clustering rows of $X$ and each column of $X$ contains values of a continuous variable, a squared Euclidean distance is a natural choice for a dissimilarity measure
$$d(x_i,x_{i'}) = \|x_i-x_{i'}\|^2 = \sum_{j=1}^{p}(x_{ij}-x_{i'j})^2$$
When instead one is clustering columns of $X$, a natural dissimilarity measure is based on sample correlations between columns,
$$d(x_j,x_{j'}) = 1 - |r_{jj'}|$$
When dissimilarities between $r$ items are organized into a (non-negative, symmetric) $r\times r$ matrix
$$D = (d_{ij}) = (d(x_i,x_j))$$
with 0s down its diagonal, the terminology "proximity matrix" is often used. For some clustering algorithms and for some purposes, the proximity matrix encodes all one needs to know about the items to do clustering. One seeks a partition of the index set $\{1,2,\ldots,r\}$ into subsets such that the $d_{ij}$ for indices within a subset are small (and the $d_{ij}$ for indices $i$ and $j$ from different subsets are large).
A $K$-means algorithm⁴³ begins with some set of $K$ distinct initial cluster centers $c_1^0,c_2^0,\ldots,c_K^0$. One assigns each $x_i$ to the center $c_{k^0(i)}^0$ minimizing $d(x_i,c_l^0)$ over choice of $l$ (creating $K$ clusters around the centers) and replaces all of the $c_k^0$ with the corresponding cluster means
$$c_k^1 = \frac{1}{\#\text{ of }i\text{ with }k^0(i)=k}\sum_i I\left[k^0(i)=k\right]x_i$$
At iteration $m$, one assigns each $x_i$ to the center $c_{k^{m-1}(i)}^{m-1}$ minimizing $d(x_i,c_l^{m-1})$ over choice of $l$ (creating $K$ clusters around the centers) and replaces all of the $c_k^{m-1}$ with the corresponding cluster means
$$c_k^m = \frac{1}{\#\text{ of }i\text{ with }k^{m-1}(i)=k}\sum_i I\left[k^{m-1}(i)=k\right]x_i$$
iterating until the clustering stabilizes.

⁴³ In this context, a natural choice of $d(x,z)$ is $\|x-z\|^2$. A fancier option might be built from some other dissimilarity measure.
for $c_1,c_2,\ldots,c_K$ the final means produced by the iterations.⁴⁴ One may then consider the (monotone) sequence of Total Within-Cluster Dissimilarities and try to identify a value $K$ beyond which there seem to be diminishing returns for increased $K$.

A more general version of this algorithm (that might be termed a $K$-medoid algorithm) doesn't require that the entries of the $x_i$ be values of continuous variables, but (since it is then unclear that one can even evaluate, let alone find a general minimizer of, $d(x_i,\cdot)$) restricts the "centers" to be original items. This algorithm begins with some set of $K$ distinct "medoids" $c_1^0,c_2^0,\ldots,c_K^0$ that are a random selection from the $r$ items $x_i$ (subject to the constraint that they are distinct). One then assigns each $x_i$ to that medoid $c_{k^0(i)}^0$ minimizing
$$d\left(x_i,c_l^0\right)$$
over choice of $l$ (creating $K$ clusters associated with the medoids) and replaces all of the $c_k^0$ with $c_k^1$, the corresponding minimizers over the $x_{i'}$ belonging to cluster $k$ of the sums
$$\sum_{i\text{ with }k^0(i)=k}d(x_i,x_{i'})$$
At iteration $m$, one assigns each $x_i$ to that medoid $c_{k^{m-1}(i)}^{m-1}$ minimizing
$$d\left(x_i,c_l^{m-1}\right)$$
over choice of $l$ (creating $K$ clusters around the medoids) and replaces all of the $c_k^{m-1}$ with $c_k^m$, the corresponding minimizers over the $x_{i'}$ belonging to cluster $k$ of the sums
$$\sum_{i\text{ with }k^{m-1}(i)=k}d(x_i,x_{i'})$$
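R sketches of both versions (assuming a numeric matrix `X` for K-means and a precomputed proximity/dissimilarity matrix `D` for the medoid version; the cluster package supplies `pam`) are:

```r
# K-means on rows of a numeric matrix X
km <- kmeans(X, centers = 4, nstart = 25)
km$cluster        # cluster labels
km$tot.withinss   # total within-cluster dissimilarity (useful for choosing K)

# K-medoids ("partitioning around medoids") from a dissimilarity matrix D
library(cluster)
pm <- pam(as.dist(D), k = 4, diss = TRUE)
pm$medoids        # indices of the medoid items
```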
1. $D(C_1,C_2) = \min\{d_{ij}\mid i\in C_1 \text{ and } j\in C_2\}$ (this is the "single linkage" or "nearest neighbor" choice),

2. $D(C_1,C_2) = \max\{d_{ij}\mid i\in C_1 \text{ and } j\in C_2\}$ (this is the "complete linkage" choice), or

3. $D(C_1,C_2) = \frac{1}{\#C_1\#C_2}\sum_{i\in C_1,\,j\in C_2}d_{ij}$ (this is the "average linkage" choice).
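For example, in R (a sketch, assuming the proximity matrix `D` of dissimilarities):

```r
hc <- hclust(as.dist(D), method = "average")  # also "single" or "complete" linkage
plot(hc)                                      # dendrogram of the agglomerative merges
clusters <- cutree(hc, k = 4)                 # cut the tree into 4 clusters
```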
17.2.3 (Mixture) Model-Based Methods

A completely different approach to clustering into $K$ clusters is based on use of mixture models. That is, for purposes of producing a clustering, one might consider acting as if items $x_1,x_2,\ldots,x_r$ are realizations of $r$ iid random vectors with parametric marginal density
$$q(x\mid\pi,\theta_1,\ldots,\theta_K) = \sum_{k=1}^{K}\pi_kp(x\mid\theta_k) \qquad (182)$$
for probabilities $\pi_k > 0$ with $\sum_{k=1}^{K}\pi_k = 1$, a fixed parametric density $p(x\mid\theta)$, and parameters $\theta_1,\ldots,\theta_K$. (Without further restrictions the family of mixture distributions specified by density (182) is not identifiable, but we'll ignore that fact for the moment.)

A useful way to think about this formalism is in terms of a $K$-class classification model where values of $y$ are latent/unobserved/completely fictitious. This produces density (182) as the marginal density of $x$. Further, in the model including a latent $y$,
$$P[y=k\mid x] = \frac{\pi_kp(x\mid\theta_k)}{\sum_{k=1}^{K}\pi_kp(x\mid\theta_k)}$$
and one might define cluster $k$ to be
$$\left\{x_i\mid P[y=k\mid x_i]\ge P[y=l\mid x_i]\;\forall l\right\}$$
This is the set of $x_i$ that would be classified to class $k$ by the optimal (Bayes) classifier.

In practice, $\pi,\theta_1,\ldots,\theta_K$ must be estimated and estimates used in place of parameters in defining clusters. That is, an implementable clustering method is to define cluster $k$ (say, $C_k$) to be
$$C_k = \left\{x_i\mid P\left[y=k\mid x_i,\hat{\pi},\hat{\theta}_1,\ldots,\hat{\theta}_K\right]\ge P\left[y=l\mid x_i,\hat{\pi},\hat{\theta}_1,\ldots,\hat{\theta}_K\right]\;\forall l\right\} \qquad (183)$$
Though (because of the label-switching non-identifiability noted above) the likelihood $L(\pi,\theta_1,\ldots,\theta_K)$ will have multiple maxima, using any such maximizer for an estimate of the parameter vector will produce the same set of clusters (183). It is common to employ the "EM algorithm" in the maximization of $L(\pi,\theta_1,\ldots,\theta_K)$ (the finding of one of many maximizers) and to include details of that algorithm in expositions of model-based clustering. However, strictly speaking, that algorithm is not intrinsic to the basic notion here, namely the use of the clusters in display (183).
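For Gaussian $p(x\mid\theta_k)$, the R package mclust packages this whole recipe (EM fitting plus classification to the largest estimated posterior class probability); a sketch, assuming a numeric matrix `X` of items, is:

```r
library(mclust)

fit <- Mclust(X, G = 4)   # fit a K = 4 component Gaussian mixture by EM
fit$classification        # the clusters C_k of display (183)
head(fit$z)               # estimated P[y = k | x_i] for each item
```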
17.2.4 Biclustering

An interesting and often useful variant of the clustering problem is one in which a doubly indexed set of observations $x_{ij}$ for $i=1,2,\ldots,I$ and $j=1,2,\ldots,J$ (that might be thought of as laid out in an $I\times J$ two-way array or table) needs to be simultaneously put into $R$ (row) clusters over index $i$ and $C$ (column) clusters over index $j$ in such a way that the $R\times C$ cells are each homogeneous. Figure 42 portrays an $I=6$ by $J=12$ toy example with values of 72 univariate $x_{ij}$ portrayed in "heat map" fashion. The object of simple biclustering is to regroup/rearrange rows and columns to make groups producing homogeneous "cells." We'll use the notation $r(i)$ for the row cluster index for data row $i$ and $c(j)$ for the column cluster index for data column $j$.

1. One begins with some clustering of rows of the data matrix into $R$ clusters and columns of the data matrix into $C$ clusters, and computes for each $(r,c)$ "cell" a sample mean of $x_{ij}$'s with $r(i)=r$ and $c(j)=c$ (with row $i$ in row cluster $r$ and column $j$ in column cluster $c$), creating an initial matrix $M$.

2. For each $r=1,2,\ldots,R$ one makes a new ($J$-dimensional) row vector "center" $v_r$ with $j$th entry $m_{rc(j)}$ and re-clusters all rows in "K-means" fashion (assigning each row of values $x_{ij}$ to the closest center using squared Euclidean $\Re^J$ distance). With this new row clustering one recomputes the matrix of means $M$.
3. For each $c=1,2,\ldots,C$ one makes a new ($I$-dimensional) column vector "center" $w_c$ with $i$th entry $m_{r(i)c}$ and re-clusters all columns in "K-means" fashion (assigning each column of values $x_{ij}$ to the closest center using squared Euclidean $\Re^I$ distance). With this new column clustering one recomputes the matrix of means $M$.

4. If $\sum_{i,j}\left(x_{ij}-m_{r(i)c(j)}\right)^2$ is small and/or has ceased to decline with iterations, the algorithm terminates. Otherwise it returns to step 2.

Various "tweaks" are applied to this algorithm to deal with the eventuality that row or column clusters go empty. Multiple random starts are employed in the search for a good biclustering. The issue of what $R$ and $C$ should be used involves weighing complexity (large numbers of clusters) against a small value of the cell inhomogeneity criterion of step 4. All of this said, the algorithm is simple and effective, and appropriate modification of it allows the direct handling of even cases where not every cell of the $I\times J$ table is full.
Chakraborty's Bayes analysis then sets priors of independence for the vector $r = (r(1),r(2),\ldots,r(I))$, the vector $c = (c(1),c(2),\ldots,c(J))$, and the $R\times C$ means $\mu_{rc}$.

A useful prior distribution for the means $\mu_{rc}$ is one of iid $\mathrm{N}(0,\sigma^2)$ variables for a parameter $\sigma^2 > 0$. Useful priors for $r$ and $c$ are based on "Polya urn schemes." Take the case of $r$. Let
$$g\left(r(I)=r\mid r(1),r(2),\ldots,r(I-1)\right) = \frac{\alpha + \#\left[r(i)=r\text{ for }i=1,2,\ldots,I-1\right]}{\alpha+I-1}$$
The case of a prior $h(c)$ is completely analogous. The parameters $\sigma^2$, $\tau^2$, and $\alpha$ are treated as tuning parameters for the analysis.
This probability structure admits very simple Gibbs MCMC sampling and
provides iterates from the posterior distribution over all of the means and (more
importantly) over the biclustering specified by the pair $(r,c)$. For a given pair $(r,c)$, rows $i$ and $i'$ with $r(i)=r(i')$ are clustered together, and columns $j$ and $j'$ with $c(j)=c(j')$ are clustered together. Observations $x_{ij}$ and $x_{i'j'}$ with both $r(i)=r(i')$ and $c(j)=c(j')$ are in the same "cell" of the two-way clustering. The MCMC provides (through simple relative frequencies for iterates $(r,c)^j$) approximate posterior probabilities that each pair of rows, each pair of columns, and each pair of observations belong together in a clustering.
There are various ways to make use of the iterates representing the posterior distribution. One is to carry along with MCMC iterates $(r,c)^j$ iterates of the means matrix $M$ (from the Li et al. algorithm) and identify an iterate with minimum $\sum_{i,j}\left(x_{ij}-m_{r(i)c(j)}\right)^2$, using that iterate to represent the posterior distribution. Another (preferable) option is to identify a "central" iterate as follows. For two pairs $(r,c)$ and $(r^*,c^*)$ one measure of their total disagreement in clustering of the $x_{ij}$'s is
$$L\left((r,c),(r^*,c^*)\right) = \sum_{(i,j),(i',j')}I\left[r(i)=r(i')\text{ and }c(j)=c(j')\right]I\left[r^*(i)\ne r^*(i')\text{ or }c^*(j)\ne c^*(j')\right]$$
$$\qquad + \sum_{(i,j),(i',j')}I\left[r(i)\ne r(i')\text{ or }c(j)\ne c(j')\right]I\left[r^*(i)=r^*(i')\text{ and }c^*(j)=c^*(j')\right]$$
the total number of pairs of $x_{ij}$'s clustered together by only one of the two associated biclusterings. For fixed $(r,c)$ one might take
$$L\left((r,c)\right) = \sum_j L\left((r,c),(r,c)^j\right)$$
for other contexts, modeling of censoring mechanisms provides Bayes analyses
where missingness is informative about the value of an unobserved xij .
Kohonen's Algorithms One begins with some set of initial cluster centers $z_{lm}^0$, $l=1,\ldots,L$ and $m=1,\ldots,M$. This might be a random selection (without replacement or the possibility of duplication) from the set of items. It might be a set of grid points in the 2-dimensional plane in $\Re^p$ defined by the first two principal components of the items $\{x_i\}_{i=1,\ldots,r}$. And there are surely other sensible
possibilities. Then define neighborhoods on the $L\times M$ grid, $\mathcal{N}(l,m)$, that are subsets of the grid "close" in some kind of distance (like regular Euclidean distance) to the various elements of the $L\times M$ grid. $\mathcal{N}(l,m)$ could be all of the grid, $(l,m)$ alone, all grid points $(l',m')$ within some constant 2-dimensional Euclidean distance of $(l,m)$, etc. Then define a weighting function on $\Re^p$, say $w(\|x\|)$, so that $w(0)=1$ and $w(\|x\|)\ge 0$ is monotone non-increasing in $\|x\|$. For some schedule of non-increasing positive constants $1 > \epsilon_1\ge\epsilon_2\ge\epsilon_3\ge\cdots$, the SOM algorithms iteratively define sets of cluster centers/prototypes $\{z_{lm}^j\}$ for $j=1,2,\ldots$

At iteration $j$, an "online" version of SOM selects (randomly or perhaps in turn from an initially randomly set ordering of the items) an item $x^j$ and

1. identifies the center/prototype $z_{lm}^{j-1}$ closest to $x^j$ in $\Re^p$, call it $b^j$, with corresponding grid coordinates $(l,m)^j$ (Izenman calls $b^j$ the "BMU" or best-matching-unit),

2. adjusts those $z_{lm}^{j-1}$ with index vectors belonging to $\mathcal{N}\left((l,m)^j\right)$ (close to the BMU index vector on the 2-dimensional grid) toward $x^j$ by the prescription
$$z_{lm}^j = z_{lm}^{j-1} + \epsilon_jw\left(\left\|z_{lm}^{j-1}-b^j\right\|\right)\left(x^j-z_{lm}^{j-1}\right)$$
(adjusting those centers different from the BMU potentially less dramatically than the BMU), and

3. for those $z_{lm}^{j-1}$ with index pairs $(l,m)$ not belonging to $\mathcal{N}\left((l,m)^j\right)$ sets
$$z_{lm}^j = z_{lm}^{j-1}$$
iterating to convergence.
At iteration $j$, a "batch" version of SOM updates all centers/prototypes $\{z_{lm}^{j-1}\}$ to $\{z_{lm}^j\}$ as follows. For each $z_{lm}^{j-1}$, let $\mathcal{X}_{lm}^{j-1}$ be the set of items for which the closest element of $\{z_{lm}^{j-1}\}$ has index pair $(l,m)$. Then update $z_{lm}^{j-1}$ as some kind of (weighted) average of the elements of $\cup_{(l,m)'\in\mathcal{N}(l,m)}\mathcal{X}_{(l,m)'}^{j-1}$ (the set of $x_i$ closest to prototypes with labels that are 2-dimensional grid neighbors of $(l,m)$). A natural form of this is to set (with $\bar{x}_{(l,m)}^{j-1}$ the obvious sample mean of the elements of $\mathcal{X}_{lm}^{j-1}$)
$$z_{lm}^j = \frac{\sum_{(l,m)'\in\mathcal{N}(l,m)}w\left(\left\|z_{lm}^{j-1}-z_{(l,m)'}^{j-1}\right\|\right)\bar{x}_{(l,m)'}^{j-1}}{\sum_{(l,m)'\in\mathcal{N}(l,m)}w\left(\left\|z_{lm}^{j-1}-z_{(l,m)'}^{j-1}\right\|\right)}$$
like different limits, but are really completely equivalent). Beyond this, what is provided by the 2-dimensional layout of indices of prototypes is not immediately obvious. It seems to be fairly common to compare an error sum of squares for a SOM to that of a $K = L\times M$ means clustering and to declare victory if the SOM sum is not much worse than the $K$-means value.
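The R package kohonen implements both the online and batch updates; a sketch (assuming a scaled numeric matrix `X` of items) is:

```r
library(kohonen)

grid <- somgrid(xdim = 6, ydim = 6, topo = "rectangular")  # the L x M grid
fit  <- som(X, grid = grid, rlen = 500)                    # fit the self-organizing map
fit$unit.classif   # grid cell (prototype) to which each item is mapped
fit$codes          # the final prototypes z_lm
```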
and set
$$\mu(u,v) = \begin{pmatrix}\mu_1(u,v)\\ \vdots\\ \mu_p(u,v)\end{pmatrix}$$
$\mu(u,v)$ then defines a continuous random map $\Re^2\to\Re^p$. For $L\times M$ points $\lambda = (l,m)$ on an integer grid in $\Re^2$, take $\mu(l,m)$ as the center of a data-generating mechanism in $\Re^p$. Then assume that $x_1,\ldots,x_r$ are iid as follows. First, one of the $L\times M$ fixed points $\lambda = (l,m)$ on the grid of interest is chosen at random, and then conditioned on this choice
$$x\sim\mathrm{MVN}_p\left(\mu(\lambda),\Sigma_\lambda\right)$$
Upon supplying suitable (values of or) prior distributions for the parameters of the $p$ Gaussian processes and priors for the covariance matrices $\Sigma_{l,m}$, MCMC will for observable $x_1,\ldots,x_r$ and corresponding latent $\lambda_1,\ldots,\lambda_r$ produce samples from a posterior distribution over all of
$$\lambda_1,\lambda_2,\ldots,\lambda_r$$
What are of most interest are the grid points for the $r$ cases, $\lambda_1,\ldots,\lambda_r$. Two cases $x_i$ and $x_{i'}$ belong to the same cluster if $\lambda_i = \lambda_{i'}$. The MCMC provides relative frequencies that approximate posterior probabilities that case $i$ and case $i'$ belong together, $P[\lambda_i=\lambda_{i'}]$. That is, one obtains an estimate $\hat{C}$ of the matrix
$$\underset{r\times r}{C} = \left(P[\lambda_i=\lambda_{i'}]\right)_{\substack{i=1,2,\ldots,r\\ i'=1,2,\ldots,r}}$$
through MCMC relative frequencies. What one is then led to seek as a final work product is an assignment of data points to grid points that
1. is consistent with C, and
2. (at least locally) more or less preserves relative distances between clusters in ℜ^p in terms of distances between corresponding grid points in ℜ^2.
For a potential assignment γ of data points to grid points (that maps {1, 2, ..., r} to the set of indices λ = (l, m) in the grid) we consider two types of penalties, one for inconsistency with C and another for failure to preserve distances. First consider disagreement with C. A measure of disparity between partitions of {1, 2, ..., r} corresponding to λ_1, ..., λ_r and to γ_1, ..., γ_r is, for a > 0 and b > 0,

L((λ_1, ..., λ_r), (γ_1, ..., γ_r)) = \sum_{i<i'} a I[λ_i = λ_{i'} and γ_i ≠ γ_{i'}] + \sum_{i<i'} b I[λ_i ≠ λ_{i'} and γ_i = γ_{i'}]
Let
M_data = max_{i,i′} ‖x_i − x_{i′}‖
(and let M_grid denote the corresponding maximum distance between pairs of points on the L × M grid). And define for K ∈ {1, 2, ..., r} the sets N_K consisting of those pairs i and i′ such that at least one of the points x_i and x_{i′} is in the K-nearest neighborhood of the other. Then, a "local multi-dimensional scaling" type penalty⁴⁵ to apply to a potential assignment γ of data points to grid points is

R_2((γ_1, ..., γ_r); K, τ) = \frac{1}{K^2}\left\{ \sum_{\substack{i<i' \text{ s.t.}\\ (i,i')\in N_K}} \left( \frac{\|x_i - x_{i'}\|}{M_{data}} - \frac{\|\gamma_i - \gamma_{i'}\|}{M_{grid}} \right)^2 \; - \; \tau \sum_{\substack{i<i' \text{ s.t.}\\ (i,i')\notin N_K}} \frac{\|\gamma_i - \gamma_{i'}\|}{M_{grid}} \right\}
for a τ > 0. (The first term penalizes failure to preserve local relative distances and the second encourages separation of mappings to points on the grid for pairs that are not neighbors in the ℜ^p dataset.)
So, in looking for a map that is consistent with the posterior distribution and preserves local relative distances, a risk/figure of merit is, for a weight w > 0,
R((γ_1, ..., γ_r); Ĉ, K, τ, w) = R_1((γ_1, ..., γ_r); Ĉ) + w R_2((γ_1, ..., γ_r); K, τ)
(R_1 being the expected disparity derived from L above using the entries of Ĉ). Exact optimization of R((γ_1, ..., γ_r); Ĉ, K, τ, w) by choice of (γ_1, ..., γ_r) is in general an NP-hard problem and is thus rarely possible. What is possible and seems to work remarkably well is to make a long MCMC run (making one's estimate Ĉ reliable) and then look for an MCMC iterate λ^j_1, ..., λ^j_r with the best value of R((λ^j_1, ..., λ^j_r); Ĉ, K, τ, w). The dissertation of Zhou provides substantial examples of the effectiveness of this strategy. The Bayes model behind the MCMC simply tends to concentrate the posterior (and thus make iterates) in a manner consistent with the clustering and distance preservation goals of SOM.
The famous "Wines" dataset has p = 13 chemical characteristics of r = 178
wine samples from 3 di¤erent cultivars (59 (red) samples. 71 (blue) samples, and
48 (violet) of the three types indexed 1-59, 60-130, and 131-178 respectively).
Figure 44 is a graphical (grey-scale) representation of C^ and a corresponding
j j
best iterate 1; : : : ; r .
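As a small illustration of the post-processing just described, the following R sketch assumes one already has a matrix lambda_draws whose rows are MCMC iterates of the grid labels (λ_1, ..., λ_r), coded as single integers indexing grid cells; it computes Ĉ and scores any candidate label vector by a simple disparity against Ĉ (taking a = b = 1).

# lambda_draws: an (n_iter x r) matrix of MCMC draws of grid-cell labels (integers)
co_cluster_matrix <- function(lambda_draws) {
  n_iter <- nrow(lambda_draws); r <- ncol(lambda_draws)
  C_hat <- matrix(0, r, r)
  for (t in 1:n_iter) {
    same <- outer(lambda_draws[t, ], lambda_draws[t, ], "==")
    C_hat <- C_hat + same
  }
  C_hat / n_iter                       # relative frequencies approximating P[lambda_i = lambda_i']
}

# disparity of one label vector gamma against C_hat (a = b = 1 here)
disparity <- function(gamma, C_hat) {
  same <- outer(gamma, gamma, "==")
  up <- upper.tri(C_hat)
  sum(C_hat[up] * (!same[up])) + sum((1 - C_hat[up]) * same[up])
}

Scanning the saved iterates for the one with smallest disparity (plus a weighted R_2 term if distance preservation is also to be scored) implements the strategy described above.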
might, but do not necessarily, come from Euclidean distances among N data vectors x_1, x_2, ..., x_N in ℜ^p.) The object of multi-dimensional scaling is to (to the extent possible) represent the N items as points z_1, z_2, ..., z_N in ℜ^q with
‖z_i − z_j‖ ≈ d_{ij}
One criterion for choosing the z_i is the least squares ("Kruskal-Shepard") stress function
S_{LS}(z_1, z_2, ..., z_N) = \sum_{i<j} (d_{ij} − ‖z_i − z_j‖)^2
This criterion treats errors in reproducing big dissimilarities exactly like it treats errors in reproducing small ones. A different point of view would make faithfulness to small dissimilarities more important than the exact reproduction of big ones. The so-called Sammon mapping criterion
S_{SM}(z_1, z_2, ..., z_N) = \sum_{i<j} \frac{(d_{ij} − ‖z_i − z_j‖)^2}{d_{ij}}
places less weight on errors in reproducing the large dissimilarities. Yet another possibility is based on a set N_k of index pairs (an index pair is in the set if one of the items is in the k-nearest neighbor neighborhood of the other). Then a stress function that emphasizes the matching of small dissimilarities and not large ones is (for some choice of τ > 0)
S_L(z_1, z_2, ..., z_N) = \sum_{i<j \text{ and } (i,j)\in N_k} (d_{ij} − ‖z_i − z_j‖)^2 \; − \; τ \sum_{i<j \text{ and } (i,j)\notin N_k} ‖z_i − z_j‖
Another version of MDS begins with similarities s_{ij} (rather than with dissimilarities d_{ij}). (One important special case of similarities derives from vectors x_1, x_2, ..., x_N in ℜ^p through centered inner products s_{ij} = ⟨x_i − x̄, x_j − x̄⟩.) A "classical scaling" criterion is
S_C(z_1, z_2, ..., z_N) = \sum_{i<j} (s_{ij} − ⟨z_i − z̄, z_j − z̄⟩)^2
HTF claim that if in fact similarities are centered inner products, classical scaling is exactly equivalent to principal components analysis.
The four scaling criteria above are all "metric" scaling criteria in that the distances ‖z_i − z_j‖ are meant to approximate the d_{ij} directly. An alternative is to attempt minimization of a non-metric stress function like
S_{NM}(z_1, z_2, ..., z_N) = \frac{\sum_{i<j} (θ(d_{ij}) − ‖z_i − z_j‖)^2}{\sum_{i<j} ‖z_i − z_j‖^2}
(for θ(·) some monotone increasing function that is also chosen in the optimization).
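In R, base cmdscale() carries out classical scaling from a dissimilarity object, and the MASS package supplies sammon() and the non-metric isoMDS(). A minimal sketch (my_data is a hypothetical data matrix):

# metric and non-metric scaling of a dissimilarity matrix
library(MASS)
d <- dist(scale(my_data))                   # dissimilarities (here Euclidean distances)
z_classical <- cmdscale(d, k = 2)           # classical scaling; for Euclidean d this reproduces PC scores
z_sammon    <- sammon(d, k = 2)$points      # Sammon mapping criterion
z_nonmetric <- isoMDS(d, k = 2)$points      # Kruskal-style non-metric scaling
plot(z_classical, xlab = "coordinate 1", ylab = "coordinate 2")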
for ‖·‖_1 the l_1 norm on ℜ^p and constants λ ≥ 0 and λ_1 ≥ 0. The last term in this expression is analogous to the lasso penalty on a vector of regression coefficients
⁴⁶We put quotes on "direction" because in this formulation v will typically not be a unit vector.
as considered in Section 3.1.2, and produces the same kind of tendency to "0 out" entries that we saw in that context. If λ_1 = 0, v is proportional to the ordinary first principal component direction. In fact, if λ = λ_1 = 0 and N > p, v = the ordinary first principal component direction is the optimizer.
For multiple components, an analogue of the first case is a set of K vectors v_k ∈ ℜ^p organized into a p × K matrix V that is part of a minimizer (over choices of p × K matrices V and p × K matrices Θ with Θ′Θ = I) of the criterion
\sum_{i=1}^{N} \left\| x_i − Θ V' x_i \right\|^2 + λ \sum_{k=1}^{K} \|v_k\|^2 + \sum_{k=1}^{K} λ_{1k} \|v_k\|_1     (185)
for constants λ ≥ 0 and λ_{1k} ≥ 0. Zou has apparently provided effective algorithms for optimizing criteria (184) or (185).
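Zou's algorithm is implemented in the R package elasticnet. The sketch below assumes that package's spca() interface; the exact argument names and the choice of penalties in para are assumptions to be checked against the package documentation.

# sketch: sparse principal components via the elasticnet package (Zou's SPCA)
library(elasticnet)
X <- scale(as.matrix(my_data), center = TRUE, scale = FALSE)  # centered data matrix (hypothetical my_data)
fit <- spca(X, K = 2,                 # number of sparse components
            para = c(0.5, 0.5),       # l1 penalties (or numbers of nonzero loadings), one per component
            type = "predictor",       # X is a data matrix, not a Gram matrix
            sparse = "penalty")       # interpret para as l1 penalties
fit$loadings                          # sparse loading vectors v_k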
or maximize
\sum_{i=1}^{N} \sum_{j=1}^{p} \left( x_{ij} \ln (WH)_{ij} − (WH)_{ij} \right)
over non-negative choices of W and H, and various algorithms for doing these have been proposed. (Notice that the second of these criteria is an extension of a loglikelihood for independent Poisson variables with means the entries of WH to cases where the x_{ij} need only be non-negative, not necessarily integer.)
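The best known of these algorithms may be the multiplicative updates of Lee and Seung. A minimal R sketch for the squared-error version of the problem (minimizing ‖X − WH‖² over non-negative W and H of rank r; the rank, random initialization, and fixed iteration count below are arbitrary illustrative choices) is:

# Lee-Seung multiplicative updates for min ||X - WH||^2 with W, H >= 0 (X must be non-negative)
nmf_ls <- function(X, r = 3, iters = 500, eps = 1e-9) {
  N <- nrow(X); p <- ncol(X)
  W <- matrix(runif(N * r), N, r)                      # random non-negative starting values
  H <- matrix(runif(r * p), r, p)
  for (it in 1:iters) {
    H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)   # update H holding W fixed
    W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)   # update W holding H fixed
  }
  list(W = W, H = H, sse = sum((X - W %*% H)^2))
}

The updates never make an entry negative, which is what keeps the factors feasible without any explicit constraint handling.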
While at first blush this enterprise seems sensible, there is a lack of uniqueness in a factorization producing a product WH, and therefore how to interpret the columns of one of the many possible Ws is not clear. (An easy way to see the lack of uniqueness is this. Suppose that all entries of the product WH are positive. Then for E a small enough (but not 0) matrix, all entries of W* ≡ W(I + E) ≠ W and H* ≡ (I + E)^{−1}H ≠ H are positive, and W*H* = WH.) Lacking some natural further restriction on the factors W and H (beyond non-negativity), it seems the practical usefulness of this basic idea is also lacking.
17.4.3 Archetypal Analysis
Another approach to finding an interpretable factorization of X was provided by Cutler and Breiman in their "archetypal analysis." Again one means to write
X ≈ W H
(with X of dimension N × p, W of dimension N × r, and H of dimension r × p) for appropriate W and H. But here two restrictions are imposed, namely that
1. the rows of W are probability vectors (so that the approximation to X is in terms of convex combinations/weighted averages of the rows of H), and
2. H = B X, where B is r × N and the rows of B are probability vectors (so that the rows of H are in turn convex combinations/weighted averages of the rows of X).
The r rows of H = BX are the "prototypes" (the archetypes) used to represent the data matrix X.
With this notation and these restrictions, (stochastic matrices) W and B are chosen to minimize
‖X − WBX‖²
It's clearly possible to rearrange the rows of a minimizing B and make corresponding changes in W without changing ‖X − WBX‖². So strictly speaking, the optimization problem has multiple solutions. But in terms of the set of rows of H (a set of prototypes of size r) it's possible that this optimization problem often has a unique solution. (Symmetries in the set of N rows of X can be used to produce examples where it's clear that genuinely different sets of prototypes produce the same minimal value of ‖X − WBX‖². But it seems likely that real datasets will usually lack such symmetries and lead to a single optimizing set of prototypes.)
Emphasis in this version of the "approximate X" problem is on the set of prototypes as "representative data cases." This has to be taken with a grain of salt, since they are nearly always near the "edges" of the dataset. This should be no surprise, as line segments between extreme cases in ℜ^p can be made to run close to cases in the "middle" of the dataset, while line segments between interior cases in the dataset can never be made to run close to extreme cases.
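An R implementation is available in the archetypes package; the short sketch below assumes its archetypes()/parameters() interface (to be checked against the package documentation).

# sketch: archetypal analysis via the archetypes package (interface assumed)
library(archetypes)
X <- as.matrix(my_data)          # hypothetical N x p data matrix
aa <- archetypes(X, k = 3)       # r = 3 archetypes (rows of H = BX)
parameters(aa)                   # the 3 archetypes in R^p (typically near the "edges" of the data)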
so that the sample covariance matrix of the data is
(1/N) X*′X* = I
Note that the columns of X* are then scaled principal components of the (centered) data matrix and we operate with and on X*. (For simplicity of notation, we'll henceforth drop the "*" on X.) This methodology seems to be an attempt to find latent probabilistic structure in terms of independent variables to account for the principal components.
In particular, in its linear form, ICA attempts to model the N (transposed) rows of X as iid of the form
x_i = A s_i     (186)
(with x_i and s_i p × 1 and A p × p) for iid vectors s_i, where the (marginal) distribution of the vectors s_i is one of independence of the p coordinates/components and the matrix A is an unknown parameter. Consistent with our sphering of the data matrix, we'll assume that Cov x = I, and without any loss of generality assume that the covariance matrix for s is not only diagonal but that Cov s = I. Since then I = Cov x = A(Cov s)A′ = AA′, A must be orthogonal, and so
A′x = s
XA then has columns that provide predictions of the N (row) p-vectors s′_i, and we might thus call those the "independent components" of X (just as we term the columns of XV the principal components of X). There is a bit of arbitrariness in the representation (186) because the ordering of the coordinates of s and the corresponding columns of A is arbitrary. But this is no serious concern.
So then, the question is what one might use as a method to estimate A in display (186). There are several possibilities. The one discussed in HTF is related to entropy and Kullback-Leibler distance. If one assumes that an (m-dimensional) random vector Y has a density p with marginal densities p_1, p_2, ..., p_m, then an "independence version" of the distribution of Y has density \prod_{j=1}^m p_j, and the (non-negative) K-L divergence of the distribution of Y from its independence version is

KL\left(p, \prod_{j=1}^m p_j\right) = \int p(y) \ln\left(\frac{p(y)}{\prod_{j=1}^m p_j(y_j)}\right) dy
 = \int p(y) \ln p(y)\, dy − \sum_{j=1}^m \int p(y) \ln p_j(y_j)\, dy
 = \int p(y) \ln p(y)\, dy − \sum_{j=1}^m \int p_j(y_j) \ln p_j(y_j)\, dy_j
 = \sum_{j=1}^m H(Y_j) − H(Y)
for H the entropy function for a random argument. Since entropy is an inverse
measure of information for a distribution, this K-L divergence is a di¤erence in
the information carried by Y (jointly) and the sum across the components of
their individual information contents. If it is small, one might loosely interpret
the components of Y as approximately independent.
If one then thinks of s as random and of the form A′x for random x, it is perhaps sensible to seek an orthogonal A to minimize (for a_j the jth column of A)

\sum_{j=1}^p H(s_j) − H(s) = \sum_{j=1}^p H(a_j′x) − H(A′x)
 = \sum_{j=1}^p H(a_j′x) − H(x) − \ln|\det A|
 = \sum_{j=1}^p H(a_j′x) − H(x)

for G(u) ≡ (1/c) ln cosh(cu) for a c ∈ [1, 2]. Then, criterion (187) has the empirical approximation

\hat{C}(A) = \sum_{j=1}^p \left( E G(Z) − \frac{1}{N} \sum_{i=1}^N G(a_j′ x_i) \right)^2
where, as usual, x′_i is the ith row of X. Â can be taken to be an optimizer of Ĉ(A).
Ultimately, this development produces a rotation matrix that makes the p
entries of rotated and scaled principal component score vectors "look as inde-
pendent as possible." This is thought of as resolution of a data matrix into its
"independent sources" and as a technique for "blind source separation."
defined on some interval [0, T], where we assume that the coordinate functions h_j(t) are smooth. With
h′(t) = (h′_1(t), h′_2(t), ..., h′_p(t))′
the "velocity vector" for the curve, ‖h′(t)‖ is then the "speed" for the curve and the arc length (distance) along h(t) from t = 0 to t = t_0 is
L_h(t_0) = \int_0^{t_0} \|h'(t)\|\, dt
The reparameterized curve h(L_h^{−1}(λ)) (for λ ∈ [0, L_h(T)]) does have unit speed and traces out the same set of points in ℜ^p that are traced out by h(t). So there is no loss of generality in assuming that parametric curves we consider here are parameterized by arc length, and we'll henceforth write h(λ).
Then, for a unit-speed parametric curve h(λ) and point x ∈ ℜ^p, we'll define the projection index
λ_h(x) = sup{λ : ‖x − h(λ)‖ = inf_μ ‖x − h(μ)‖}
This is roughly the last arc length for which the distance from x to the curve is minimum. If one thinks of x as random, the "reconstruction error"
E‖x − h(λ_h(x))‖²
(the expected squared distance between x and the curve) might be thought of as a measure of how well the curve represents the distribution. Of course, for a dataset containing N cases x_i, an empirical analog of this is
(1/N) \sum_{i=1}^N ‖x_i − h(λ_h(x_i))‖²     (190)
and a "good" curve representing the dataset should have a small value of this empirical reconstruction error. Notice however, that this can't be the only consideration. If it were, there would surely be no real difficulty in running a very wiggly (and perhaps very long) curve through every element of a dataset to produce a curve with 0 empirical reconstruction error. This suggests that with
h″(λ) = (h″_1(λ), h″_2(λ), ..., h″_p(λ))′
the curve's "acceleration vector," there must be some kind of control exercised on the curvature, ‖h″(λ)‖, in the search for a good curve. We'll note below where this control is implicitly applied in standard algorithms for producing principal curves for a dataset.
Returning for a moment to the case where we think of x as random, we'll say that h(λ) is a principal curve for the distribution of x if it satisfies a so-called self-consistency property, namely that
h(λ) = E[x | λ_h(x) = λ]
(each point on the curve is the mean of x conditioned on projecting to that point). A standard iterative algorithm for a dataset (with the x_i centered) begins with the first principal component direction v_1 and the initial curve
h^0(λ) = (λ − T/2) v_1
to create a unit-speed curve that extends past the dataset in both directions along the first principal component direction in ℜ^p. Then project the x_i onto the line to get N values
λ_i^1 = λ_{h^0}(x_i) = ⟨x_i, v_1⟩ + T/2
and in light of the criterion (191) more or less average the x_i with corresponding λ_{h^0}(x_i) near λ to get h^1(λ). A specific possible version of this is to consider, for each coordinate j, the N pairs
(λ_i^1, x_{ij})
case, and thin plate splines can replace 1-dimensional cubic smoothing splines for producing iterates of coordinate functions. But ideas of unit speed don't have obvious translations to ℜ^2, and methods here seem fundamentally more complicated than what is required for the 1-dimensional case.
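For the 1-dimensional case, the princurve package in R implements the alternating project-then-smooth algorithm just described. A minimal sketch (the function and component names below reflect my understanding of the current package interface and should be checked against its documentation):

# fitting a principal curve to a data matrix with the princurve package
library(princurve)
X <- as.matrix(scale(my_data, scale = FALSE))   # hypothetical centered N x p data
pc <- principal_curve(X)                        # alternate projection and scatterplot smoothing
head(pc$s)                                      # points h(lambda_h(x_i)) on the fitted curve
head(pc$lambda)                                 # arc-length projection indices lambda_h(x_i)
pc$dist                                         # total squared reconstruction error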
and define
c_j = \sum_{i=1}^N L_{ij} = the number of directed edges pointed away from node j
(There is the question of how we are going to define L_{jj}. We may either declare that there is an implicit edge pointed from each node j to itself and adopt the convention that L_{jj} = 1, or we may declare that all L_{jj} = 0.)
A node (Web page) might be more important if many other (particularly, important) nodes have edges (links) pointing to it. The Google PageRanks r_i > 0 are chosen to satisfy⁴⁷
r_i = (1 − d) + d \sum_j \frac{L_{ij}}{c_j} r_j     (192)
we’ll assume that r 0 1 = N so that the average rank is 1. Then, for
8
< 1 if c 6= 0
j
dj = c
: 0j if c = 0
j
r = (1 d) 1 + dLDr
1
= (1 d) 110 + dLD r
N
(using the assumption that r 0 1 = N ). Let
1
T = (1 d) 110 + dLD
N
so that r 0 T 0 = r 0 .
Note all entries of T are non-negative and that
0
1
T 01 = (1 d) 110 + dLD 1
N
1
= (1 d) 110 1 + dDL0 1
N 0 1
c1
B c2 C
B C
= (1 d) 1 + dD B . C
@ .. A
cN
as n → ∞.
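A tiny power-iteration sketch in R, using a hypothetical 4-page link matrix with L_ij = 1 when page j links to page i and the conventional d = .85, illustrates the computation:

# PageRank by iteration of r <- (1 - d)*1 + d*L*D*r for a small hypothetical link structure
L <- matrix(c(0,1,1,0,
              1,0,0,1,
              0,1,0,1,
              0,0,1,0), 4, 4, byrow = TRUE)   # L[i, j] = 1 if page j links to page i
N <- ncol(L); d <- 0.85
cj <- colSums(L)                               # numbers of outgoing links c_j
D <- diag(ifelse(cj != 0, 1 / cj, 0))
r <- rep(1, N)                                 # start from average rank 1
for (iter in 1:100) r <- (1 - d) * rep(1, N) + d * (L %*% D %*% r)
as.vector(r)                                   # approximate PageRanks (they average to 1)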
Part VI
Miscellanea
18 Graphs as Representing Independence Rela-
tionships in Multivariate Distributions
The most coherent approaches to statistical machine learning are ultimately
based on probability models for the generation of all of (x; y) and additionally
"all" other "relevant" unobserved/latent variables. Even for small N , such
multivariate distributions are in general impossibly complicated and impossible
to detail on the basis of a training set. Only by making simplifying assumptions
can progress be made. Graphs are often called upon to organize and represent
useful simplifying assumptions about conditional independence between various
of the variables to be jointly modeled, and their use is sometimes treated as an
important part of machine learning.
Random quantities X and Y are conditionally independent given Z, written
X ⊥ Y | Z
provided densities factor as
p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) · p_{Y|Z}(y | z)
Standard properties of conditional independence include
1. X ⊥ Y | Z ⇒ Y ⊥ X | Z,
2. X ⊥ Y | Z and U = h(X) ⇒ U ⊥ Y | Z,
3. X ⊥ Y | Z and U = h(X) ⇒ X ⊥ Y | Z, U,
A main goal of this material is representing aspects of large joint distributions
in ways that allow one to "see" conditional independence relationships in graphs
representing them and to construct correspondingly simple joint and conditional
densities for variables. In this section we will provide a brief introduction to
the simplest ideas in this enterprise. More on the topics can be found in
books by Murphy, by Wasserman, and by Lauritzen. We'll first consider what
relationships are typically represented using directed graphs, and then what
relationships are represented using undirected graphs.
In Figure 45, X is a parent of Y and an ancestor of W . There is a directed
path from X to W . Y is a child of both X and Z.
For a vector of random quantities (and vertices) X = (X_1, X_2, ..., X_k) and a distribution P for X, it is said that a DAG G represents P (or P is Markov to G) if and only if densities satisfy
p_X(x) = \prod_{i=1}^k p(x_i | parents_i)     (193)
where
parents_i = {parents of X_i in the DAG G}
So a joint distribution P for (X, Y, Z, W) is represented by the DAG pictured in Figure 45 if and only if
X ⊥ Z and W ⊥ (X, Z) | Y
are needed to specify p_Z(z), 2 values are needed to specify each of 9 different conditional pmfs p_{Y|X,Z}(y|x, z), and finally 2 values are needed to specify each of 3 conditional pmfs p_{W|Y}(w|y). That is, there are 2 + 2 + 2(9) + 2(3) = 28 probabilities to be specified under form (194), far fewer than in general.
None of this touches the obvious questions of what forms of DAG are appropriate (and why they are so) in particular applications and lead to effective methods of translating a training set into appropriate estimates for the factors p(x_i | parents_i) in the expression (193). The question of how to infer the factors of the product form from training data is particularly perplexing for models that include latent/hidden/unobserved nodes. Researchers who value this kind of modeling must obviously produce tractable and believable DAGs and corresponding forms for conditional distributions that lead to effective specialized fitting methods for the kinds of training data they expect to encounter.
4. a clique is a set of vertices of a graph that are all adjacent to each other,
and
5. a clique is maximal if it is not possible to add another vertex to it and
still have a clique.
Figure 46: An undirected graph.
A pairwise Markov graph for P can be made by considering only the \binom{k}{2} pairwise conditional independence questions. But as it turns out, many other conditional independence relationships can be read from it. That is, it turns out that if G is a pairwise Markov graph for P, then for non-overlapping sets of vertices A, B, and C and corresponding subvectors of X respectively X_A, X_B, and X_C,
C separates A and B ⇒ X_A ⊥ X_B | X_C     (196)
If, for example, Figure 47 is a pairwise Markov graph for a distribution P for X_1, X_2, ..., X_5, we may conclude from implication (196) that
(X_1, X_2, X_5) ⊥ (X_3, X_4) and X_2 ⊥ X_5 | X_1
so that separation on a pairwise Markov graph is equivalent to conditional independence.
An important question is "What forms are possible for densities when P is globally G Markov?" An answer is provided by the famous Hammersley-Clifford Theorem. This promises that if the joint pmf p_X(x) > 0 for all x and {C_1, C_2, ..., C_m} is the set of all maximal cliques for a pairwise Markov graph G associated with P, then
p_X(x) ∝ \prod_{i=1}^m ψ_i(x_{C_i})     (197)
for some functions ψ_i(·) > 0. A potentially more natural but less parsimonious representation is that (if again the joint pmf p_X(x) > 0 for all x)
p_X(x) ∝ \prod_{i<j \text{ such that } X_i \text{ and } X_j \text{ are adjacent in } G} ψ_{ij}(x_i, x_j)     (198)
there are edges only between nodes on different layers, not between nodes in
the same layer. One layer of nodes is called the "hidden layer" and the other
is called the "visible layer." Typically the nodes in the visible layer correspond
to (digital versions) of variables that are (at least at some cost) empirically
observable, while the variables corresponding to hidden nodes are completely
latent/unobservable and somehow represent some stochastic physical or mental
mechanism. In addition, it is convenient in some contexts to think of visible
nodes as being of two types, say belonging to a set V1 or a set V2 . For example,
in a prediction context, the nodes in V1 might encode "x"/inputs and the nodes
in V2 might encode y/outputs. We’ll use the naming conventions indicated in
Figure 48.
ψ_{ij}(h_i, v_j | θ) = \exp\left( \frac{α_i}{m+n} h_i + \frac{β_j}{l} v_j + γ_{ij} h_i v_j \right)
can be used in form (198) to produce a pmf for (h, v) for which Figure 48 provides a pairwise Markov graph. For this form
p(h, v | θ) ∝ \exp\left( \sum_{i=1}^{l} α_i h_i + \sum_{j=l+1}^{l+m+n} β_j v_j + \sum_{i=1}^{l}\sum_{j=l+1}^{l+m+n} γ_{ij} h_i v_j \right)
and thus
p(h, v | θ) = \frac{\exp\left( \sum_{i=1}^{l} α_i h_i + \sum_{j=l+1}^{l+m+n} β_j v_j + \sum_{i=1}^{l}\sum_{j=l+1}^{l+m+n} γ_{ij} h_i v_j \right)}{\sum_{(\tilde{h},\tilde{v})} \exp\left( \sum_{i=1}^{l} α_i \tilde{h}_i + \sum_{j=l+1}^{l+m+n} β_j \tilde{v}_j + \sum_{i=1}^{l}\sum_{j=l+1}^{l+m+n} γ_{ij} \tilde{h}_i \tilde{v}_j \right)}     (199)
Let the normalizing constant that is the denominator on the right of display (199) be called Z(θ), and note the obvious fact that for these models
ln(p(h, v | θ)) = \sum_{i=1}^{l} α_i h_i + \sum_{j=l+1}^{l+m+n} β_j v_j + \sum_{i=1}^{l}\sum_{j=l+1}^{l+m+n} γ_{ij} h_i v_j − ln(Z(θ))     (200)
The number of free parameters in θ is
l + m + n + l(n + m) = l + (l + 1)(n + m)
The first of these issues is well-recognized. The second and third seem far less well-appreciated and make these models often less than ideal for representing observed real variation in v.
What will be available as training data for an RBM is some set of (potentially incomplete) vectors of values for visible nodes, say v_i for i = 1, ..., N (that one will typically assume are independent and from some appropriate marginal distribution for visible vectors derived via summation from the overall joint distribution of values associated with all nodes, visible and hidden). Notice now that even in a hypothetical case where one has "data" consisting of complete (h, v) pairs, the existence of the unpleasant normalizing constant Z(θ) would typically make optimization of a likelihood \prod_{i=1}^N p(h_i, v_i | θ) or loglikelihood \sum_{i=1}^N ln p(h_i, v_i | θ) problematic. But the fact that one must sum out over (at least) all hidden nodes in order to get contributions to a likelihood or
loglikelihood makes the problem even more computationally difficult. That is, if an ith training case provides a complete visible vector v_i, the corresponding likelihood term is the marginal of that visible configuration
p(v_i | θ) = \sum_{\tilde{h}} p(\tilde{h}, v_i | θ)
And the computational situation becomes even more unpleasant if an ith training case provides, for example, only values for variables corresponding to nodes in V_1 (say v_{1i}), since the corresponding likelihood term is the marginal of only that visible configuration
p(v_{1i} | θ) = \sum_{\tilde{h} \text{ and } \tilde{v}_2} p(\tilde{h}, (v_{1i}, \tilde{v}_2) | θ)
Substantial effort in computer science circles has gone into the search for "learning" algorithms aimed at finding parameter vectors θ that produce large values of loglikelihoods based on N training cases (each term based on some marginal of p(h, v | θ) corresponding to a set of visible nodes). These seem to be mostly based on approximate stochastic gradient descent ideas and approximations to appropriate expectations based on short Gibbs sampling runs. Hinton's notion of "contrastive divergence" appears to be central to the most well known of these. Work of Kaplan et al. calls into question even the possibility of completely rational means of fitting Boltzmann machines by appeal to any standard statistical principles.
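To give the flavor of these algorithms, here is a minimal R sketch of a single "CD-1" contrastive divergence update for a small binary RBM, written in the common parameterization with hidden biases a, visible biases b, and interaction matrix W. This is only an illustration of the general idea (one Gibbs "reconstruction" step standing in for an intractable expectation), not a recommendation of the method.

# one CD-1 parameter update for a binary RBM (illustrative only)
sigmoid <- function(x) 1 / (1 + exp(-x))

cd1_update <- function(v, a, b, W, rate = 0.05) {
  # "positive" phase: hidden-node probabilities and a draw given the observed visible vector
  ph_v  <- sigmoid(a + W %*% v)                        # P(h_i = 1 | v)
  h     <- rbinom(length(a), 1, ph_v)
  # one Gibbs step: reconstruct visibles, then recompute hidden probabilities
  pv_h  <- sigmoid(b + t(W) %*% h)                     # P(v_j = 1 | h)
  v_rec <- rbinom(length(b), 1, pv_h)
  ph_v2 <- sigmoid(a + W %*% v_rec)
  # approximate gradient step: difference of "data" and "reconstruction" statistics
  W <- W + rate * (ph_v %*% t(v) - ph_v2 %*% t(v_rec))
  a <- as.vector(a + rate * (ph_v - ph_v2))
  b <- as.vector(b + rate * (v - v_rec))
  list(a = a, b = b, W = W)
}

Cycling such updates over the training v_i's is, roughly, what common RBM "learning" routines do.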
Beyond the fitting problem is the nature of many fitted Boltzmann Machines. These seem to typically seriously under-represent the kind of variability seen in the v_i's in training sets and have marginals p(v | θ) that are highly sensitive to small (one-coordinate) changes in v. For some theory in this direction, see again work of Kaplan et al. These issues seem to be real drawbacks to the use of RBMs in data modeling.
Some of what is termed "deep learning" seems to be based on the notion of
generalizing RBMs to more than a single hidden layer and potentially employs
visible layers on both the top and the bottom of an undirected graph. A cartoon
of a deep learning network is given in Figure 49. The fundamental feature of the
graph architecture here is that there are edges only between nodes in successive
layers.
If the problem of how to fit parameters for an RBM is a practically difficult (perhaps impossible) one, fitting a deep network model (based on some training cases consisting of values for variables associated with some of the visible nodes) is clearly going to be doubly problematic. What seems to be currently popular is some kind of "greedy"/"forward" sequential fitting of one set of parameters connecting two successive layers at a time, followed by generation of simulated values for variables corresponding to nodes in the "newest" layer and treating those as "data" for fitting a next set of parameters connecting two layers (and so on). But a deep learning network like that portrayed in Figure 49 compounds
Figure 49: A hypothetical "deep" generalization of an RBM.
the kinds of fitting issues raised for RBMs, and principled methods seem lacking. Further, degeneracy and instability issues like 2. and 3. above are also manifest.
If the fitting issues for a deep network were solvable for a given architecture and form of training data, some interesting possibilities for using a good "deep" model have been raised. For one, in a classification/prediction context, one might treat the bottom visible layer as associated with an input vector x, and the top layer as also visible and associated with an output y or ŷ. In theory, training cases could consist of complete (x, y) pairs, but they could also involve some incomplete cases, where y is missing, or part of x is missing, and so on. (Once the training issue is handled, simulation of the top layer variables from inputs of interest used as bottom layer values again enables classification/prediction.)
As an interesting second possibility for using deep structures, one could consider architectures where the top and bottom layers are both visible and encode essentially (or even exactly⁵⁰) the same information and between them are several hidden layers with one having very few nodes (like, say, two nodes). If one can fit such a thing to training cases consisting of sets of values corresponding to visible nodes, one can simulate (again via Gibbs sampling) for a fixed set of "visible values" corresponding values at the few nodes serving as the narrow layer of the network. The vector of estimated marginal probabilities of a latent value 1 at those nodes might then in turn serve as a kind of pair of "generalized principal component" values for the set of input visible values. (Notice that in
⁵⁰I am not altogether sure what the practical implications are of essentially turning the kind of "tower" in Figure 49 into a "band" or flat bracelet that connects back on itself, the top hidden layer having edges directly to the bottom visible layer, but I see no obvious prohibition of this architecture.
theory these are possible to compute even for incomplete sets of visible values.)
π N(0, B²) + (1 − π) δ_0   or   π U(−B, B) + (1 − π) δ_0
(for a mixing weight π ∈ (0, 1) and δ_0 a point mass at 0). This kind of prior distribution is obviously symmetric in the β_j and puts much of its prior probability on hyperplanes where some of the entries of β are 0.
⁵¹The support vector machine idea is an instance of this, where a relatively few "support vectors" of N possible training vectors are represented in formulas for optimal linear voting functions, because all others have coefficients that are "zeroed out" in the fitting process (this ultimately corresponds to putting parameter vectors on 0-coordinate hyperplanes in an (N + 1)-dimensional parameter space).
⁵²The fact that a posterior density is proportional to the product of the likelihood function and a prior density implies that the way to get posterior sparsity is to use prior distributions that put most of their mass on or near "simple sub-model" parts of the parameter space.
Corresponding posterior distributions based on a training set typically concentrate on and near those hyperplanes.
The lasso development of Section 3.1.2 and the notion that penalization in normal data models is strongly related to use of priors whose log densities are proportional to the penalty functions suggest another possibility. That is the use of priors for β that for a single λ > 0 and a single exponent 0 < r ≤ 1 make its entries iid, with each β_j having density proportional to
exp(−λ |β_j|^r)
(The r = 1 case is that of independent doubly exponential priors for the entries of β.) For large λ this symmetric prior again concentrates much of its mass near hyperplanes where some of the entries of β are 0.
A third sparsity-inducing prior for β employs a set of q additional hyperparameters τ_1, ..., τ_q and conditional on these makes the entries of β independent with β_j ∼ N(0, 1/τ_j²) (the variance is 1/τ_j²). Then using independent proper hyper-priors
τ_j ∼ Γ(a, b) for small positive a and b
or independent Jeffreys improper hyper-priors with
(1/2) ln τ_j ∼ U(ℜ)
or proper approximations to the Jeffreys priors like
(1/2) ln τ_j ∼ U(−B, B) for large B
often posteriors for many of the τ_j have large mass far from 0 and correspondingly encode large concentration of mass for many of the β_j near 0.
The third possibility above has some corresponding analytical results that can be used to approximate posteriors, but the most direct way of processing a training set and prior distribution to produce usable posterior results is through the use of standard Bayes MCMC software. In SEL prediction problems, one can for example combine a prior distribution for β with a likelihood derived from a model for independent
y_i ∼ N(β_0 + (h_1(x_i), ..., h_q(x_i))β, σ²)
(and appropriate priors for β_0 and σ²). In 2-class classification problems (with 0-1 coding) one can combine a prior distribution for β with a likelihood derived from a model for independent
y_i ∼ Bernoulli\left( \frac{1}{1 + \exp(−(β_0 + (h_1(x_i), ..., h_q(x_i))β))} \right)
(and an appropriate prior for β_0). In both cases, standard Bayes software can be used to identify a (typically sparse) high-posterior-density parameter vector β̂ and then a sensible SEL predictor
f̂(x) = β̂_0 + (h_1(x), ..., h_q(x)) β̂
in the first case and a sensible 0-1 loss classifier
f̂(x) = I[ β̂_0 + (h_1(x), ..., h_q(x)) β̂ > 0 ]
in the second.
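Since (as noted above) the r = 1 version of the second prior has log density proportional to the lasso penalty, a sparse high-posterior-density (MAP-style) β̂ for the logistic model can also be located without MCMC by l1-penalized fitting. A small R sketch using glmnet, with a hypothetical N × q basis matrix H of values h_j(x_i) and 0-1 labels y:

# sparse logistic fit (lasso penalty = MAP under independent double exponential priors)
library(glmnet)
H <- as.matrix(my_basis)                                     # hypothetical N x q matrix of h_j(x_i) values
y <- my_labels                                               # 0-1 responses
cvfit <- cv.glmnet(H, y, family = "binomial", alpha = 1)     # alpha = 1 gives the l1 penalty
beta_hat <- coef(cvfit, s = "lambda.1se")                    # sparse coefficient vector (many exact zeros)
# classify a new case with basis values h_new via I[intercept + h_new %*% beta > 0]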
(y, x′)′ ∼ \sum_{k=1}^K π_k\, MVN_{p+1}(μ_k, Σ_k)     (201)
then for p(x | μ_k, Σ_k) the kth (marginal) component density of x and E_k[y | x] the kth (linear) conditional mean function,
E[y | x] = \sum_{k=1}^K \left( \frac{π_k\, p(x | μ_k, Σ_k)}{\sum_{k=1}^K π_k\, p(x | μ_k, Σ_k)} \right) E_k[y | x]     (202)
taking values in {1, 2, ..., K} iid with marginal distribution specified by π. If conditioned on k_i each
(y_i, x_i′)′ ∼ MVN_{p+1}(μ_{k_i}, Σ_{k_i})
the training set has N observations that are iid according to the mixture distribution (201). Then for a prior distribution
π ∼ Dirichlet(α_1, ..., α_K)     (203)
MCMC algorithms for sampling from the posterior distribution of the parameters of the mixture and the latent variables are easy to find. Iterates of the parameter vectors from such an algorithm produce iterates of the functional form (202), and averaging across iterates can produce a workable predictor. This works by essentially picking out "the right" regions and linear functions where linearity of prediction is warranted and by identifying "the right" number of components (less than or equal to K) to assign appreciable entries in π. But it seems that there are three requirements for the forms g_1(·) and g_2(·) (the priors for the component means μ_k and covariance matrices Σ_k respectively) in order for this program to be effective. These are that 1) some kind of conjugacy is needed in order to make Gibbs sampling applicable and the method practical, 2) "locations" for g_1(·) need to be "right" and "scales" for g_2(·) need to be flexible, and 3) neither "flat"/uninformative nor very "sharp"/informative distributions work well for the forms g_1(·) and g_2(·).
For purposes of making an effective predictor (not for purposes of a philosophically "proper" Bayes analysis) it proves effective to employ a g_1(·) derived from the training set. One can effectively use a multivariate density estimate based on the observed vectors in the training set and a spherical normal kernel,
g_1(μ) = \frac{1}{N} \sum_{i=1}^N φ_{p+1}\!\left( μ \mid (y_i, x_i′)′, η² I \right)
(φ_{p+1}(· | m, V) denoting the MVN_{p+1}(m, V) density) for an appropriate bandwidth η. (One can simulate from this prior by picking a training case at random and adding to it a MVN_{p+1}(0, η²I) random perturbation.)
Further (still for purposes of making an effective predictor) it proves effective to employ for g_2(·) an equally weighted mixture of inverse Wishart densities with corresponding "means" ρ²I and minimum (namely p + 3) degrees of freedom, for ρ ∈ {.01, .02, ..., 1.00}. This mixture prior allows different scales for different components of form (201) and gives a group of training cases i with common k_i maximum effect on what values of Σ_{k_i} have large posterior probability (tending to make Σ_{k_i} look like the group sample covariance matrix).
The product of g_1(·) and g_2(·) is then a mixture of joint densities conjugate in the one-sample multivariate normal problem and is thus easy to handle in
Gibbs sampling. Gibbs updating of π is similarly easy using the k_i's, and Gibbs updates of those are easily handled because they are discrete. In all, the data-dependent "prior" distribution specified in displays (203) and (204) is computationally attractive and leads to effective SEL prediction.
x ∼ \sum_{k=1}^K π_k\, SBEF_p(θ_k, η_k)     (206)
on k_i each
x_i ∼ SBEF_p(θ_{k_i}, η_{k_i})
the training set has N observations that are iid according to the mixture distribution (206). Then for a prior distribution
π ∼ Dirichlet(α_1, ..., α_K)
where
s
2 p p+1 2 p+2 2p+2
1 (p + 1) + 2p (p + 2) (p + 1) +
g( ) /
(1 p+1 )2 (1 )
2
Part VII
Appendices
A Exercises
A.1 Section 1.2 Exercises
These are exercises intended to provide intuition that data in ℜ^p are necessarily "sparse." The realities are that ℜ^p is "huge" and for p at all large, "filling up" even a small part of it with data points is effectively impossible and our intuition about distributions in ℜ^p is very poor.
1. (6HW-11) Let Q_p(t) and q_p(t) be respectively the χ²_p cdf and pdf. Consider the MVN_p(0, I) distribution and Z_1, Z_2, ..., Z_N iid with this distribution. With
M = min{‖Z_i‖ : i = 1, 2, ..., N}
write out a one-dimensional integral involving Qp (t) and qp (t) giving EM .
Evaluate this mean for N = 100 and p = 1; 5; 10; and 20 either numerically
or using simulation.
2. (6HW-13) For each of p = 1, 5, 10, and 20, generate at least 1000 realizations of pairs of points x and z as iid uniform over the p-dimensional unit ball (the set of x with ‖x‖ ≤ 1). Compute (for each p) the sample average distance between x and z. (For Z ∼ MVN_p(0, I) independent of U ∼ U(0, 1), x = U^{1/p} Z/‖Z‖ is uniformly distributed in the unit ball in ℜ^p.)
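A short R sketch of this simulation, implementing the hint's construction directly:

# average distance between two points uniform in the p-dimensional unit ball
runif_ball <- function(p) {
  z <- rnorm(p)
  runif(1)^(1 / p) * z / sqrt(sum(z^2))      # U^(1/p) * Z/||Z|| is uniform in the ball
}
for (p in c(1, 5, 10, 20)) {
  d <- replicate(1000, sqrt(sum((runif_ball(p) - runif_ball(p))^2)))
  cat("p =", p, " average distance approx.", round(mean(d), 3), "\n")
}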
3. (5HW-14) For each of p = 10, 20, 50, 100, 500, and 1000, make n = 10,000 draws of distances between pairs of independent points uniform in the cube [0, 1]^p. Use these to make 95% confidence limits for the ratio

4. (5HW-14) For each of p = 10, 20, 50, make n = 10,000 random draws of N = 100 independent points uniform in the cube [0, 1]^p. Find, for each sample of 100 points, the distance from the first point drawn to the 5th closest point of the other 99. Use these to make 95% confidence limits for the ratio
(mean diameter of a 5-nearest neighbor neighborhood if N = 100) / (maximum distance between two points in the cube)

5. (5HW-14) What fraction of random draws uniform from the unit cube [0, 1]^p lie in the "middle part" of the cube [δ, 1 − δ]^p, for δ a small positive number?
The next 3 problems are based on nice ideas taken from Giraud’s
book.
6. (6HW-15) For p = 2, 10, 100, and 1000, draw samples of size N = 100 from the uniform distribution on [0, 1]^p. Then for every (x_i, x_j) pair with i < j in one of these samples, compute the Euclidean distance between the two points, ‖x_i − x_j‖. Make a histogram (one p at a time) of these \binom{100}{2} distances.
What do these suggest about how well "local" prediction methods (that rely only on data points (x_i, y_i) with x_i "near" x to make predictions about y at x) can be expected to work?
The p-dimensional volume of a ball of radius r in ℜ^p is
V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2 + 1)} r^p
and (as p → ∞)
\frac{V_p(r)}{(2\pi e r^2/p)^{p/2} (p\pi)^{-1/2}} \to 1
Then, if N points can be found with corresponding ε-balls covering the unit cube in ℜ^p, the total volume of those balls must be at least 1. That is,
N V_p(ε) ≥ 1
What then are approximate lower bounds on the number of points required to "fill up" [0, 1]^p to within ε for p = 20, 50, and 200, and ε = 1, .1, and .01? (Giraud notes that the p = 200 and ε = 1 lower bound is larger than the estimated number of particles in the universe.)
8. (6HW-15) Giraud points out that for large p, most of MVN_p(0, I) probability is "in the tails." For q_p(x) the MVN_p(0, I) pdf and 0 < δ < 1, let
B_p(δ) = {x | q_p(x) ≥ δ q_p(0)} = {x | ‖x‖² ≤ 2 ln(1/δ)}
b) The Zillow Kaggle game for predicting (positive) house prices used the loss function
L(ŷ, y) = (ln ŷ − ln y)² = (ln(ŷ/y))²
Identify the function of x, call it f(x), that based on a joint distribution P for (x, y) optimizes
E L(g(x), y)
over choices of function g(x).

h_1(x) = I[0 ≤ x ≤ 1/3], h_2(x) = I[1/3 < x ≤ 2/3], and h_3(x) = I[2/3 < x ≤ 1]
and the very small training set

Case (i)   1    2    3    4    5    6
y_i        0    4   10   12    6   10
x_i       .1   .3   .4   .6   .7   .9
4. (5E1-18) Use the same training set as in Problem 3 above and, without bothering to center y, find the 1-nn SEL predictor for y, say f̂_{1-nn}(x), and evaluate its LOOCV MSPE. (Specify values of the predictor for all x ∈ [0, 1] except where there are "ties.")
5. (5HW-18) Consider the Ames House Price dataset and possible predictors of Price. In particular, consider the p = 4 inputs Size, Fireplace, Basementbath, and Land. There are, of course, 2⁴ = 16 possible multiple linear regression predictors to be built from these features (including the one with no covariates employed). Use both LOOCV and repeated 8-fold cross-validation implemented through caret train() to compare these 16 predictors in terms of cross-validation root mean squared prediction errors.
For a 0-1 loss K = 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier f has f(x) = k (for k = 1, 2, 3, 4), first if π = (π_1, π_2, π_3, π_4) is (.25, .25, .25, .25) and then if it is (.2, .2, .3, .3).

0 if no x_i is in (.45, .55)
1/3 if one x_i is in (.45, .55)
2/3 if two x_i's are in (.45, .55)
1 if three or more x_i's are in (.45, .55)

What is the value of the bias of the nearest neighbor predictor at .5? Does this bias go to 0 as N gets big? Argue carefully one way or the other.
neighbor classifier based on these training cases.
10. (6HW-11) Consider SEL prediction. Suppose that in a very simple problem with p = 1, the distribution P for the random pair (x, y) is specified by
x ∼ U(0, 1) and y|x ∼ N(x², (1 + x))
((1 + x) is the conditional variance of the output). Further, consider two possible sets of functions S = {g} for use in creating predictors of y, namely
1. S_1 = {g | g(x) = a + bx for real numbers a, b}, and
2. S_2 = {g | g(x) = \sum_{j=1}^{10} a_j I[(j−1)/10 < x ≤ j/10] for real numbers a_j}
Training data are N pairs (x_i, y_i) iid P. Suppose that the fitting of elements of these sets is done by
1. OLS (simple linear regression) in the case of S_1, and
2. according to
â_j = ȳ if no x_i ∈ ((j−1)/10, j/10], and otherwise â_j = \frac{1}{\#\{x_i \in ((j-1)/10, j/10]\}} \sum_{i \text{ with } x_i \in ((j-1)/10, j/10]} y_i
in the case of S_2,
to produce predictors f̂_1 and f̂_2.
a) Find (analytically) the functions g* for the two cases. Use them to find the two expected squared model biases E_x(E[y|x] − g*(x))². How do these two compare?
b) For the second case, find an analytical form for E_T f̂_2 and then for the average squared fitting bias E_x(E_T f̂_2(x) − g*(x))². (Hints: What is the conditional distribution of the y_i given that no x_i ∈ ((j−1)/10, j/10]? What is the conditional mean of y given that x ∈ ((j−1)/10, j/10]?)
c) For the first case, simulate at least 1000 training datasets of size N = 100 and do OLS on each one to get corresponding f̂_1's. Average those to get an approximation for E_T f̂_1. (If you can do this analytically, so much the better!) Use this approximation and analytical calculation to find the average squared fitting bias E_x(E_T f̂_1(x) − g*(x))² for this case.
d) How do your answers for b) and c) compare for a training set of size N = 100?
e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances E_x Var_T f̂(x) for the two cases for training set size N = 100.
f) In sum, which of the two predictors here has the best value of Err for N = 100?
11. (6HW-11) Two files with respectively N = 100 and then N = 1000 pairs (x_i, y_i) generated according to P in Problem 10 above are provided with these notes. Use 10-fold cross validation to see which of the two predictors in Problem 10 looks most likely to be effective. (The datasets will not be sorted, so you may treat successively numbered groups of 1/10th of the training cases as your K = 10 randomly created pieces of the training set.)
a regression with the set of predictors
1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x

14. (5E1-14) Consider a joint pdf (for (x, y) ∈ (0, 1) × (0, ∞)) of the form
p(x, y) = \frac{1}{x^2} \exp\!\left(−\frac{y}{x^2}\right) for 0 < x < 1 and 0 < y
(x ∼ U(0, 1) and conditional on x, the variable y is exponential with mean x².)
a) Find the linear function of x (say α + βx) that minimizes E(y − (α + βx))². (The averaging is over the joint distribution of (x, y). Find the optimizing intercept and slope.)
b) Suppose that a training set consists of N data pairs (x_i, y_i) that are independent draws from the distribution specified above, and that least squares is used to fit a predictor f̂_N(x) = a_N + b_N x to the training data. Suppose that it's possible to argue that the least squares coefficients a_N and b_N converge (in a proper probabilistic sense) to your optimizers from a) as N → ∞. Then for large N, about what value of (SEL) training error do you expect to observe under this scenario?
3
2I[x > 3]. Find a linear combination of the best element of S you identified in a) and this best predictor available to the second learner that is better than either individual predictor.
16. (6HW-13) Using the datasets provided with these notes carry out the steps of Problems 10 and 11 above supposing that the distribution P for the random pair (x, y) is specified by

17. (6HW-15) Using the datasets provided with these notes carry out the steps of Problems 10 and 11 above supposing that the distribution P for the random pair (x, y) is specified by
x ∼ U(0, 1) and y|x ∼ N((3x − 1.5)², (3x − 1.5)² + .2)
(the Gaussian variance is (3x − 1.5)² + .2).
a) Under this model, what is the best element of S, say g*, for predicting y? Use this to find the average squared model bias in this problem.
b) Suppose that based on an iid sample of N points (x_i, y_i), fitting is done by least squares (and thus the predictor f̂(x) = ȳ is employed). What is the average squared fitting bias in this case?
c) What is the average prediction error, Err, when the predictor in b) is employed?
(that make one cut in the real numbers at c and classify one way to the left of
c and the other way to the right of c). Plot as functions of c the risks
for classifiers of the form I[x < c] and
for classifiers of the form I[x > c]. What is the best element of S (say, g*) and then what is the "modeling penalty" associated with using the class of predictors/classifiers S (the difference between the optimal error rate and the error rate for g*)?
d) Suppose that for a training set of size N = 100 (generated at random from the distribution described in the preamble of this problem), one will choose a cut point ĉ half way between two consecutive sorted x_i values minimizing
min[ #{y_i = 0 | x_i < c} + #{y_i = 1 | x_i > c},  #{y_i = 1 | x_i < c} + #{y_i = 0 | x_i > c} ]
Then, if
#{y_i = 0 | x_i < ĉ} + #{y_i = 1 | x_i > ĉ} ≤ #{y_i = 1 | x_i < ĉ} + #{y_i = 0 | x_i > ĉ}
one will employ the classifier f̂(x) = I[x < ĉ] and otherwise the classifier f̂(x) = I[x > ĉ]. Simulate 10,000 training samples and find corresponding classifiers f̂. For each f̂ compute a (conditional on the training sample) error rate (an average of two appropriate normal probabilities on half-infinite intervals bounded by ĉ) and average across the training samples. What is the "fitting penalty" for this procedure? Redo this exercise, using a training set of size N = 50. Is the fitting penalty larger than for N = 100?
20. (6HW-17) Consider the model of Problem 19 above, but change to the "−1 and 1" coding of classes/values of y.
a) Plot the function g minimizing E exp(−y g(x)) over all choices of real-valued g.
Suppose then that one wishes to approximate this minimizer from part a) with a function of the form β_0 + β_1(x − x̄) + β_2(x − x̄)² based on a training set. Your instructor will provide a training set of size N = 100 based on the model of this problem. Use it in what follows.
b) Use a numerical optimizing routine and identify values β̂_0, β̂_1, β̂_2 minimizing the empirical average loss
R(β_0, β_1, β_2) = \frac{1}{N} \sum_{i=1}^N \exp\!\left( −y_i\left( β_0 + β_1(x_i − x̄) + β_2(x_i − x̄)^2 \right) \right)
c) Now consider the penalized fitting problem where one chooses to optimize
R_λ(β_0, β_1, β_2) = \frac{1}{N} \sum_{i=1}^N \exp\!\left( −y_i\left( β_0 + β_1(x_i − x̄) + β_2(x_i − x̄)^2 \right) \right) + λ β_2^2
For several different values of λ > 0, plot on the same set of axes the optimizer from a), the function β_0 + β_1(x − x̄) + β_2(x − x̄)² optimizing R(β_0, β_1, β_2) from b), and the functions optimizing R_λ(β_0, β_1, β_2).
21. (5E2-14) At a particular input vector of interest in a SEL prediction problem, say x, the conditional mean of y|x is 3. Two different predictors, f̂_1(x) and f̂_2(x), have biases (across random selection of training sets of fixed size N) at this value of x that are respectively .1 and .5. The random vector of predictors at x (randomness coming from training set selection) has covariance matrix
Cov\begin{pmatrix} \hat{f}_1(x) \\ \hat{f}_2(x) \end{pmatrix} = \begin{pmatrix} 1 & .25 \\ .25 & 1 \end{pmatrix}
If one uses a linear combination of the two predictors
f̂_ensemble(x) = a f̂_1(x) + b f̂_2(x)
there are optimal values of the constants a and b in terms of minimizing the expected (across random selection of training sets) squared difference between f̂_ensemble(x) and 3 (the conditional mean of y|x). Write out and optimize an explicit function of a and b that (in theory) could be minimized in order to find these optimal constants.
and argue that these give optimal values for the constants.
c) Give an explicit expression for the expected loss of the optimal predictor of the form f_c(x). Note that together with the first answer this could give the modeling penalty here.
d) Give an explicit expression for the fitting penalty if, based on a training set of size N, the value c_l is estimated by
(where ȳ_l is the sample mean response for training cases with x_i in the interval corresponding to c_l).
23. (5HW-18) Consider a SEL prediction problem where p = 1, and the class of functions used for prediction is the set of linear functions
Suppose that in fact
a) Under this model, what is the best element of S, say g*, for predicting y? Use this to find the modeling penalty/average squared model bias in this problem.
b) What is the smallest possible expected loss here (the mean squared prediction error of the theoretically best predictor, f(x) = x + 2x²)?
Now consider the situation where N = 50 and simple linear regression (OLS) is used to choose an element of S based on a training set. Simulate a large number of training sets (at least 1000 of them) of this size according to the model here, using normal conditional distributions for y|x. For each simulated training set, find the simple linear regression slope and intercept and use these to estimate the mean vector and covariance matrix for the fitted regression coefficients (for this sample size and this model). Use the estimated mean and covariance as follows.
c) Estimate the linear function of x that is the difference between your answer to a) and the average linear function produced by SLR in this context. Find the expected square of this difference according to the U(0, 1) distribution of x. (This is an estimate of the expected squared fitting bias here.)
d) Using your estimated covariance matrix, approximate the function of x that is the variance (across training sets) of the value on the least squares line at x. Find the mean of this function according to the U(0, 1) distribution of x. (This is an estimate of the expected prediction variance.)
e) In light of c) and d), what is the (estimated by simulation) fitting penalty in this context? What then is an approximate value for Err?
24. (6HW-17) Consider the Ames house price dataset of Problem 5 above and the famous Wisconsin breast cancer dataset on the UCI Machine Learning Data Repository. The latter has 683 = 699 − 16 complete cases (16 cases are incomplete) with p = 9 numerical characteristics of biopsied tumors, 239 of which were malignant and 444 of which were benign. Use the train() function in the caret package in R and do the following.
a) Find a best k for k-nn SEL prediction of home selling price, first using repeated 8-fold cross-validation, and then LOO cross-validation. Be sure to use standardized inputs (even for the 0-1 indicators) and to re-standardize for each fold. Plot the cross-validation root mean squared prediction error as a function of k. How does the training root mean squared prediction error for the best k compare to the corresponding cross-validation root mean squared prediction error?
b) Find a best k for k-nn classification between benign and malignant cases based on 0-1 loss, first using repeated 10-fold cross-validation, and then LOO cross-validation. Be sure to use standardized inputs and to re-standardize for each fold. Plot the cross-validation classification error rate as a function of k.
How does the training error rate for the best k compare to the corresponding
cross-validation error rate?
25. (5E2-14) Below are class-conditional pmfs for a discrete predictor variable x in a K = 3 class 0-1 loss classification problem. Suppose that the probabilities of y = k for k = 1, 2, 3 are π_1 = .4, π_2 = .3, and π_3 = .3. For each value of x give the corresponding value of the optimal (Bayes) classifier f^opt.

y\x   1    2    3    4    5    6
1    .2   .1   .2   .1   .1   .3
2    .1   .1   .3   .3   .1   .1
3    .2   .1   .2   .2   .2   .1

26. (5E2-14) A training set of size N = 3000 produces counts of (x, y) pairs as in the table below. (Assume these represent a random sample of all cases.) For each value of x give the corresponding value of an approximately optimal 0-1 loss (Bayes) classifier f̂.

y\x    1     2     3     4     5     6
1     95   155   145   205   105   150
2    305   105   195   140   195   155
3    150   190   160   155   150   245
x 3 1 0 0 4
y 2 1 0 1 2
a) Write out an explicit form for the leave-one-out cross-validation mean squared prediction error for f̂_c(x) in this toy example. (This is a function of the real variable c, say CV(c).)
b) The value of c minimizing CV(c) in a) turns out to be ĉ = .9784. Show this. Why is CV(.9784) not a good indicator of the effectiveness of prediction methodology that in general employs form f̂_c with c chosen by optimizing CV(c)? How would you produce a reliable predictor of the performance of f̂_ĉ in this problem? (Explain clearly and completely.)
Under the usual setup where the N pairs in T are iid according to P independent of (x, y) ∼ P, consider P defined by a marginal distribution x ∼ U(1/2, 3/2) and conditional distributions y|x ∼ N(μx, σ²).
a) Show that
Err_1 = E(y − f̂_1(x))² = σ² + (13/12)(μ − 1)²
and that
Err_2 = E(y − f̂_2(x))² = σ² + 13σ²/(9N)
so that the first predictor is preferable to the second provided (μ − 1)²/4 < σ²/(3N), i.e. provided (μ − 1)²/σ² < 4/(3N) (a fact that is of no practical use to a statistical learner not in full possession of the model generating the training set!).
Consider LOOCV-guided choice between the two simple predictors for this problem. The LOOCVMSPE for f̂_1(x) is
CV_1 = \frac{1}{N}\sum_{i=1}^N (y_i − x_i)^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 − \frac{2}{N}\sum_{i=1}^N y_i x_i + \frac{1}{N}\sum_{i=1}^N x_i^2
Then for r̄_(i) = \frac{1}{N−1}\sum_{j≠i, j=1}^N r_j (with r_j = y_j/x_j), the LOOCVMSPE for f̂_2(x) is
CV_2 = \frac{1}{N}\sum_{i=1}^N (y_i − \bar{r}_{(i)} x_i)^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 − \frac{2}{N}\sum_{i=1}^N \bar{r}_{(i)} y_i x_i + \frac{1}{N}\sum_{i=1}^N \bar{r}_{(i)}^2 x_i^2
which is certainly NOT just min (Err1 ;Err2 ). Further, this prediction error
Errptw is NOT naively approximated by the cross-validation error of the winner
b) To demonstrate all this, generate 1000 simulated training sets of size N = 27 and an additional observation pair (x, y) for each of these, using σ = 2 for values of μ = 3/9, 5/9, 7/9, ..., 15/9. (This is 7 sets, one for each μ considered, of 1000 training sets, each of size N = 27.) For each training set, find f̃(x) and (y − f̃(x))² and average the squared differences across the 1000 sets in each group to produce a simulation-based estimate of Err_ptw for each value of μ. How do these averages compare to the values of min(Err_1, Err_2) for these cases? For each value of μ compare the distribution of the 1000 random values min(CV_1, CV_2) produced, to the approximate value of Err_ptw. Does the random variable min(CV_1, CV_2) appear to be a good estimator of Err_ptw? Does it appear to be biased, and if so, in what direction?
Does it appear to be biased, and if so, in what direction?
c) Should one wish to make an honest empirical assessment of the likely
performance of f~ (x), what can be done using LOOCV is this. For each "fold"
consisting of case i use the "remainder" consisting of the other N 1 cases
to compute a "remainder i version" of the pick-the-winner predictor, say f~(i) .
PN
That is, let r(i;j) = N 1 2 ri and de…ne
l6=i;l6=j;l=1
2 3
N
X N
X
f~(i) (x) = xI 42 r(i;j) 1 yj xj < 2
r(i;j) 1 x2j 5
j6=i;j=1 j6=i;j=1
2 3
N
X N
X
+ r(i) xI 42 r(i;j) 1 yj xj > 2
r(i;j) 1 x2j 5
j6=i;j=1 j6=i;j=1
and use f~(i) (xi ) in predicting yi . The appropriate LOOCV error is then
PN 2
CVptw = N1 yi f~(i) .
i=1
For the case of in part b) with the worst match between Errptw and the
distribution of the variable min (CV1 ; CV2 ), …nd the 1000 values of CVptw . Does
the random variable CVptw seem to be a better estimator of Errptw than the
naive min (CV1 ; CV2 )? Explain.
31. Consider the case of random variables C(i) for i ∈ I (some index set) and let C stand for the random vector/function with coordinates/entries C(i). Define the random variable
i* = arg min_{i∈I} C(i)
(a minimizer of the entries of C). (We'll assume enough regularity here that there are no issues in defining this variable or any of the probabilities or expectations used here.)
Suppose that of interest is the (non-random) vector/function EC, its (non-random) optimizer
i_opt = arg min_{i∈I} (EC(i))
and its minimum/optimum value EC(i_opt).
a) Why is it "obvious" that
EC(i*) ≤ EC(i_opt)?
b) Argue carefully that unless with probability 1 the non-random value i_opt is a minimizer of the random vector/function C,
EC(i*) < EC(i_opt)
c) Say what the line of thinking in this problem implies about cross-validation and a "pick-the-winner" prediction strategy. (Does it address the fact that almost always in predictive analytics contests, when final results based on prediction for new cases are revealed they are worse than what contestants expect for a test error?)
p(x|0) = I[−.5 < x < .5] and p(x|1) = 12x² I[−.5 < x < .5]
for appropriate constants a, b, and c), what (knowing the answer to a)) would be a good choice of t(x)? (Of course, one doesn't know the answer to a) when doing feature selection!)
c) What is the "minimum expected loss possible" part of Err in this problem?
d) Identify the best classification rule of the form g_c(x) = I[x > c]. (This is g* for S = {g_c}. This could be thought of as the 1-d version of a "best linear classification rule" here ... where linear classification is not so smart.) What is the "modeling penalty" part of Err in this situation?
e) Suggest a way that you might try to choose a classification rule g_ĉ based on a very large training sample of size N. Notice that a large training set would allow you to estimate cumulative conditional probabilities P[x ≤ c | y] by relative frequencies
\frac{\#\{\text{training cases with } x_i ≤ c \text{ and } y_i = y\}}{\#\{\text{training cases with } y_i = y\}}
2. (5E1-15) Consider two probability densities on the unit disk in ℜ² (i.e. on {(x_1, x_2) | x_1² + x_2² ≤ 1}),
p(x_1, x_2 | 1) = \frac{1}{\pi}  and  p(x_1, x_2 | 2) = \frac{3}{2\pi}\sqrt{1 − (x_1^2 + x_2^2)}
and a 2-class 0-1 loss classification problem with class probabilities π_1 = π_2 = .5.
a) Give a formula for a best-possible single feature T(x_1, x_2).
b) Give an explicit form for the theoretically optimal classifier in this problem.

p(x_1, x_2, x_3 | 1) = 2x_1, p(x_1, x_2, x_3 | 2) = 2x_2, and p(x_1, x_2, x_3 | 3) = 2x_3
a) Identify two real-valued features T_1(x) and T_2(x) that together provide complete summarizations of all information about the class label y ∈ {1, 2, 3} provided by x = (x_1, x_2, x_3).
b) For the case of π_1 = π_2 = π_3 = 1/3, give the form of an optimal 0-1 loss classifier in terms of the values t_1 and t_2 of T_1(x) and T_2(x).
c) For the case of π_1 = .6, π_2 = .4, and π_3 = 0, where L(ŷ, 1) = 10 I[ŷ ≠ 1] and otherwise L(ŷ, y) = I[ŷ ≠ y], give the form of an optimal classifier in terms of the value of x = (x_1, x_2, x_3).
5. (6E1-17) In Section 1.4.3 there is an assertion that for a finite set $B$, say $B=\{b_1,b_2,\ldots,b_m\}$, for $|A|$ the number of elements in $A\subset B$, one kernel function on subsets of $B$ is
$$K(A_1,A_2)=2^{|A_1\cap A_2|}$$
($B$ could, for example, be a list of attributes that an item might or might not possess.)
a) Prove that $K$ is a kernel function using the "kernel mechanics" facts. (Hint: You may find it useful to associate with each $A\subset B$ an $m$-dimensional vector of 0s and 1s, call it $\mathbf{x}_A\in\{0,1\}^m$, with $x_{Al}=1$ exactly when $b_l\in A$.)
b) Let $T(A)(\cdot)=K(A,\cdot)=2^{|A\cap\,\cdot\,|}$ map subsets of $B$ to real-valued functions of subsets of $B$. In the abstract space $\mathcal{A}$ (of real-valued functions of subsets of $B$) what is the distance between $T(A)$ and $T(B)$, $\|T(A)-T(B)\|_{\mathcal{A}}$?
For $N$ training "vectors" $(A_i,y_i)$ ($A_i\subset B$ and $y_i\in\Re$) consider the corresponding $N$ points in $\mathcal{A}\times\Re$, namely $(T(A_i),y_i)$ for $i=1,\ldots,N$. Define a $k$-neighborhood $N_k(V)$ of a point (function) $V\in\mathcal{A}$ to be a set of $k$ points (functions) $T(A_i)$ with smallest $\|T(A_i)-V\|_{\mathcal{A}}$.
c) Carefully describe a SEL $k$-nn predictor of $y$, $f(V)$, mapping elements $V$ of $\mathcal{A}$ to real numbers $\hat y$ in $\Re$. Then describe as completely as possible the corresponding predictor $f(T(A))$ mapping $A\subset B$ to $\hat y\in\Re$.
d) A more direct method of producing a kind of $k$-nn predictor of $y$ is to take account of the hint for part a) and, for subsets $A$ and $C$ of $B$, to associate $m$-vectors of 0s and 1s, respectively $\mathbf{x}_A$ and $\mathbf{x}_C$, and define a distance between sets $A$ and $C$ as the Euclidean distance between $\mathbf{x}_A$ and $\mathbf{x}_C$. This typically produces a different predictor than the one in part c). Argue this point by considering distances from $\mathbf{x}_A$ to $\mathbf{x}_C$ and from $\mathbf{x}_A$ to $\mathbf{x}_D$ in $\Re^m$ and from $T(A)$ to $T(C)$ and from $T(A)$ to $T(D)$ in the space $\mathcal{A}$ for cases with $|A|=10$, $|C|=4$, $|D|=5$, $|A\cap C|=2$, and $|A\cap D|=3$.
6. (6HW-13) For $a>0$, consider the function $K:\Re^2\times\Re^2\to\Re$ defined by
$$K(\mathbf{x},\mathbf{z})=\exp\left(-a\|\mathbf{x}-\mathbf{z}\|^2\right)$$
a) Use the facts about kernel functions in Section 1.4.3 to argue that $K$ is a kernel function. (Note that $\|\mathbf{x}-\mathbf{z}\|^2=\langle\mathbf{x},\mathbf{x}\rangle+\langle\mathbf{z},\mathbf{z}\rangle-2\langle\mathbf{x},\mathbf{z}\rangle$.)
b) Argue that there is a $\varphi:\Re^2\to\Re^{\infty}$ so that with (infinite-dimensional) feature vector $\varphi(\mathbf{x})$ the kernel function is a "regular $\Re^{\infty}$ inner product"
$$K(\mathbf{x},\mathbf{z})=\langle\varphi(\mathbf{x}),\varphi(\mathbf{z})\rangle_{\infty}=\sum_{l=1}^{\infty}\varphi_l(\mathbf{x})\,\varphi_l(\mathbf{z})$$
(You will want to consider the Taylor series expansion of the exponential function about 0 and coordinate functions of $\varphi$ that are multiples of all possible products of the form $x_1^px_2^q$ for non-negative integers $p$ and $q$. It is not necessary to find explicit forms for the multipliers, though that can probably be done. You do need to argue carefully, though, that such a representation is possible.)
8. (6E1-19) "Correlation functions" from time series and spatial modeling (and analysis of "computer experiments") are a source of reproducing kernels for use in machine learning. In a 1992 paper, Mitchell and Morris introduced the useful correlation function
$$\rho(d)=\begin{cases}1-6d^2+6\left|d\right|^3 & \text{if } \left|d\right|<.5\\ 2\left(1-\left|d\right|\right)^3 & \text{if } .5\leq\left|d\right|\leq 1\\ 0 & \text{if } \left|d\right|>1\end{cases}$$
(Interestingly, $\rho(d)$ is a natural cubic spline.) Here we will use it to make the reproducing kernel
$$K(\mathbf{x},\mathbf{z})\equiv\rho\left(\|\mathbf{x}-\mathbf{z}\|\right)$$
mapping $\Re^p\times\Re^p\to\Re$. For sake of concreteness, take $p=2$.
a) For the mapping from $\Re^2$ to the abstract function space $\mathcal{A}$ defined by the kernel, $T(\mathbf{x})(\cdot)\equiv K(\mathbf{x},\cdot)$, find numerical values for
$$\|T(\mathbf{x})\|_{\mathcal{A}}\quad\text{and}\quad\left\|T((0,0))+2T\left(\left(\tfrac12,\tfrac12\right)\right)-3T((1,1))\right\|_{\mathcal{A}}$$
$$\lambda_1\left(\beta_1^2+\beta_2^2\right)+\lambda_2\left\|\sum_{i=1}^{N}\alpha_iK(\mathbf{x}_i,\cdot)\right\|_{\mathcal{A}}^2$$
(that penalizes both the "size" of the linear part of the predictor and the "size" of the kernel-based correction to it). Develop (for fixed $\lambda_1$ and $\lambda_2$ and training set and using notation $\mathbf{K}$ for the Gram matrix) a quadratic function of the coefficients $\beta_1,\beta_2,\alpha_1,\alpha_2,\ldots,\alpha_N$ that you would optimize to produce a predictor.
a) Optimal classification for each of the 3 pairs of classes is linear classification based on the features $t_1$ and $t_2$. (Define the features and show the linear classification boundaries on axes like those below. Indicate the scales for the features.)
10. (5HW-16) Return to the context of Problem 13 of Section A.2 and the last/largest set of predictors. Center the $y$ vector to produce (say) $\mathbf{Y}^*$, remove the column of 1s from the $\mathbf{X}$ matrix (giving a $100\times 9$ matrix) and standardize the columns of the resulting matrix, to produce (say) $\mathbf{X}^*$.
a) If one somehow produces a coefficient vector for the centered and standardized version of the problem, so that
$$\hat y=\beta_1x_1+\beta_2x_2+\cdots+\beta_9x_9$$
11. (5HW-18) Consider a toy 3-class classification problem with conditional distributions $x\mid y$ that are $\mathrm{N}(0,1)$ for $y=1$, $\mathrm{N}\left(1,(.5)^2\right)$ (the standard deviation is .5) for $y=2$, and $\mathrm{N}(2,1)$ for $y=3$, and class probabilities that are $\pi_1=\pi_2=\pi_3=1/3$.
a) Plot the three functions
$$P[y=1\mid x],\ P[y=2\mid x],\ \text{and}\ P[y=3\mid x]$$
b) The exposition identifies an optimal pair of "features" for this 3-class problem. Plot those two features, say $t_1(x)$ and $t_2(x)$, on the same set of axes.
c) Show that the optimal 3-class 0-1 loss classifier for any set of class probabilities $\pi_1,\pi_2,$ and $\pi_3$ can be written as a function of the features from b).
12. (5E1-20) It is well-known that $K(z,x)=(1+xz)^2$ mapping $\Re^2\to\Re$ is a legitimate "kernel function."
a) Suppose that for the training data of Problem 28 in Section A.2, one determines to fit a predictor for $y$ of the form
$$\hat f(x)=\sum_{i=1}^{5}\alpha_iK(x,x_i)$$
A.4 Section 1.5 Exercises
1. (6E1-17) Consider the 2-class classification model with the coding $y\in\{-1,1\}$ and (for sake of concreteness) $x\in\Re^1$. For $g(x)$ a generic voting function we'll consider the classifier
$$g(x\mid\beta_0,\beta_1)=2\Phi\left(\beta_0+\beta_1x\right)-1$$
penalized by $\lambda\beta_1^2$ for a $\lambda>0$. ($\Phi$ is the standard normal cdf.) In as simple a form as possible, give two equations to be solved simultaneously to do this fitting.
d) Suppose that as a matter of fact the two class-conditional densities operating are
and that ultimately what is desired is a good ordering function $O(x)$, one that produces a small value of the "AUC" criterion. Do you expect the methodology of part c) to produce a function $g\left(x\mid\hat\beta_0,\hat\beta_1\right)$ that would be a good choice of $O(x)$? Explain carefully.
$$Eh\left(yg(x)\right)$$
   y   1  1  1  1  1  1  1  1
   x   1  2  3  4  5  6  7  8
267
Consider the production of a "voting function" of the form
8
X 2
gb (x) = bi exp c jx xi j
i=1
by choice of the 8 coe¢ cients bi (for some choice of c > 0) under the "function
loss" h2 (u) = exp ( u). (In the parlance of machine learning, the component
2
functions exp c jx xi j are data-dependent p = 1 "radial basis functions.")
In fact, consider "penalized" …tting.
a) One possible penalized …tting criterion is
8 8
1X X
exp ( yi gb (xi )) + b2i
8 i=1 i=1
for some > 0. For choices of c = :5 and c = 1 optimize this criterion for two
di¤erent values of > 0 and plot the four resulting voting functions on the same
set of axes. Choose (by trial and error) two values of that produce clearly
di¤erent optimizing functions. (optim in R or some other canned routine will
be adequate to do this 8-d optimization.)
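A minimal R sketch of the kind of optim-based fitting part a) has in mind is below. It is not the notes' own code; the $y$ coding and the values of $c$ and $\lambda$ are placeholders, and the actual toy data should be substituted.

```r
# Penalized exponential-loss fit of a radial-basis voting function via optim().
x <- 1:8
y <- c(1, 1, 1, -1, -1, 1, 1, 1)        # placeholder +/-1 coding; use the actual data
gb <- function(b, x0, cc) sapply(x0, function(u) sum(b * exp(-cc * (u - x)^2)))
crit <- function(b, cc, lambda) mean(exp(-y * gb(b, x, cc))) + lambda * sum(b^2)
fit <- optim(rep(0, 8), function(b) crit(b, cc = 0.5, lambda = 0.1), method = "BFGS")
xs <- seq(0, 9, by = 0.01)
plot(xs, gb(fit$par, xs, 0.5), type = "l", xlab = "x", ylab = "voting function")
```

Re-running this for the other $(c,\lambda)$ combinations and overlaying the curves gives the comparison the exercise asks for.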
b) The function $K(x,z)=\exp\left(-c\left|x-z\right|^2\right)$ is a "kernel function" in the sense of Section 1.4.3. That implies that the $8\times 8$ Gram matrix
$$\mathbf{K}=\left(K(x_i,x_j)\right)_{\substack{i=1,2,\ldots,8\\ j=1,2,\ldots,8}}$$
is non-negative definite. A second possible penalized fitting criterion replaces $\sum_{i=1}^{8}b_i^2$ in part a) with $\mathbf{b}'\mathbf{K}\mathbf{b}$. For the same values of $c$ and $\lambda$ you used in part a), redo the optimization using this second penalization criterion and plot the resulting voting functions. Notice, by the way, that the penalty in a) is a $c\to\infty$ limit of this second penalty!
c) As indicated in Section 1.4.3, the mapping $T(x)=K(x,\cdot)$ from $\Re^1$ to functions $\Re^1\to\Re^1$ picks out $N=8$ functions that are essentially normal pdfs. Linear combinations of these form a linear subspace of this function space. Further, there is a valid inner product that can be defined on this subspace, for which
$$\langle T(x),T(z)\rangle_{\mathcal{A}}=K(x,z)$$
Using this inner product,
what is the inner product of two elements of this subspace, say $\hat g_{\mathbf{b}}(x)$ and $\hat g_{\mathbf{d}}(x)$ (for coefficient vectors $\mathbf{b}$ and $\mathbf{d}$)?
what is the distance between $T(x)$ and $T(z)$,
$$\|T(x)-T(z)\|_{\mathcal{A}}=\langle T(x)-T(z),T(x)-T(z)\rangle_{\mathcal{A}}^{1/2}\ ?$$
how is the penalty in b) related to $\|\hat g_{\mathbf{b}}\|_{\mathcal{A}}$ (the norm of the linear combination of functions in the function space)?
   x           1    2    3    4    5    6    7    8    9    10
   p(x|y=1)   .04  .07  .06  .03  .24   0   .02  .09  .25   .2
   p(x|y=0)   .1   .1   .1   .1   .1   .1   .1   .1   .1    .1
a) If $P[y=1]=2/3$ what is the optimal classifier here and what is its error rate (for 0-1 loss)?
b) If one cannot observe $x$ completely, but only
$$x^*=\begin{cases}2 & \text{if } x \text{ is } 1 \text{ or } 2\\ 4 & \text{if } x \text{ is } 3 \text{ or } 4\\ 6 & \text{if } x \text{ is } 5 \text{ or } 6\\ 8 & \text{if } x \text{ is } 7 \text{ or } 8\\ 10 & \text{if } x \text{ is } 9 \text{ or } 10\end{cases}$$
instead, what is the optimal classifier and what is its error rate (again assuming that $P[y=1]=2/3$ and using 0-1 loss)?
5. (6E1-19) Return to the situation of Problem 9 of Section A.2. For this toy dataset the 2 classes are balanced, and a 3-nearest-neighbor neighborhood has a fraction of "class 1" cases $0,\frac13,\frac23,$ or $1$. Suppose that 3-nn results from this training set will be used to produce a 0-1 loss classifier for a scenario in which (there is severe class imbalance and) the actual probabilities of classes are $\pi_0=.1$ and $\pi_1=.9$. Find (and carefully argue that it is correct) the classification appropriate for an $x$ for which the 3-nearest-neighbor neighborhood has fraction $\frac13$ of "class 1" cases.
a) Why do you know that $b_1(g)\leq b_2(g)$? Under what circumstances will it be the case that $b_1(g)<b_2(g)$?
b) If 1) $g^*$ minimizes $b_1(g)$ over choices of $g$, 2) $g^{**}$ minimizes $b_2(g)$ over choices of $g$, and 3) in fact your conditions in a) are met to imply that $b_1\left(g^*\right)<b_2\left(g^{**}\right)$, does it necessarily follow that $g^*$ is a strictly better voting function (produces a better error rate) than $g^{**}$ for the original 0-1 loss classification problem? Explain why or why not.
$$P\left[O(\mathbf{x})<O(\mathbf{x}')\right]$$
b) Find the $p\left((x_1,x_2)\mid k\right)$ conditional densities for $x_2\mid x_1$. Note that based on these and the marginals in part a) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the $x_1$ marginal and then using the inverse probability transform to simulate from the conditional of $x_2\mid x_1$. (It's also easy to use a rejection algorithm based on $(x_1,x_2)$ pairs uniform on $(0,1)^2$.)
c) Generate 2 datasets consisting of multiple independent pairs $(\mathbf{x},y)$ where $y$ is uniform on $\{1,2,3,4\}$ and, conditioned on $y=k$, the variable $\mathbf{x}$ has density $p\left((x_1,x_2)\mid k\right)$. Make first a small training set with $N=400$ pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the (conditional on the training set) error rates of the optimal rule from Problem 7 Section A.2 and then the "naïve" rule from part a).
d) Based on the $N=400$ training set from c), for several different numbers of neighbors (say 1, 3, 5, 10) make a plot like that required in part c) showing the regions where the nearest neighbor classifier classifies to each of the 4 classes. Then evaluate the (conditional on the small training sets) test error rates for the nearest neighbor rules.
e) Based on the training set, one can make estimates of the 2-d densities as
$$\hat p\left(\mathbf{x}\mid k\right)=\frac{1}{\#\left[i \text{ with } y_i=k\right]}\sum_{i \text{ with } y_i=k}h\left(\mathbf{x}\mid\mathbf{x}_i,\sigma^2\right)$$
for $h\left(\cdot\mid\boldsymbol{\mu},\sigma^2\right)$ the bivariate normal density with mean vector $\boldsymbol{\mu}$ and covariance matrix $\sigma^2\mathbf{I}$. (Try perhaps $\sigma=.1$.) Using these estimates and the relative frequencies of the possible values of $y$ in the training set
$$\hat\pi_k=\frac{\#\left[i \text{ with } y_i=k\right]}{N}$$
an approximation of the optimal classifier is
$$\hat f(\mathbf{x})=\arg\max_k\hat\pi_k\,\hat p\left(\mathbf{x}\mid k\right)=\arg\max_k\sum_{i \text{ with } y_i=k}h\left(\mathbf{x}\mid\mathbf{x}_i,\sigma^2\right)$$
Make a plot like that required in part a) showing the regions where this classifies to each of the 4 classes. Then evaluate the (conditional on the training set) test error rate for this classifier.
Use the Gaussian kernel function $K(\mathbf{x},\mathbf{z})=\exp\left(-\|\mathbf{x}-\mathbf{z}\|^2\right)$ in what follows. ($\|\cdot\|$ is the usual $\Re^p$ norm.)
a) For an input vector $\mathbf{x}_i\in\Re^2$, what is the norm of $T(\mathbf{x}_i)$ in the abstract space?
b) For input vectors $\mathbf{x}_i\in\Re^2$ and $\mathbf{x}_l\in\Re^2$, how is the distance between $T(\mathbf{x}_i)$ and $T(\mathbf{x}_l)$ in the abstract space related to the distance between $\mathbf{x}_i$ and $\mathbf{x}_l$ in $\Re^p$?
2. (6E1-11) Consider the $p$-dimensional input space $\Re^p$ and kernel functions mapping $\Re^p\times\Re^p\to\Re$.
a) Show that for $\phi:\Re^p\to\Re$, the function $K(\mathbf{x},\mathbf{z})=\phi(\mathbf{x})\,\phi(\mathbf{z})$ is a valid kernel. (You must show that for distinct $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N$, the $N\times N$ matrix $\mathbf{K}=\left(K(\mathbf{x}_i,\mathbf{x}_j)\right)$ is non-negative definite.)
b) Show that for two kernels $K_1(\mathbf{x},\mathbf{z})$ and $K_2(\mathbf{x},\mathbf{z})$ and two positive constants $c_1$ and $c_2$, the function $c_1K_1(\mathbf{x},\mathbf{z})+c_2K_2(\mathbf{x},\mathbf{z})$ is a kernel.
c) By virtue of a) and b), the functions $K_1(x,z)=1+xz$ and $K_2(x,z)=1+2xz$ are both kernels on $[-1,1]^2$. They produce inner product spaces of functions. Show these are different.
3. (6E1-15) Consider the small space of functions on $[-1,1]^2$ that are linear combinations of the 4 functions $1,x_1,x_2,$ and $x_1x_2$, with inner product defined by $\langle h,g\rangle=\iint_{[-1,1]^2}h(x_1,x_2)\,g(x_1,x_2)\,dx_1dx_2$. Find the element of this space closest to $h(x_1,x_2)=x_1^2+x_2^2$ (in the $L_2\left([-1,1]^2\right)$ function space norm $\|g\|\equiv\langle g,g\rangle^{1/2}$). (Note that the functions $1,x_1,x_2,$ and $x_1x_2$ are orthogonal with this inner product.)
3. (6HW-13) Consider the linear space of functions on $[0,1]^2$ of the form
5. (6E2-13) Consider the function $K\left((x,y),(u,v)\right)$ mapping $[-1,1]^2\times[-1,1]^2$ to $\Re$ defined by
$$K\left((x,y),(u,v)\right)=\left(1+xu+yv\right)^2+\exp\left(-\left(x-u\right)^2-\left(y-v\right)^2\right)$$
on its domain.
a) Argue carefully that $K$ is a legitimate "kernel" function.
b) Pick any two linearly independent elements of the space of functions that are linear combinations of "slices" of the kernel, $K\left((x,y),\cdot\right)$ for an $(x,y)\in[-1,1]^2$, and find an orthonormal basis for the 2-dimensional linear sub-space they span.
mapping $\Re^2\times\Re^2\to\Re$ is a kernel function. Consider three real-valued functions (of $\mathbf{z}\in\Re^2$):
Using the inner product for the linear space of functions mapping $\Re^2\to\Re$ defined for kernel slices by $\langle T(\mathbf{x}),T(\mathbf{w})\rangle_{\mathcal{A}}=K(\mathbf{x},\mathbf{w})$, find the projection of $T((0,0))$ onto the subspace of functions spanned by the two functions $T((1,0))$ and $T((0,1))$ (i.e. the set of all linear combinations $c\,T((1,0))+d\,T((0,1))$ for constants $c$ and $d$).
7. (5HW-18) Below is a small fake dataset with $p=2$ and $N=8$.
   x1   x2     y
    1    0   2.03
    0    1    .56
    1    0   2.21
    0    1   1.46
    2    2   5.78
    1    1    .72
    2    2   6.46
    1    1   1.37
First center the $y$ values and standardize both $x_1$ and $x_2$. (We will abuse notation and use $\mathbf{x}$ and $\mathbf{z}$ to stand for standardized versions of input vectors.) Make use of the kernel function $K(\mathbf{x},\mathbf{z})=\exp\left(-\|\mathbf{x}-\mathbf{z}\|^2\right)$ and the mapping $T(\mathbf{x})=K(\mathbf{x},\cdot)$ that associates with input vector $\mathbf{x}\in\Re^2$ the function $K(\mathbf{x},\cdot):\Re^2\to\Re$ (an abstract "feature"). In the (very high-dimensional) space of functions mapping $\Re^2\to\Re$, the $N=8$ training set generates an 8-d subspace of functions consisting of all linear combinations of the $T(\mathbf{x}_i)$. Two possible inner products in that subspace are the "$L_2$" inner product
$$\langle g,h\rangle_{L_2}=\iint_{\Re^2}g(\mathbf{x})\,h(\mathbf{x})\,d\mathbf{x}$$
and the kernel-based inner product defined by $\langle T(\mathbf{x}),T(\mathbf{z})\rangle_{\mathcal{A}}=K(\mathbf{x},\mathbf{z})$.
Apply the first 3 steps of the Gram-Schmidt process to the abstract features of the training data (considered in the order given in the data table) to identify 3 orthonormal functions $\Re^2\to\Re$ that are linear combinations of $T(\mathbf{x}_1),T(\mathbf{x}_2),T(\mathbf{x}_3)$. Do this first using the $L_2$ inner product, and then using the kernel-based inner product. Are the two sets of 3 functions the same?
b) Use the singular value decomposition of $\mathbf{X}$ to find the eigen (spectral) decompositions of $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}\mathbf{X}'$ (what are eigenvalues and eigenvectors?).
c) Find the best rank $=1$ and rank $=2$ approximations to $\mathbf{X}$.
2. (6HW-11) Carry out the steps of Problem 1 above using the matrix
$$\mathbf{X}=\begin{bmatrix}1&1&1\\2&1&1\\1&2&1\\2&2&1\end{bmatrix}$$
3. (6E2-15) Here is some simple R code and output for a small N = 5 and
p = 4 dataset.
>X
[,1] [,2] [,3] [,4]
[1,] 0.4 2 -0.5 0
[2,] -0.1 0 -0.3 1
[3,] 0.4 0 -0.1 0
[4,] 0.4 0 0.0 -1
[5,] 0.1 2 0.7 0
>
>svd(X)
$d
[1] 2.8551157 1.4762267 0.9397253 0.3549439
$u
[,1] [,2] [,3] [,4]
[1,] 0.70256076 0.06562895 0.6458656 -0.2618499
[2,] -0.01458943 0.69768837 0.1798028 0.2661822
[3,] 0.01628552 -0.05282808 0.2689008 0.8815301
[4,] 0.02268773 -0.71093125 0.2403923 0.1625961
[5,] 0.71092586 -0.02664090 -0.6484076 0.2388488
$v
[,1] [,2] [,3] [,4]
[1,] 0.12929953 -0.23823242 0.403567340 0.8738766
[2,] 0.99014314 0.05282123 -0.005410155 -0.1296041
[3,] 0.05222766 -0.17306746 -0.912659300 0.3665691
[4,] -0.01305627 0.95420275 -0.064475843 0.2918382
A.9 Section 2.4 Exercises
1. (6HW-15) Center the columns of $\mathbf{X}$ from Problem 1 of Section A.8 to make the centered data matrix $\widetilde{\mathbf{X}}$.
a) Find the singular value decomposition of $\widetilde{\mathbf{X}}$. What are the principal component directions and principal components for the data matrix? What are the "loadings" of the first principal component?
b) Find the best rank $=1$ and rank $=2$ approximations to $\widetilde{\mathbf{X}}$.
c) Find the eigen decomposition of the sample covariance matrix $\frac15\widetilde{\mathbf{X}}'\widetilde{\mathbf{X}}$. Find best 1- and 2-component approximations to this covariance matrix.
d) Now standardize the columns of $\widetilde{\mathbf{X}}$ to make the matrix $\widetilde{\widetilde{\mathbf{X}}}$. Repeat parts a), b), and c) using this matrix $\widetilde{\widetilde{\mathbf{X}}}$.
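A minimal R sketch of these computations is below; the matrix used here is only a placeholder standing in for the $\mathbf{X}$ of Problem 1 of Section A.8.

```r
# PCA "by SVD" of a centered data matrix (placeholder data).
X  <- matrix(c(1, 2, 3, 4,  2, 1, 0, 1,  0, 1, 1, 2), ncol = 3)
Xc <- scale(X, center = TRUE, scale = FALSE)      # centered data matrix
sv <- svd(Xc)
sv$v                                              # principal component directions (loadings)
Xc %*% sv$v                                       # principal component scores
# best rank-1 approximation to the centered matrix
X1 <- sv$d[1] * sv$u[, 1, drop = FALSE] %*% t(sv$v[, 1, drop = FALSE])
# eigen decomposition of the (N-divisor) sample covariance matrix agrees with the SVD
eigen(t(Xc) %*% Xc / nrow(Xc))
```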
2. (6HW-11) Carry out the steps of Problem 1 above using the matrix X
from Problem 2 of Section A.8.
Are the eigenvectors of the sample covariance matrix related to the principal component directions of the (centered) data matrix? If so, how? Are the eigenvalues/singular values of the sample covariance matrix related to the singular values of the (centered) data matrix? If so, how?
are legitimate kernel functions for a choice of positive constant and positive integer $d$. Find the first two kernel principal component vectors for $\mathbf{X}$ in Problem 3 above for each of the cases.
If there is anything to interpret (and there may not be), give interpretations of the pairs of principal component vectors for each of the 4 cases. (Be sure to use the vectors for "centered versions" of the function space principal component "direction vectors"/functions.)
For $d=2$ consider the $c=1$ and $c=2$ cases of this construction for $p=2$.
b) Describe the sets of functions mapping $\Re^2\to\Re$ that comprise the abstract linear spaces associated with the reproducing kernels. What is the dimension of these spaces?
c) Identify for each case a transform $T:\Re^2\to\Re^M$ so that
   x1   x2
    1    0
    0    1
    1    0
    0    1
    2    2
    1    1
    2    2
    1    1
f) Note that the fake dataset of part e) is centered in $\Re^2$. Find ordinary principal component direction vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ and corresponding 8-dimensional vectors of principal component scores for the dataset. Then find the first two kernel principal component vectors corresponding to the $c=1$ case of $K$.
$\exp\left(-3\|\mathbf{x}-\mathbf{z}\|^2\right)$. The data and Gram matrices are
$$\mathbf{X}=\begin{pmatrix}1.01&.99\\ .99&1.01\\ .01&.01\\ 0&0\\ -.01&-.01\\ -2.00&-2.00\end{pmatrix}\approx\begin{pmatrix}1&1\\ 1&1\\ 0&0\\ 0&0\\ 0&0\\ -2&-2\end{pmatrix}$$
$$\text{and}\quad\mathbf{K}=\begin{pmatrix}1&.998&.003&.003&.003&.000\\ .998&1&.003&.003&.003&.000\\ .003&.003&1&.999&.998&.000\\ .003&.003&.999&1&.999&.000\\ .003&.003&.998&.999&1&.000\\ .000&.000&.000&.000&.000&1\end{pmatrix}\approx\begin{pmatrix}1&1&0&0&0&0\\ 1&1&0&0&0&0\\ 0&0&1&1&1&0\\ 0&0&1&1&1&0\\ 0&0&1&1&1&0\\ 0&0&0&0&0&1\end{pmatrix}$$
Say what both principal components analysis on the raw data and kernel principal components indicate about these data.
8. (6E1-19) Let
$$\mathbf{X}=\begin{bmatrix}15&5&1\\ 15&5&1\\ 5&15&1\\ 5&15&1\\ 5&15&1\\ 5&15&1\\ 15&5&1\\ 15&5&1\end{bmatrix}\qquad \mathbf{U}=\frac{1}{\sqrt 8}\begin{bmatrix}1&1&1\\ 1&1&1\\ 1&1&1\\ 1&1&1\\ 1&1&1\\ 1&1&1\\ 1&1&1\\ 1&1&1\end{bmatrix}$$
$$\mathbf{D}=\operatorname{diag}\left(40,\ 20,\ 2\sqrt2\right)\quad\text{and}\quad \mathbf{V}=\begin{bmatrix}\frac{1}{\sqrt2}&\frac{1}{\sqrt2}&0\\ \frac{1}{\sqrt2}&-\frac{1}{\sqrt2}&0\\ 0&0&1\end{bmatrix}$$
$\mathbf{U},\mathbf{D},$ and $\mathbf{V}$ are the elements of the SVD for $\mathbf{X}$. Use this to answer the following.
a) Find the best rank $=1$ approximation to the matrix $\mathbf{X}$.
b) Identify a $(3\times 1)$ unit vector $\mathbf{w}$ such that the 8 row vectors in $\mathbf{X}$ lie "nearly on" a plane in $\Re^3$ perpendicular to $\mathbf{w}$.
c) Give the eigen decomposition of the (8-divisor) sample covariance matrix of a $p=3$ dataset with cases given by the rows of $\mathbf{X}$. (Give the 3 eigenvalues and corresponding eigenvectors.)
10. (5HW-18) Return to the context of Problem 7 of Section A.7. Note that the function $M_T=\frac18\sum_{i=1}^{8}T(\mathbf{x}_i)$ is a linear combination of (is in the subspace of functions generated by) the $T(\mathbf{x}_i)$. It makes sense to "center" the abstract features generated by the training set, replacing each $T(\mathbf{x}_i)$ with
$$S(\mathbf{x}_i)=T(\mathbf{x}_i)-M_T$$
a) Compute the matrix
$$\mathbf{C}=\left(\langle S(\mathbf{x}_i),S(\mathbf{x}_j)\rangle_{\mathcal{A}}\right)_{\substack{i=1,2,\ldots,8\\ j=1,2,\ldots,8}}$$
that is the "centered Gram matrix" for kernel PCA in displays (48) and (49).
c) Do an eigen analysis for the matrix $\mathbf{C}$. (For Euclidean features, this matrix would be a multiple of a sample covariance matrix.) The eigenvectors of this matrix give kernel principal component scores for the dataset. Consider the first and second of these. To the extent possible, provide interpretations for them.
d) Find the projection of the function $S((.5,.5))$ onto the span of $\{T(\mathbf{x}_i)\}_{i=1,\ldots,8}$ in $\mathcal{A}$ and compare contour plots for the function and its projection.
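A minimal R sketch of the usual double-centering of a Gram matrix and the eigen analysis for kernel PCA follows (it is not displays (48)-(49) verbatim, and it assumes the standardized $8\times 2$ input matrix from Problem 7 of Section A.7 is already available as Xs).

```r
# Centered Gram matrix <S(x_i), S(x_j)> and kernel principal component scores.
K <- exp(-as.matrix(dist(Xs))^2)             # Gram matrix for K(x,z) = exp(-||x-z||^2)
N <- nrow(K)
J <- matrix(1 / N, N, N)
C <- (diag(N) - J) %*% K %*% (diag(N) - J)   # double-centered Gram matrix
ev <- eigen(C)
ev$values                                    # kernel "variances"
scores <- ev$vectors[, 1:2]                  # first two kernel principal component vectors
```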
Center the $y$ values and standardize $x$. (We will abuse notation and use $x$ and $z$ to stand for standardized versions of input values.) This question will make use of the kernel function $K(x,z)=\exp\left(-.5(x-z)^2\right)$ and the mapping $T(x)(\cdot)=K(x,\cdot)$ that associates with input value $x\in\Re$ the function $K(x,\cdot):\Re\to\Re$ (an abstract "feature"). In the (very high-dimensional) space of functions mapping $\Re\to\Re$, the $N=11$ training set generates an 11-d subspace of functions consisting of all linear combinations of the $T(x_i)$. As in Problem 10 above set $M_T=\frac{1}{11}\sum_{i=1}^{11}T(x_i)$ and define $S(x_i)=T(x_i)-M_T$.
a) Compute the matrix
$$\mathbf{C}=\left(\langle S(x_i),S(x_j)\rangle_{\mathcal{A}}\right)_{i,j=1,\ldots,11}$$
that is the "centered Gram matrix" for kernel PCA in displays (48) and (49).
c) Do an eigen analysis for the matrix $\mathbf{C}$. (For Euclidean features, this matrix would be a multiple of a sample covariance matrix.) The eigenvectors of this matrix give kernel principal component scores for the dataset. Consider the first and second of these. To the extent possible, provide interpretations for them.
d) Find the projection of the function $S(.65)$ onto the span of $\{T(x_i)\}_{i=1,\ldots,11}$ in $\mathcal{A}$ and plot the function and its projection on the same set of axes.
In answering the following, use the notation that the $j$th column of $\mathbf{X}$ is $\mathbf{x}_j$.
a) Find the fitted OLS coefficient vector $\hat{\boldsymbol\beta}^{\mathrm{ols}}$.
b) For $\lambda=10$ find the vector $\mathbf{c}\in\Re^3$ minimizing
$$\left(\mathbf{Y}-\mathbf{X}\operatorname{diag}(\mathbf{c})\hat{\boldsymbol\beta}^{\mathrm{ols}}\right)'\left(\mathbf{Y}-\mathbf{X}\operatorname{diag}(\mathbf{c})\hat{\boldsymbol\beta}^{\mathrm{ols}}\right)+10\sum_{j=1}^{3}c_j$$
4. (5HW-14) Here is a small fake dataset with $p=4$ and $N=8$.
   y    x1  x2  x3  x4
   3     1   1   1   1
   5     1   1   1   1
  13     1   1   1   1
   9     1   1   1   1
   3     1   1   1   1
  11     1   1   1   1
   1     1   1   1   1
   5     1   1   1   1
$$\sum_{i=1}^{N}\left(y_i-\hat y_i\right)^2+\lambda\left((1-\alpha)\sum_{j=1}^{p}\left|\hat\beta_j\right|+\alpha\sum_{j=1}^{p}\hat\beta_j^2\right)$$
6. (5HW-14) Return to the context of Problem 13 of Section A.2 and the last/largest set of predictors. Center the $y$ vector to produce (say) $\mathbf{Y}^*$, remove the column of 1s from the $\mathbf{X}$ matrix (giving a $100\times 9$ matrix) and standardize the columns of the resulting matrix, to produce (say) $\mathbf{X}^*$.
a) Augment $\mathbf{Y}^*$ by adding 9 values 0 at the end of the vector (to produce a $109\times 1$ vector) and for the value $\lambda=4$ augment $\mathbf{X}^*$ (to a $109\times 9$ matrix) by adding 9 rows at the bottom of the matrix in the form of $\sqrt{\lambda}\,\mathbf{I}_{9\times 9}$. What quantity does OLS based on these augmented data seek to optimize? What is the relationship of this to a ridge regression objective?
b) Use trial and error and matrix calculations based on the explicit form of $\hat{\boldsymbol\beta}^{\mathrm{ridge}}$ given in Section 3.1.1 to identify a value $\tilde\lambda$ for which the error sum of squares for ridge regression is about 1.5 times that of OLS in this problem. Then make a series of at least 5 $\lambda$ values from 0 to $\tilde\lambda$ to use as candidates for $\lambda$. Choose one of these as an "optimal" ridge parameter $\lambda_{\mathrm{opt}}$ here based on 10-fold cross-validation (as was done in Problem 13 of Section A.2). Compute the corresponding predictions $\hat y_i^{\mathrm{ridge}}$ and plot both them and the OLS predictions as functions of $x$ (connect successive $(x,\hat y)$ points with line segments). How do the "optimal" ridge predictions based on the 9 predictors compare to the OLS predictions based on the same 9 predictors?
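A minimal R sketch of the augmentation in part a) is below; it assumes the standardized $100\times 9$ matrix and centered response of this problem are already available as Xs and Yc.

```r
# Ridge regression as OLS on augmented data: adds sqrt(lambda) * I rows and 0 responses,
# so the augmented residual sum of squares equals SSE + lambda * ||b||^2.
lambda <- 4
X_aug <- rbind(Xs, sqrt(lambda) * diag(9))
Y_aug <- c(Yc, rep(0, 9))
b_aug   <- coef(lm(Y_aug ~ X_aug - 1))                              # OLS on augmented data
b_ridge <- solve(t(Xs) %*% Xs + lambda * diag(9), t(Xs) %*% Yc)     # explicit ridge form
cbind(b_aug, b_ridge)                                               # the two agree
```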
(Training data are $N$ vectors $(z_{1i},z_{2i},y_i)$.) For this problem, one might define a (log-likelihood-based) training error as
$$err\left(a,b_1,b_2\right)=\sum_{i=1}^{N}\ln\left(1+\exp\left(a+b_1z_{1i}+b_2z_{2i}\right)\right)-\sum_{i=1}^{N}y_i\left(a+b_1z_{1i}+b_2z_{2i}\right)$$
How would you regularize fitting of this model "in ridge regression style" (penalizing only $b_1$ and $b_2$ and not $a$)? Derive 3 equations that you would need to solve simultaneously to carry out regularized fitting.
9. (6HW-15) Show the equivalence of the two forms of the optimization used to produce the fitted ridge regression parameter. (That is, show that there is a $t(\lambda)$ such that $\hat{\boldsymbol\beta}^{\mathrm{ridge}}_{\lambda}=\hat{\boldsymbol\beta}^{\mathrm{ridge}}_{t(\lambda)}$ and a $\lambda(t)$ such that $\hat{\boldsymbol\beta}^{\mathrm{ridge}}_{t}=\hat{\boldsymbol\beta}^{\mathrm{ridge}}_{\lambda(t)}$.)
10. (5E1-16) (Ridge regression produces a "grouping effect" for highly correlated predictors) Suppose that in a $p$-variable SEL prediction problem, input variables $x_1,x_2,x_3$ have very large absolute correlations. Upon standardization (and arbitrary change of signs of the standardized variables so that all correlations are positive) the variables are essentially the same, and every combination
$$\sum_{j=1}^{3}w_jx_j''\quad\text{for } w_1,w_2,w_3 \text{ with } w_1+w_2+w_3=1$$
is essentially the same. So every set of coefficients $\beta_1,\beta_2,\beta_3$ with a given sum $B=\beta_1+\beta_2+\beta_3$ has nearly the same $\sum_{j=1}^{3}\beta_jx_j''$. Argue then that any minimizer of
$$\sum_{i=1}^{N}\left(y_i-\left(\beta_0+\sum_{j=1}^{p}\beta_jx_{ij}''\right)\right)^2+\lambda\sum_{j=1}^{p}\beta_j^2$$
has $\hat\beta_1^{\mathrm{ridge}}\approx\hat\beta_2^{\mathrm{ridge}}\approx\hat\beta_3^{\mathrm{ridge}}$.
11. (5HW-18) For the situation of Problem 7 of Section A.7 (with centered response and standardized inputs $x_1$ and $x_2$) do the following concerning linear predictors
$$\hat f(x_1,x_2)=b_1x_1+b_2x_2$$
a) Plot on the same set of axes the two values $b_1$ and $b_2$ as functions of $\lambda$ (or $\ln\lambda$ if that is easier to compute or interpret) for ridge regression predictors.
b) Plot on the same set of axes the two values $b_1$ and $b_2$ as functions of $\lambda$ (or $\ln\lambda$ if that is easier to compute or interpret) for lasso regression predictors.
2. (6HW-11) Beginning in its Section 5.6, Izenman's book uses an example where PET yarn density is to be predicted from its NIR spectrum. This is a problem where $N=21$ data vectors $\mathbf{x}_i$ of length $p=268$ are used to predict the corresponding outputs $y_i$. Izenman points out that the yarn data are to be found in the pls package in R. (The package actually has $N=28$ cases. Use all of them in the following.) Get those data and make sure that all inputs are standardized and the output is centered. (Use the $N$ divisor for the sample variance.)
a) Using the pls package, find the 1, 2, 3, and 4-component PCR and PLS $\hat{\boldsymbol\beta}$ vectors.
b) Find the singular values for the matrix $\mathbf{X}$ and use them to plot the function $\mathrm{df}(\lambda)$ for ridge regression. Identify values of $\lambda$ corresponding to effective degrees of freedom 1, 2, 3, and 4. Find corresponding ridge $\hat{\boldsymbol\beta}$ vectors.
c) Plot on the same set of axes $\hat\beta_j$ versus $j$ for the PCR, PLS, and ridge vectors for number of components/degrees of freedom 1. (Plot them as "functions," connecting consecutive plotted $\left(j,\hat\beta_j\right)$ points with line segments.) Then do the same for 2, 3, and 4 components/degrees of freedom.
d) It is (barely) possible to find that the best (in terms of $R^2$) subsets of $M=1,2,3,$ and 4 predictors for OLS are respectively $\{x_{40}\}$, $\{x_{212},x_{246}\}$, $\{x_{25},x_{160},x_{215}\}$, and $\{x_{160},x_{169},x_{231},x_{243}\}$. Find their corresponding coefficient vectors. Use the lars package in R and find the lasso coefficient vectors $\hat{\boldsymbol\beta}$ with exactly $M=1,2,3,$ and 4 non-zero entries with the largest possible $\sum_{j=1}^{268}\left|\hat\beta_j^{\mathrm{lasso}}\right|$ (for the counts of non-zero entries).
e) If necessary, re-order/sort the cases by their values of $y_i$ (from smallest to largest) to get a new indexing. Then plot on the same set of axes $y_i$ versus $i$ and $\hat y_i$ versus $i$ for ridge, PCR, PLS, best subset, and lasso regressions for number of components/degrees of freedom/number of nonzero coefficients equal to 1. (Plot them as "functions," connecting consecutive plotted $(i,y_i)$ or $(i,\hat y_i)$ points with line segments.) Then do the same for 2, 3, and 4 components/degrees of freedom/counts of non-zero coefficients.
f) Use the glmnet package in R to do ridge regression and lasso regression here. Find the value of $\lambda$ for which your lasso coefficient vector in d) for $M=2$ optimizes the quantity
$$\sum_{i=1}^{N}\left(y_i-\hat y_i\right)^2+\lambda\sum_{j=1}^{268}\left|\hat\beta_j\right|$$
(by matching the error sums of squares). Then, by using the trick of Problem 1 Section A.10, employ the package to find coefficient vectors $\hat{\boldsymbol\beta}$ optimizing
$$\sum_{i=1}^{N}\left(y_i-\hat y_i\right)^2+\lambda\left((1-\alpha)\sum_{j=1}^{268}\left|\hat\beta_j\right|+\alpha\sum_{j=1}^{268}\hat\beta_j^2\right)$$
for $\alpha=0,.1,.2,\ldots,1.0$. What effective degrees of freedom are associated with the $\alpha=1$ version of this? How many of the coefficients $\beta_j$ are non-zero for each of the values of $\alpha$? Compare error sum of squares for the raw elastic net predictors to that for the linear predictors using (modified elastic net) coefficients $(1+\lambda)\hat{\boldsymbol\beta}^{\mathrm{enet}}_{\lambda,\alpha}$.
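For orientation, a minimal glmnet sketch is below (assuming the standardized inputs and centered response are available as Xs and Yc). Note that glmnet's own penalty is parameterized as $\lambda\left[\frac{1-\alpha}{2}\sum\beta_j^2+\alpha\sum\left|\beta_j\right|\right]$, which is not the same parameterization as the criterion above; matching the two is exactly what the referenced trick of Problem 1 Section A.10 is about.

```r
library(glmnet)
# Basic lasso (alpha = 1) and ridge (alpha = 0) paths for the yarn example.
fit_lasso <- glmnet(Xs, Yc, alpha = 1, standardize = FALSE, intercept = FALSE)
fit_ridge <- glmnet(Xs, Yc, alpha = 0, standardize = FALSE, intercept = FALSE)
colSums(as.matrix(coef(fit_lasso)) != 0)   # non-zero coefficient counts along the path
coef(fit_lasso, s = 0.5)                   # coefficients at one particular penalty value
```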
a) Find the OLS prediction vector $\hat{\mathbf{y}}^{\mathrm{ols}}$ here. (This is trivial. Note that the 8 columns of $\mathbf{X}$ are orthogonal.)
b) Find the 1-component PLS prediction vector $\hat{\mathbf{y}}^{\mathrm{pls}}$ here.
c) After normalizing the predictors (so that the $\Re^8$ norm of each column of the normalized $\mathbf{X}$ is 1), find the lasso prediction vector $\hat{\mathbf{y}}^{\mathrm{lasso}}$ for the penalty parameter $\lambda=10$. (Center the vector of responses, remove the first column of the $\mathbf{X}$ and work with an $8\times 7$ matrix of inputs.)
d) Using the normalized version of the predictors referred to in part c), find a vector of coefficients $\mathbf{b}$ that minimizes
$$\left(\mathbf{y}-\mathbf{X}\mathbf{b}\right)'\left(\mathbf{y}-\mathbf{X}\mathbf{b}\right)+\mathbf{b}'\operatorname{diag}\left(0,0,0,4,4,4,4\right)\mathbf{b}$$
A.13 Section 4.2 Exercises
1. (6HW-11) Find a set of basis functions for the natural (linear outside the interval $(\xi_1,\xi_K)$) quadratic regression splines with knots at $\xi_1<\xi_2<\cdots<\xi_K$.
2. (6HW-11) (B-Splines) For $a<\xi_1<\xi_2<\cdots<\xi_K<b$ consider the B-spline bases of order $m$, $\{B_{i,m}(x)\}$, defined recursively as follows. For $j<1$ define $\xi_j=a$, and for $j>K$ let $\xi_j=b$. Define
$$B_{i,1}(x)=I\left[\xi_i\leq x<\xi_{i+1}\right]$$
and for $m>1$
$$B_{i,m}(x)=\frac{x-\xi_i}{\xi_{i+m-1}-\xi_i}B_{i,m-1}(x)+\frac{\xi_{i+m}-x}{\xi_{i+m}-\xi_{i+1}}B_{i+1,m-1}(x)$$
(where we understand that if $B_{i,l}(x)\equiv 0$ its term drops out of the expression above). For $a=-0.1$ and $b=1.1$ and $\xi_i=(i-1)/10$ for $i=1,2,\ldots,11$, plot the non-zero $B_{i,3}(x)$. Consider all linear combinations of these functions. Argue that any such linear combination is piecewise quadratic with continuous first derivatives at every $\xi_i$. If it is possible to do so, identify one or more linear constraints on the coefficients (call them $c_i$) that will make $q_{\mathbf{c}}(x)=\sum_i c_iB_{3,i}(x)$ linear to the left of $\xi_1$ (but otherwise minimally constrain the form of $q_{\mathbf{c}}(x)$).
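One way to produce the plot of the order-3 B-splines is via the splines package, as in the minimal sketch below (the knot vector follows the problem's convention of setting knots below $\xi_1$ to $a=-0.1$ and above $\xi_{11}$ to $b=1.1$).

```r
library(splines)
# Evaluate and plot the quadratic (order-3) B-spline basis on a fine grid.
knots <- c(rep(-0.1, 2), seq(0, 1, by = 0.1), rep(1.1, 2))
xg <- seq(-0.1, 1.1, length.out = 601)
B <- splineDesign(knots, xg, ord = 3, outer.ok = TRUE)
matplot(xg, B, type = "l", lty = 1, xlab = "x", ylab = "B_{i,3}(x)")
```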
$$h_{j+2}(x)=\left(x-\xi_j\right)_+^3-\frac{\xi_K-\xi_j}{\xi_K-\xi_{K-1}}\left(x-\xi_{K-1}\right)_+^3+\frac{\xi_{K-1}-\xi_j}{\xi_K-\xi_{K-1}}\left(x-\xi_K\right)_+^3$$
7. (6HW-17) Your instructor will provide a dataset giving the maximum numbers of home runs hit by a "big league" professional baseball player in the US for each of 145 consecutive seasons. Consider these as values $y_1,y_2,\ldots,y_{145}$ and take $x_i=i$. Consider the basis functions for natural cubic splines with knots $\xi_j$ of the general form in Problem 5 above. Using knots $\xi_j=2+(j-1)\cdot 14$ for $j=1,2,\ldots,11$, fit a natural cubic regression spline to the home run data. Plot the fitted function on the same axes as the data points.
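A minimal R sketch of such a fit is below. It assumes the instructor-provided responses are in a numeric vector homers of length 145, and it uses ns(), which spans the same space of natural cubic splines as the basis in Problem 5 though with a different basis parameterization (here $\xi_1$ and $\xi_{11}$ are used as the boundary knots).

```r
library(splines)
x <- seq_along(homers)
knots <- 2 + (0:10) * 14                  # 2, 16, ..., 142
fit <- lm(homers ~ ns(x, knots = knots[2:10], Boundary.knots = knots[c(1, 11)]))
plot(x, homers, xlab = "season", ylab = "max home runs")
lines(x, fitted(fit))
```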
$$S_1=[0,.5]\times[0,.5],\ S_2=[0,.5]\times[.5,1],\ S_3=[.5,1]\times[0,.5],\ \text{and}\ S_4=[.5,1]\times[.5,1]$$
$$E\left[y\mid x_1,x_2\right]=2x_1x_2$$
Find the best fitting linear combination of the basis functions according to least squares.
c) Describe a set of basis functions for all continuous functions on $[0,1]\times[0,1]$ that for
$$0=\xi_0<\xi_1<\xi_2<\cdots<\xi_{K-1}<\xi_K=1\quad\text{and}\quad 0=\eta_0<\eta_1<\cdots<\eta_{M-1}<\eta_M=1$$
A.15 Section 5.1 Exercises
1. (6HW-11) Suppose that $a<x_1<x_2<\cdots<x_N<b$ and $s(x)$ is a natural cubic spline with knots at the $x_i$ interpolating the points $(x_i,y_i)$ (i.e. $s(x_i)=y_i$).
a) Let $z(x)$ be any twice continuously differentiable function on $[a,b]$ also interpolating the points $(x_i,y_i)$. Show that
$$\int_a^b\left(s''(x)\right)^2dx\leq\int_a^b\left(z''(x)\right)^2dx$$
(Use integration by parts and the fact that $s'''(x)$ is piecewise constant.)
b) Use a) and prove that the minimizer of $\sum_{i=1}^{N}\left(y_i-h(x_i)\right)^2+\lambda\int_a^b\left(h''(x)\right)^2dx$ over the set of twice continuously differentiable functions on $[a,b]$ is a natural cubic spline with knots at the $x_i$.
$$h_{j+2}(x)=\left(x-x_j\right)_+^3-\frac{x_N-x_j}{x_N-x_{N-1}}\left(x-x_{N-1}\right)_+^3+\frac{x_{N-1}-x_j}{x_N-x_{N-1}}\left(x-x_N\right)_+^3$$
so that
$$h_{j+2}''(x)=6\left(x-x_j\right)_+-6\frac{x_N-x_j}{x_N-x_{N-1}}\left(x-x_{N-1}\right)_++6\frac{x_{N-1}-x_j}{x_N-x_{N-1}}\left(x-x_N\right)_+$$
$$=6\left(x-x_j\right)I\left[x_j\leq x\leq x_{N-1}\right]+6\left(\left(x-x_j\right)-\frac{x_N-x_j}{x_N-x_{N-1}}\left(x-x_{N-1}\right)\right)I\left[x_{N-1}\leq x\leq x_N\right]$$
$$=6\left(x-x_j\right)I\left[x_j\leq x\leq x_{N-1}\right]+6\left(x-x_N\right)\frac{x_j-x_{N-1}}{x_N-x_{N-1}}\,I\left[x_{N-1}\leq x\leq x_N\right]$$
Thus for $j=1,2,3,\ldots,N-2$
$$\int_a^b\left(h_{j+2}''(x)\right)^2dx=12\left(\left(x_{N-1}-x_j\right)^3+\left(x_N-x_{N-1}\right)^3\left(\frac{x_{N-1}-x_j}{x_N-x_{N-1}}\right)^2\right)$$
$$=12\left(\left(x_{N-1}-x_j\right)^3+\left(x_N-x_{N-1}\right)\left(x_{N-1}-x_j\right)^2\right)=12\left(x_{N-1}-x_j\right)^2\left(x_N-x_j\right)$$
2. (6HW-13) A $p=2$ dataset provided with these notes consists of $N=81$ training vectors $(x_{1i},x_{2i},y_i)$ for pairs $(x_{1i},x_{2i})$ in the set $\{-2.0,-1.5,\ldots,1.5,2.0\}^2$, where the $y_i$ were generated as
for iid $\mathrm{N}\left(0,(.1)^2\right)$ variables $\epsilon_i$. Use it in the following.
a) Why should you expect MARS to be ineffective in producing a predictor in this context? (You may want to experiment with the earth package in R trying out MARS.)
b) Fit a thin plate spline to these data using the Tps function in the fields package.
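A minimal sketch of part b) is below, assuming the 81-case dataset provided with the notes has been read into a data frame dat with columns x1, x2, and y.

```r
library(fields)
# Thin plate spline fit and a quick look at the fitted surface.
fit <- Tps(cbind(dat$x1, dat$x2), dat$y)
surface(fit)   # contour/perspective summary of the fitted thin plate spline
```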
and $N=101$ training data pairs are available with $x_i=(i-1)/100$ for $i=1,2,\ldots,101$. A dataset like this is provided with these notes. Use it in the following.
a) Fit all of the following using first 5 and then 9 effective degrees of freedom.
Plot for 5 effective degrees of freedom all of the $y_i$ and the 3 sets of smoothed values against $x_i$. Connect the consecutive $(x_i,\hat y_i)$ for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom.
b) For all of the fits in a), plot as a function of $i$ the coefficients $c_i$ applied to the observed $y_i$ in order to produce $\hat f(x)=\sum_{i=1}^{101}c_iy_i$ for $x=.05,.1,.2,.3$. (Make a different plot of three curves for 5 degrees of freedom and each of the values $x$ (four in all). Then redo the plotting for 9 degrees of freedom.)
$$y\mid x\sim\mathrm{N}\left(\sin\left(\frac{1.5}{x+.1}\right)+\exp(-2x),\ (.5)^2\right)$$
(the conditional standard deviation is .5) and $N=101$ training data pairs are available with $x_i=(i-1)/100$ for $i=1,2,\ldots,101$. A dataset like this is provided with these notes. Use it in place of the dataset described in Problem 1 above and redo all of that problem.
$$p(x)=\frac32\,I\left[0<x<\tfrac12\right]+\frac12\,I\left[\tfrac12<x<1\right]\ \text{ on } [0,1]$$
and the conditional distribution of $y\mid x$ is $\mathrm{N}(x,1)$. Suppose training data $(x_i,y_i)$ for $i=1,\ldots,N$ are iid $P$ and that with $\phi$ the standard normal pdf, one uses the Nadaraya-Watson estimator for $E[y\mid x=.5]=.5$,
$$\hat f(.5)=\frac{\sum_{i=1}^{N}y_i\,\phi\left(.5-x_i\right)}{\sum_{i=1}^{N}\phi\left(.5-x_i\right)}$$
Use the law of large numbers and the continuity of the ratio function and write out the (in probability) limit for $\hat f(.5)$ in terms of a ratio of two definite integrals, and then argue that the limit is not .5.
nearly equivalent to kernel smoothing, in light of your plot in b) of that problem identify a kernel that might provide smoothed values similar to those for the penalty used there. (Name a kernel and choose a bandwidth.)
except for the "edge" cases, where we'll take $\hat y_1=.5y_1+.5y_2$ and $\hat y_{21}=.5y_{20}+.5y_{21}$.
a) For $\mathbf{S}$ the smoother matrix to be applied to a vector of observations $\mathbf{Y}=(y_1,y_2,\ldots,y_{21})$ to get smoothed values, what are the effective degrees of freedom?
b) What are (except for the "edge" cases, now with indices $j=1,2,20,$ and 21) the weights, say $w_2(|i-j|)$, used to make "doubly smoothed" values via two successive applications of the original smoothing, that is, for $\hat{\mathbf{Y}}=\mathbf{S}\mathbf{S}\mathbf{Y}$? What (approximately, you don't need to get exactly the right terms for the edge cases) are the effective degrees of freedom for $\mathbf{S}\mathbf{S}$?
c) Consider local linear regression in this same context, where the original weights are used and thus (except for edge cases) the slope and intercept used to make $\hat y_j$ are determined by minimizing
$$.25\left(y_{j-1}-\left(\beta_0+\beta_1x_{j-1}\right)\right)^2+.5\left(y_j-\left(\beta_0+\beta_1x_j\right)\right)^2+.25\left(y_{j+1}-\left(\beta_0+\beta_1x_{j+1}\right)\right)^2$$
(or equivalently 4 times this quantity). Ultimately (again except for edge cases) what weights go into a smoother matrix for an "equivalent N-W kernel smoother" in this case? (It may be helpful to recall that OLS for SLR produces $b_1=\sum_{i=1}^{N}\left(y_i-\bar y\right)\left(x_i-\bar x\right)\big/\sum_{i=1}^{N}\left(x_i-\bar x\right)^2$ and $b_0=\bar y-b_1\bar x$.)
d) Use R to compute the $n$th power of $\mathbf{S}$ for a reasonably large $n$. Why is this form really no surprise?
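A minimal R sketch for parts a) and d) is below, building the $21\times 21$ smoother matrix with interior weights $(.25,.5,.25)$ and the stated edge rows.

```r
# Build S, read off its trace (effective degrees of freedom), and look at a high power.
N <- 21
S <- matrix(0, N, N)
for (j in 2:(N - 1)) S[j, (j - 1):(j + 1)] <- c(.25, .5, .25)
S[1, 1:2] <- c(.5, .5)
S[N, (N - 1):N] <- c(.5, .5)
sum(diag(S))                                              # effective degrees of freedom
Sn <- Reduce(`%*%`, replicate(50, S, simplify = FALSE))   # S^50
round(Sn[10, ], 3)                                        # a middle row of the high power
```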
[1] -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302
-0.302
9. (6HW-17) Consider again the home run dataset of Problem 7 Section A.13. Fit with first approximately 5 and then 9 effective degrees of freedom.
Plot for approximately 5 effective degrees of freedom all of the $y_i$ and the 2 sets of smoothed values against $x_i$. Connect the consecutive $(x_i,\hat y_i)$ for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom.
10. (6HW-19) Consider again the fake data of Problem 2 of Section A.15. Carry out the steps of Problem 9 above on this dataset.
c) For training data as below, what is $\hat f(.4)$?
   y   1   3   2   4   2   6   7   9   7   8   6
   x   0  .1  .2  .3  .4  .5  .6  .7  .8  .9   1
   x1   x2    y
    1    0    4
    0    1    2
    0    0    0
    0   -1    8
   -1    0    6
a) Find the OLS predictor of $y$ of the form $\hat y=\hat f(\mathbf{x})=b_0+b_1x_1+b_2x_2$. Show "by hand" calculations. Note that predictors $x_1$ and $x_2$ can be standardized to $x_1'=\sqrt{\tfrac52}\,x_1$ and $x_2'=\sqrt{\tfrac52}\,x_2$ and made orthonormal as $x_1''=\sqrt{\tfrac12}\,x_1$ and $x_2''=\sqrt{\tfrac12}\,x_2$.
b) Consider the penalized least squares problem of minimizing (for orthonormal predictors $x_1''$ and $x_2''$) the quantity
$$\sum_{i=1}^{5}\left(y_i-\left(\beta_0+\beta_1x_1''+\beta_2x_2''\right)\right)^2+\lambda\left(\left|\beta_1\right|+\left|\beta_2\right|\right)$$
Plot on the same set of axes the minimizers $\hat\beta_1^{\mathrm{lasso}}$ and $\hat\beta_2^{\mathrm{lasso}}$ as functions of $\lambda$.
c) Evaluate the first PLS component $\mathbf{z}_1^{\mathrm{pls}}$ in this problem and find $\hat{\boldsymbol\beta}^{\mathrm{pls}}\in\Re^2$ (for centered $y$ values and standardized predictors, so that the matrix of predictors $\mathbf{X}'$ is $5\times 2$) so that $\hat{\mathbf{Y}}^{\mathrm{pls}}=\mathbf{X}\hat{\boldsymbol\beta}^{\mathrm{pls}}$ for a 1-component PLS predictor. Show "by hand" calculations.
d) Since standardization requires multiplying $x_1$ and $x_2$ by the same constant, the 3-nn predictor here is the same whether computed on the raw $(x_1,x_2)$ values or after standardization. What is it? (It takes on only a few different values. Give those values and specify the regions in which they pertain in terms of the original variables.)
e) (Again, since standardization requires multiplying $x_1$ and $x_2$ by the same constant) 2-d kernel smoothing methods applied on original and standardized scales are equivalent. So consider locally weighted bivariate regression done on the original scale using the Epanechnikov quadratic kernel and bandwidth $\lambda=1$. Write out (in completely explicit terms) the sum to be optimized by choice of constants $\beta_0,\beta_1,\beta_2$ in order to produce a prediction of the form $\hat y=\beta_0+\beta_1x_1+\beta_2x_2$ for the input vector $\left(\tfrac12,\tfrac12\right)$. What is the value of this prediction?
2. (6HW-13) Carry out the steps of Problem 1 above on the data of Problem 2 of Section A.18.
4. (5HW-14) Use all of MARS, thin plate splines, local kernel-weighted linear regression, and neural nets to fit predictors to both the noiseless and the noisy "hat data." For those methods for which it's easy to make contour or surface plots, do so. Which methods seem most effective on this particular dataset/function?
1. a neural net with single hidden layer and $M=2$ hidden nodes (and single output node) using $\sigma(u)=1/(1+\exp(-u))$ and $g(v)=v$, and
2. a projection pursuit regression predictor with $M=2$ summands $g_m\left(\mathbf{w}_m'\mathbf{x}\right)$ (based on cubic smoothing splines).
parameter (the "decay" parameter) you settle on for part b) and save the 100 predicted values for the 10 runs into 10 vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the correlations between the different runs?
d) Use the avNNet() function from the caret package to average 20 neural nets with your parameters from part b).
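A minimal sketch of part d) is below, assuming a data frame dat holding the response y and the inputs used in parts b)-c); the size and decay values here are placeholders for whatever was settled on in part b).

```r
library(caret)
library(nnet)
# Average 20 single-hidden-layer nets fit from different random starts.
set.seed(1)
avg_fit <- avNNet(y ~ ., data = dat, size = 2, decay = 0.01,
                  repeats = 20, linout = TRUE, trace = FALSE)
head(predict(avg_fit, newdata = dat))
```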
9. (5E1-16) Below is a toy diagram for a very simple single hidden layer "neural network" mean function of $x\in\Re$ (i.e. $p=1$). Suppose that outputs/responses $y$ are essentially 3 if $x<17$, essentially 8 if $17<x<20$, and essentially 3 if $x>20$. Identify numerical values of neural network parameters $\alpha_{01},\alpha_{11},\alpha_{02},\alpha_{12},\beta_0,\beta_1,\beta_2$ for which the corresponding predictor is a good approximation of the output mean function. (Here, $\sigma(u)=1/(1+\exp(-u))$ and $g(z)=z$.)
10. (5HW-16) Carry out the steps of Problem 7 of this section using the dataset of Problem 13 of Section A.2.
Plot the SEL predictor of $y$ implied by this set of fitted coefficients, $\hat f(x)$.
$$K_\lambda\left(x,\xi_j\right)=D\left(\frac{\left|x-\xi_j\right|}{\lambda}\right)\quad\text{for}\quad\xi_j=\frac{j-1}{10},\ j=1,2,\ldots,11$$
first for $\lambda=.1$ and then (in a separate plot) for $\lambda=.01$. Then make plots on a single set of axes of the 11 normalized functions
$$N_j(x)=\frac{K_\lambda\left(x,\xi_j\right)}{\sum_{l=1}^{11}K_\lambda\left(x,\xi_l\right)}$$
$$K_\lambda\left(\mathbf{x},\boldsymbol{\xi}_{ij}\right)=D\left(\frac{\left\|\mathbf{x}-\boldsymbol{\xi}_{ij}\right\|}{\lambda}\right)\quad\text{for}\quad\boldsymbol{\xi}_{ij}=\left(\frac{i-1}{10},\frac{j-1}{10}\right),\ i=1,\ldots,11\ \text{and}\ j=1,\ldots,11$$
Make contour plots for $K_{.1}\left(\mathbf{x},\boldsymbol{\xi}_{6,6}\right)$ and $K_{.01}\left(\mathbf{x},\boldsymbol{\xi}_{6,6}\right)$. Then define
$$N_{ij}(\mathbf{x})=\frac{K_\lambda\left(\mathbf{x},\boldsymbol{\xi}_{ij}\right)}{\sum_{m=1}^{11}\sum_{l=1}^{11}K_\lambda\left(\mathbf{x},\boldsymbol{\xi}_{lm}\right)}$$
to these data for two different values of $\lambda$. Then define normalized versions of the radial basis functions as
$$N_i(\mathbf{x})=\frac{K_\lambda\left(\mathbf{x},\mathbf{x}_i\right)}{\sum_{m=1}^{101}K_\lambda\left(\mathbf{x},\mathbf{x}_m\right)}$$
and redo the fitting using the normalized versions of the basis functions.
3. (5HW-14) Fit radial basis function networks based on the standard normal pdf $\phi$,
$$f(\mathbf{x})=\beta_0+\sum_{i=1}^{81}\beta_iK_\lambda\left(\mathbf{x},\mathbf{x}_i\right)\quad\text{for}\quad K_\lambda\left(\mathbf{x},\mathbf{x}_i\right)=\phi\left(\frac{\left\|\mathbf{x}-\mathbf{x}_i\right\|}{\lambda}\right)$$
to the data of Problem 2 Section A.16 for two different values of $\lambda$. Then define normalized versions of the radial basis functions as
$$N_i(\mathbf{x})=\frac{K_\lambda\left(\mathbf{x},\mathbf{x}_i\right)}{\sum_{m=1}^{81}K_\lambda\left(\mathbf{x},\mathbf{x}_m\right)}$$
and redo the fitting using the normalized versions of the basis functions.
4. (6HW-15) Fit radial basis function networks based on the standard normal pdf $\phi$,
$$f(x)=\beta_0+\sum_{m=1}^{51}\beta_mK_\lambda\left(x,\frac{m-1}{50}\right)\quad\text{for}\quad K_\lambda(x,z)=\phi\left(\frac{\left|x-z\right|}{\lambda}\right)$$
to the dataset of Problem 17 of Section A.2 for two different fixed values of $\lambda$. Define normalized versions of the radial basis functions as
$$N_i(x)=\frac{K_\lambda\left(x,\frac{i-1}{50}\right)}{\sum_{m=1}^{51}K_\lambda\left(x,\frac{m-1}{50}\right)}$$
and redo the fitting using the normalized versions of the basis functions.
class 1).
nested sequence of sub-trees of Tree 5 below.
subtree.
Bootstrap Sample b1 b2
2; 3; 5; 5; 5; 6 7:000 7:000
2; 2; 3; 4; 5; 6 6:000 9:333
1; 1; 1; 1; 2; 6 :8000 10:000
1; 1; 3; 4; 5; 6 3:333 9:333
1; 1; 2; 3; 5; 6 3:500 8:000
1; 1; 2; 4; 4; 5 1:333 10:000
2; 2; 2; 3; 5; 5 4:000 6:000
2; 3; 3; 4; 4; 5 8:000 10:000
1; 2; 2; 2; 3; 4 3:200 12:000
2; 3; 4; 5; 6; 6 7:000 8:666
2. (6E1-19) All cases in a particular $N=100$ training set are distinct/different. Suppose that one is going to make "weighted bootstrap samples" of size 100, using not equal weights of .01 on each case in the training set, but rather weights/probabilities $w_1,w_2,\ldots,w_{100}$ (where each $w_i>0$ and $\sum_{i=1}^{100}w_i=1$).
a) What is the probability that a case with $w_i=.02$ is included in a particular weighted-bootstrap sample of size 100?
b) Suppose that for $b=1,2,\ldots,B$ the corresponding weighted bootstrap sample is $T^b$ and the sample mean of responses in this sample is $\bar y^b$. Further, let $\bar y_{\mathrm{bag}}=\frac1B\sum_{b=1}^{B}\bar y^b$. Find an expression for $\lim_{B\to\infty}\bar y_{\mathrm{bag}}$ and argue carefully that your expression is correct.
under a model where $Ey_i=x_i$. Use your value from a) to argue carefully that this bias is clearly negative.
class 1). A bootstrap sample is made from this dataset and is indicated in the table and by counts next to plotted points for those points represented in the sample other than once. This sample is used to create a tree in a random forest with 4 end nodes (accomplished by 3 binary splits). A random choice is made for which of the 2 variables to split on at each opportunity and turns out to produce the sequence "$x_1$ then $x_1$ then $x_2$."
a) Identify the resulting tree by rectangles on the plot and provide the value of $\hat y$ for each rectangle.
b) Which out-of-bag points are misclassified by this particular tree?
   x   1   2   3   4   5
   y   1   4   3   5   6
For prior distributions, suppose that for Model 1 $a$ and $b$ are a priori independent with $a\sim\mathrm{U}(0,6)$ and $b-1\sim\mathrm{Exp}(1)$, while for Model 2, $c\sim\mathrm{Exp}(1)$. Further suppose that prior probabilities on the models are $\pi(1)=\pi(2)=.5$. Compute (almost surely you'll have to do this numerically) posterior means of the quantities of interest in the two Bayes models, posterior probabilities for the two models, and the overall predictor of $y$ at $x=3$.
$$P[y=1\mid x=1]\big/P[y=0\mid x=1]$$
and there are $M=2$ models under consideration. We'll suppose that joint probabilities for $(x,y)$ are as given in the tables below for the two models for some $p\in(0,1)$ and $r\in(0,1)$
   Model 1                    Model 2
   y\x     0        1         y\x     0        1
    1     .25      .25         1   (1-r)/2    r/2
    0   (1-p)/2    p/2         0     .25      .25
so that under Model 1 the quantity of interest is $.5/p$ and under Model 2 it is $r/.5$. Suppose that under both models, training data $(x_i,y_i)$ for $i=1,\ldots,N$ are iid. For priors, suppose that in Model 1 a priori $p\sim\mathrm{Beta}(2,2)$ and suppose that in Model 2 a priori $r\sim\mathrm{Beta}(2,2)$. Further, suppose that the prior probabilities of the two models are $\pi(1)=\pi(2)=.5$.
Find the posterior probabilities of the 2 models, $\pi(1\mid T)$ and $\pi(2\mid T)$, and the Bayes model average squared error loss predictor of $P[y=1\mid x=1]\big/P[y=0\mid x=1]$. (You may think of the training data as summarized in the 4 counts $N_{(x,y)}=$ number of training vectors with value $(x,y)$.)
4. (6E1-15) Below are tables specifying two discrete joint distributions for $(x,y)$ that we'll call Model 1 and Model 2. Suppose that $N=2$ training cases (drawn iid from one of the models) are $(x_1,y_1)=(2,2)$ and $(x_2,y_2)=(3,3)$.
   Model 1                       Model 2
   y\x     1      2      3       y\x    1     2     3
    3      0    .125   .125       3     0     0    .1
    2      0    .125   .125       2    .1    .2    .1
    1    .125   .125     0        1    .1    .2    .1
    0    .125   .125     0        0    .1     0     0
Suppose further that prior probabilities for the two models are $\pi_1=.3$ and $\pi_2=.7$.
a) Find the posterior probabilities of Models 1 and 2.
b) Find the "Bayes model averaging" SEL predictor of $y$ based on $x$ for these training data. (Give values $\hat f(1),\hat f(2),$ and $\hat f(3)$.)
a) Find the expected 0-1 loss for the individual classifiers and for the "majority vote" classifier. Note that the classifiers are independent according to this joint distribution.
b) Treat the vector of values of $f_1,f_2,$ and $f_3$ as "available data" and find the conditional distributions of the vector given $y=0$ and $y=1$. What is in fact the best function of these classifiers in terms of expected 0-1 loss? (Look again at Sections 1.4 and 1.5.) How does its error rate compare to the error rates from a)?
Take $\hat f_0(x)=\bar y$ and, using a "learning rate" of .5, find $\hat f_1(x)$, the first boosted iterate. (This will be $\hat f_0(x)$ plus a multiple of one of the indicator functions.)
   y    1    1    1    1    1    2    2    2    2    2
   x1   1  3.5  4.5    6  1.5    8    3  4.5    8  2.5
   x2   4  6.5  7.5    6  1.5  6.5  4.5    4  1.5    0
4. (6HW-13) Consider the famous Swiss Bank Note dataset. Use caret train() to choose (via LOOCV) both AdaBoost.M1 and random forest 0-1 loss classifiers based on these data. For a fine grid of points, indicate on a 2-d plot which points get classified to classes $-1$ and $1$ so that you can make visual comparisons.
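A minimal caret sketch of the kind of tuning asked for is below; it assumes the Swiss Bank Note data have been assembled into a data frame bank with a factor response class, and it uses the method strings as they appear in caret's model library.

```r
library(caret)
# LOOCV tuning of a random forest and an AdaBoost.M1 classifier.
ctrl <- trainControl(method = "LOOCV")
fit_rf  <- train(class ~ ., data = bank, method = "rf",          trControl = ctrl)
fit_ada <- train(class ~ ., data = bank, method = "AdaBoost.M1", trControl = ctrl)
fit_rf$results
fit_ada$results
```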
5. (6HW-13) This problem concerns the "Seeds" dataset at the UCI Machine Learning Repository. Standardize all $p=7$ input variables before beginning analysis.
a) Consider first the problem of classification where only varieties 1 and 3 are considered (temporarily code variety 1 as $-1$ and variety 3 as $+1$) and use only predictors $x_1$ and $x_6$. Use caret train() to choose (via LOOCV) both AdaBoost.M1 and random forest 0-1 loss classifiers based on these data. For a fine grid of points in $[-3,3]\times[-3,3]$, indicate on a 2-d plot which points get classified to classes $-1$ and $1$ so that you can make visual comparisons.
b) The paper "ada: An R Package for Stochastic Boosting" by Culp, Johnson, and Michailidis that appeared in the Journal of Statistical Software discusses using a one-versus-all strategy to move AdaBoost to a multi-class problem known as the "AdaBoost.MH" algorithm. Continue the use of only predictors $x_1$ and $x_6$ and find both an appropriate random forest classifier and an AdaBoost.MH classifier for the 3-class problem with $p=2$, and once more show how the classifiers break the 2-d input space up into regions of constant classification.
c) How much better can you do at the classification task using a random forest classifier based on all $p=7$ input variables than you are able to do in part b)? (Use LOOCV error rate to make your comparison.)
6. (6E2-13) Below is a toy $K=2$ class training set for $N=4$. Carry out ("by hand") enough steps of the AdaBoost.M1 algorithm (find a number of iterations $M$ large enough) to produce a voting function with 0 training error rate. Plot this function and indicate on the $x$ axis which regions call for classification to the $y=1$ class.
   y   1   1   1   1
   x   1   2   3   5
$$E\left(y-\hat f_1(x)\right),\ E\left(y-\hat f_2(x)\right),\ \mathrm{Var}\left(y-\hat f_1(x)\right),\ \mathrm{Var}\left(y-\hat f_2(x)\right)$$
(these expectations are across the joint distribution of $(x,y)$ for the fixed training set (and randomization for the random forest)). Identify an $\alpha$ approximately optimizing
$$E\left(y-\left(\alpha\hat f_1(x)+(1-\alpha)\hat f_2(x)\right)\right)^2$$
Is the optimizer 0 or 1 (i.e. is the best linear combination of the two predictors one of them alone)?
a) Beginning with $\hat f_0(x)\equiv 7$ and the first split of iteration 1 (for making $\hat e_1(x)$) as indicated on the left figure, draw in the 2nd split. Using it and a .5 learning rate, place the $N=6$ values $y_i-\hat f_1(x_i)$ onto the right figure. On that, mark the 2 cuts for creating $\hat e_2(x)$.
b) Then, again using a .5 learning rate and now your $\hat e_2(x)$ implied by the 2 cuts on the right figure above, below show the regions on which $\hat f_2(x)$ is constant and indicate the values of $\hat f_2(x)$ in those regions.
   y    1    1    1    1    1
   x  1.5   .5   .5  1.5  2.5
and base functions $I[x<c]$ and $I[x\geq c]$ for all $c$, suppose that one has a current function version $g_m(x)=3x$. Derive the function $g_{m+1}(x)$.
a) Find the SEL lasso coefficient vector $\hat{\boldsymbol\beta}$ optimizing $SSE+8\left(\left|\hat\beta_1^{\mathrm{lasso}}\right|+\left|\hat\beta_2^{\mathrm{lasso}}\right|\right)$ and give the corresponding $\hat{\mathbf{Y}}^{\mathrm{lasso}}$.
b) "Boost" your lasso SEL predictor from a) using ridge regression with $\lambda=1$ and a learning rate of .1. Give the resulting vector of predictions $\hat{\mathbf{Y}}^{\mathrm{boost1}}$.
c) Why is it clear that the predictor in b) is a linear predictor? What is $\hat{\boldsymbol\beta}$ such that $\hat{\mathbf{Y}}^{\mathrm{boost1}}=\mathbf{X}\hat{\boldsymbol\beta}$?
d) Now "boost" your SEL lasso predictor from a) using a best "stump" regression tree predictor (one that makes only a single split) and a learning rate of .1. Give the resulting vector of predictions $\hat{\mathbf{Y}}^{\mathrm{boost2}}$.
e) After $M$ iterations you won't have an $f_M(x)$ taking only values $-1$ and $1$ at every $x_i$. How do you use $f_M(x)$ to do classification?
12. (5HW-16) Use R and make a simple set of boosted predictions of home price for the dataset of Problem 5 Section A.2 by first fitting a "default" random forest (using randomForest), then correcting a fraction .1 of the residuals predicted using a 7-nn predictor, then correcting a fraction .1 of the residuals predicted using a 1-component PLS predictor. Then permute the orders in which you make these corrections and compare SSE for the 6 different possibilities.
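A minimal sketch of one of the six correction orders is below. It assumes a data frame homes with numeric response price and numeric predictors; the 7-nn step uses caret's knnreg as one convenient stand-in for a 7-nn regression fit.

```r
library(randomForest)
library(caret)    # for knnreg
library(pls)
# One correction order: random forest, then 7-nn on residuals, then 1-component PLS.
X  <- as.matrix(homes[, setdiff(names(homes), "price")])
y  <- homes$price
nu <- 0.1
f  <- predict(randomForest(X, y))                       # out-of-bag random forest predictions
r1 <- y - f
f  <- f + nu * predict(knnreg(as.data.frame(X), r1, k = 7), as.data.frame(X))
r2 <- y - f
f  <- f + nu * predict(plsr(r2 ~ X, ncomp = 1))[, , 1]  # 1-component PLS fit to residuals
sum((y - f)^2)                                          # SSE for this ordering
```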
13. (5E2-14) Below are hypothetical counts from a small training set in a 2-class classification problem with a single input, $x\in\Re$ (and we'll treat $x$ as integer-valued). Although it is easy to determine what an approximately optimal (0-1 loss) classifier is here, instead consider use of the AdaBoost.M1 algorithm to produce a classifier. (Use "stumps"/two-node trees that split between integer values as basis functions.) Find an $M=3$ term version of the AdaBoost.M1 voting function. (Give $\hat f_1,\alpha_1,\hat f_2,\alpha_2,\hat f_3,$ and $\alpha_3$. The $\hat f_m$'s are of the form $\mathrm{sign}(x-\#)$ or $\mathrm{sign}(\#-x)$ and the final voting function is $\sum_{m=1}^{3}\alpha_m\hat f_m$.)
Use the material of Section 12.1 to reproduce Figure 4.4 of HTF (color-coded by group, with group means clearly indicated). Keep in mind that you may need to multiply one or both of your coordinates by $-1$ to get the exact picture.
Identify an appropriate vector $(u_1,u_2)$ and, with your choice of vector, give the function $f(w)$ mapping $\Re\to\{1,2,3,4\}$ that defines this 4-class classifier for the case of $\pi_1=\pi_2=\pi_3=\pi_4$.
   Class                 1       2       3       4      5      6
   Inner Product Pair  (5,0)  (-5,0)  (0,3)  (0,-3)  (0,0)  (0,0)
analysis for classification among the 6 glass types. Then choose this number of input variables by forward selection with the whole dataset. What are they?
c) Find the first 2 canonical coordinates for all 215 cases in the dataset. Plot the $N=215$ ordered pairs of these using different plotting symbols for the $K=6$ glass types. Overlay on this plot classification regions based on LDA with these first 2 canonical coordinates. Make a plot analogous to the plot in Figure 4.11 of HTF. (You may simply differently color points on a fine grid according to which glass such a point would be classified to.)
c) Make a version of the right figure above with decision boundaries now determined by using logistic regression as applied to the first two canonical variates. You will need to create a data frame with columns y, canonical variate 1, and canonical variate 2. Use the vglm function (in the VGAM package) with family=multinomial() to do the logistic regression. Save the object created by a command such as LR=vglm(insert formula, family=multinomial(), data=data set). A set of observations can now be classified to groups by using the command predict(LR, newdata, type="response"), where newdata
contains the observations to be classified. The outcome of the predict function will be a matrix of probabilities. Each row contains the probabilities that a corresponding observation belongs to each of the groups (and thus sums to 1). We classify to the group with maximum probability. As in b), do the classification for a fine grid of points covering the entire area of the plot. You may again plot the points of the grid, color-coded according to their classification, instead of drawing in the black lines.
d) So that you can plot results, first use the 2 canonical variates employed thus far and use rpart in R to find a classification tree with training error rate comparable to the reduced rank LDA classifier pictured on the left above. Make a plot showing the partition of the region into pieces associated with the 11 different classes. (The intention here is that you show rectangular regions indicating which classes are assigned to each rectangle, in a plot that might be compared to the plots above and from Problem 1 of Section A.29.)
e) The Culp, Johnson, and Michailidis paper referred to in Problem 5 of Section A.28 discusses using a one-versus-all strategy that moves AdaBoost to a multi-class problem known as the "AdaBoost.MH" algorithm. Continue the use of the first two canonical coordinates of the vowel training data and find both an appropriate random forest classifier and an AdaBoost.MH classifier for the 11-class problem with $p=2$, and once more show how the classifiers break the 2-d space up into regions to be compared to other plots here.
f) Beginning with the original vowel dataset (rather than with the first 2 canonical variates), use rpart in R to find a classification tree with training error rate comparable to the classifier in d). How much (if any) simpler/smaller is the tree here than in d)?
of the form
$$\log\frac{p(\mathbf{x})}{1-p(\mathbf{x})}=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5$$
was fit (via maximum likelihood) to the 700 training cases, yielding results
$$\hat\beta_0=3.42,\ \hat\beta_1=0.41,\ \hat\beta_2=1.76,\ \hat\beta_3=0.03,\ \hat\beta_4=0.13,\ \hat\beta_5=2.09$$
a) Treating the 700 subjects (that were used to fit the logistic regression model) as a random sample of people of interest (which it is surely not), give a linear function $g(\mathbf{x})$ such that $\hat f(\mathbf{x})=I[g(\mathbf{x})>0]$ is an approximately optimal (0-1 loss) classifier ($y=1$ indicating response to the offer).
b) Continuing with the logistic regression model, properly adjust your answer to a) to provide an approximately optimal (0-1 loss) classifier for a case where a fraction $\pi_1=1/1000$ of all potential customers would respond to the offer.
List the set of support vectors and evaluate the margin for your classifier.
319
2. (5E2-14) Below is a cartoon representing the results of 3 di¤erent runs of
support vector classi…cation software on a set of training data representing K =
3 di¤erent classes in a problem with input space <2 . Each pair of classes was
used to produce a linear classi…cation boundary for classi…cation between those
two. (Labeled arrows tell which sides of the lines correspond to classi…cation
to which classes.) 7 di¤erent regions are identi…ed by Roman numerals on the
cartoon. Indicate values of an OVO (one-versus-one) classi…er f^OVO for this
situation. (For each region, identify decisions 1,2, or 3, or "?" if there is no
clear choice for a given region.)
2
for the (Gaussian) kernel K (x;z) = exp c kx zk . Make contour plots for
those functions g, and, in particular, show the g (x) = 0 contour that separates
[ 1:5; 1:0] [ :2; 1:3] into the regions where a corresponding SVM classi…es to
classes 1 and 1.
b) Have a look at the Culp, Johnson, and Michailidis paper referred to in
Problem 5 of Section A.28. It provides perspective and help with both the
ada and randomForest packages. Find both AdaBoost.M1 and random forest
classifiers appropriate for the Ripley example. For a fine grid of points in
[−1.5, 1.0] × [−0.2, 1.3], indicate on a 2-d plot which points get classified to classes
−1 and 1 so that you can make visual comparisons to the SVM classifiers referred
to in a).
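A hedged R sketch for part b), assuming a hypothetical data frame ripley with numeric inputs x1, x2 and a two-level class factor y; ada and randomForest are the packages named in the problem.

library(randomForest)
library(ada)

rf.fit  <- randomForest(y ~ x1 + x2, data = ripley)
ada.fit <- ada(y ~ x1 + x2, data = ripley, iter = 100, type = "discrete")   # AdaBoost.M1

# classify a fine grid over [-1.5, 1.0] x [-0.2, 1.3] and plot, color-coded by class
grid <- expand.grid(x1 = seq(-1.5, 1.0, length.out = 200),
                    x2 = seq(-0.2, 1.3, length.out = 200))
grid$rf  <- predict(rf.fit,  newdata = grid)
grid$ada <- predict(ada.fit, newdata = grid)

par(mfrow = c(1, 2))
plot(grid$x1, grid$x2, col = as.integer(grid$rf),  pch = 15, cex = 0.3, main = "random forest")
plot(grid$x1, grid$x2, col = as.integer(grid$ada), pch = 15, cex = 0.3, main = "AdaBoost.M1")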
2. (6E2-11) In what specific way(s) does the use of kernels and SVM method-
ology typically lead to identification of a small number of important features
(basis functions) that are effective in 2-class classification problems?
3. (6HW-13) Consider again the Swiss Bank Note dataset of Problem 4 of
Section A.28. Use caret train() to choose (via LOOCV) an SVM based on a
Gaussian kernel and 0-1 loss for these data. Compare its training set
and cross-validation error rates to what you found in Problem 4 of Section
A.28 and Problem 2 of Section A.30.
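A hedged caret sketch for this problem, assuming a hypothetical data frame banknote whose factor Status is the genuine/counterfeit label and whose remaining columns are the note measurements; "svmRadial" is caret's name for a Gaussian-kernel SVM (fit through kernlab).

library(caret)

ctrl <- trainControl(method = "LOOCV")
svm.fit <- train(Status ~ ., data = banknote,
                 method = "svmRadial",                  # Gaussian (RBF) kernel
                 trControl = ctrl,
                 tuneLength = 8,                        # grid over the cost parameter
                 preProcess = c("center", "scale"))

svm.fit                                                 # LOOCV accuracy by tuning value
mean(predict(svm.fit, banknote) != banknote$Status)     # training error rate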
with the smallest. Make a plot showing the regions in (0, 1)² where your best
neural net classifier has f^(x) = 1, 2, 3, and 4.
e) Use svm() in package e1071 to fit SVMs to the y = 1 and y = 2 training
data for the
"linear" kernel,
"polynomial" kernel (with default order 3), and
"radial basis" kernel (with default gamma, half that gamma value, and
twice that gamma value).
Use the plot() function to investigate the nature of the 5 classifiers. Put
the training data pairs on the plot using different symbols or colors for classes
1 and 2, and also identify the support vectors.
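A hedged sketch of the five e1071 fits in part e), assuming a hypothetical data frame train12 holding inputs x1, x2 and the class factor y restricted to classes 1 and 2.

library(e1071)

fit.lin  <- svm(y ~ x1 + x2, data = train12, kernel = "linear")
fit.poly <- svm(y ~ x1 + x2, data = train12, kernel = "polynomial", degree = 3)
fit.rad  <- svm(y ~ x1 + x2, data = train12, kernel = "radial")           # default gamma
gam <- fit.rad$gamma
fit.radlo <- svm(y ~ x1 + x2, data = train12, kernel = "radial", gamma = gam / 2)
fit.radhi <- svm(y ~ x1 + x2, data = train12, kernel = "radial", gamma = 2 * gam)

# plot.svm shades the two decision regions, plots the data, and marks the support vectors
plot(fit.lin,   train12)
plot(fit.poly,  train12)
plot(fit.rad,   train12)
plot(fit.radlo, train12)
plot(fit.radhi, train12)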
f) Find SVMs (using the kernels indicated in e)) for the K = 4 class prob-
lem. Again, use the plot() function to investigate the nature of the 5 classifiers.
Use the large test set to evaluate the (conditional) error rates for these 5 clas-
sifiers.
g) Use either the ada package or the adabag package and fit an AdaBoost.M1
classifier to the y = 1 and y = 2 training data. Make a plot showing the regions
in (0, 1)² where this classifier has f^(x) = 1 and 2. Use the large test set
to evaluate the conditional error rate of this classifier. How does this error
rate compare to the best possible one for comparing classes 1 and 2 with equal
weights on the two? (You should be able to get the latter analytically.)
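A hedged sketch of part g) using the adabag package (ada, used earlier, is the other option the problem names); train12 and test12 are hypothetical data frames with inputs x1, x2 in (0, 1) and the two-class factor y.

library(adabag)

boost.fit <- boosting(y ~ x1 + x2, data = train12, mfinal = 100)   # AdaBoost.M1

# classify a fine grid; predict.boosting expects the response column to be present in
# newdata, so a placeholder y is attached to the grid before predicting
grid <- expand.grid(x1 = seq(0, 1, length.out = 200),
                    x2 = seq(0, 1, length.out = 200))
grid$y <- factor(levels(train12$y)[1], levels = levels(train12$y))
grid$pred <- predict(boost.fit, newdata = grid)$class
plot(grid$x1, grid$x2, col = ifelse(grid$pred == levels(train12$y)[1], 2, 4),
     pch = 15, cex = 0.3)

# conditional (test set) error rate
mean(predict(boost.fit, newdata = test12)$class != test12$y)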
h) It appears from the Culp, Johnson, and Michailidis paper referred to in
Problem 5 of Section A.28 that ada implements an OVA version of a K-class
AdaBoost classifier in R. Use this and find the corresponding classifier. Make
a plot showing the regions in (0, 1)² where this classifier has f^(x) = 1, 2, 3, and
4. Use the large test set to evaluate the conditional error rate of this classifier.
a) On what basis might one expect that for large N this classifier is approx-
imately optimal?
b) For what "voting function" g(x) is f^(x) = sign(g(x))? Is this g(x) a
linear combination of radial basis functions?
c) Why will f^(x) typically not be of the form of a support vector machine
based on a Gaussian kernel?
A.34 Section 14 Exercises
1. (6E2-11) The reduced rank classifier of Problem 2 of Section A.29 can be
thought of as a "prototype classifier." Give 4 prototypes (real numbers) that
can be thought of as defining the classifier.
a) Show that
R(x, z) = 1 + min(x, z)
is a reproducing kernel for this Hilbert space of functions.
b) Using Heckman’s development, describe as completely as possible
\[
\arg\min_{h \in \mathcal{A}} \left( \sum_{i=1}^{N} \left( y_i - h(x_i) \right)^2 + \lambda \int_0^1 \left( h'(x) \right)^2 \, dx \right)
\]
a) For two different values of λ > 0 find the optimizing function h ∈ A for
the criterion in part b) of Problem 1.
b) For two different values of λ > 0 find the optimizing function h ∈ A for
the criterion in part c) of Problem 1.
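A minimal numerical sketch, assuming the relevant criterion can be written as a residual sum of squares plus λ times the squared A-norm of h (the seminorm penalty of part b) needs the usual null-space adjustment, which is omitted here). Under that assumption the representer theorem gives h(x) = Σ_j α_j R(x, x_j) with α = (K + λI)^(-1) y, where K_ij = R(x_i, x_j); x and y below are the small data vectors from the problem.

R.kern <- function(s, t) 1 + pmin(s, t)

fit.h <- function(x, y, lambda) {
  K <- outer(x, x, R.kern)
  alpha <- solve(K + lambda * diag(length(x)), y)
  function(x0) as.vector(outer(x0, x, R.kern) %*% alpha)   # fitted h, vectorized in x0
}

h1 <- fit.h(x, y, lambda = 0.1)    # one value of lambda > 0
h2 <- fit.h(x, y, lambda = 1.0)    # a second value
curve(h1(x), from = 0, to = 1)
curve(h2(x), add = TRUE, lty = 2)
points(x, y)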
You may use the fact that the least squares line through these data pairs is
y^ = 3.7 − 0.5x. Find the optimizing f^(x).
\[
\sum_{i=1}^{N} \left( y_i - \left( \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + h(x_i) \right) \right)^2 + \lambda \left\| h \right\|_{\mathcal{A}}^2
\]
(There are 4 different optimizations intended here for the two kernels and two
values of λ.) Plot the 4 resulting functions
\[
\beta_0 + \beta_1 x + \beta_2 x^2 + h(x)
\]
on a single set of axes, together with the 145 original (x, y) data points. (If
this is computationally infeasible, you may reduce the size of the problem by
considering only the "last N" years in the dataset, where N doesn’t break your
computer.)
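A hedged sketch of one of these fits. Assuming the representer theorem applies to this semiparametric criterion, h(x) = Σ_j α_j K(x, x_j) and the minimization over (β, α) is a single penalized least squares solve; x and y are the data vectors and Kern is whichever of the two kernels is being used (both names are placeholders).

fit.semipar <- function(x, y, Kern, lambda) {
  N <- length(x)
  X <- cbind(1, x, x^2)                    # parametric part: beta0 + beta1*x + beta2*x^2
  K <- outer(x, x, Kern)
  M <- cbind(X, K)
  B <- rbind(cbind(matrix(0, 3, 3), matrix(0, 3, N)),
             cbind(matrix(0, N, 3), K))    # penalty lambda * alpha' K alpha on the alpha block only
  theta <- solve(crossprod(M) + lambda * B, crossprod(M, y))   # assumes a nonsingular system
  beta <- theta[1:3]; alpha <- theta[-(1:3)]
  function(x0) as.vector(cbind(1, x0, x0^2) %*% beta + outer(x0, x, Kern) %*% alpha)
}

# e.g., one of the four fits, plotted over the data (Kern1 and lam1 are placeholders):
# f11 <- fit.semipar(x, y, Kern1, lam1); plot(x, y); curve(f11(x), add = TRUE)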
A.38 Section 17.1 Exercises
1. (6HW-13) There is a small fake dataset below. It purports to be a record
of 20 transactions in a drugstore where toothpaste, toothbrushes, and shaving
cream are sold. Assume that there are 80 other transaction records that include
no purchases of any toothpaste, toothbrush, or shaving cream.
a) Find I.02 (the collection of item sets with support at least .02).
b) Find all association rules derivable from rectangles in I.02 with confidence
at least .5.
c) Find the association rule derivable from a rectangle in I.02 with the largest
lift.
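The problem is small enough to do by hand, but a hedged arules sketch of the same computations may be useful, assuming the 100 transactions are encoded as a logical incidence matrix tmat with (hypothetical) column names like toothpaste, toothbrush, and shave.cream.

library(arules)

trans <- as(tmat, "transactions")

# a) frequent item sets with support at least .02
isets <- apriori(trans, parameter = list(supp = 0.02, target = "frequent itemsets"))
inspect(isets)

# b) rules with support at least .02 and confidence at least .5
rules <- apriori(trans, parameter = list(supp = 0.02, conf = 0.5, target = "rules"))
inspect(rules)

# c) the rule with the largest lift
inspect(head(sort(rules, by = "lift"), 1))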
for you. (Look under the Analyze->Multivariate menu.) In particular, it will
do both hierarchical and K-means clustering, and even self-organizing mapping
as an option for the latter. Consider
i) several different K-means clusterings (say with K = 9, 12, 15, 21),
ii) several different hierarchical clusterings based on 7-d Euclidean distance
(say, again with K = 9, 12, 15, 21 final clusters), and
iii) SOMs for several different grids (say 3×3, 3×5, 4×4, and 5×5).
Make some comparisons of how these methods break up the 210 data cases
into groups. You can save the clusters into the JMP worksheet and use the
GraphBuilder to quickly make plots. If you "jitter" the cluster numbers and
use "variety" for both size and color of plotted points, you can quickly get a
sense as to how the groups of data points match up method-to-method and
number-of-clusters-to-number-of-clusters (and how the clusters are or are not
related to seed variety). Also make some comparisons of the sums of squared
Euclidean distances to cluster centers.
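If one prefers R to the JMP platform described above, a rough (hedged) equivalent of the K-means and hierarchical comparisons is sketched below; the SOM piece is omitted. Here seeds is a hypothetical data frame whose first 7 columns are the measurements (standardized in the first line) and whose variety column is the seed type.

X <- scale(seeds[, 1:7])

km  <- kmeans(X, centers = 12, nstart = 25)
hcl <- cutree(hclust(dist(X)), k = 12)

table(kmeans = km$cluster, hclust = hcl)    # how the two 12-cluster solutions match up
table(km$cluster, seeds$variety)            # clusters versus seed variety

km$tot.withinss                             # total within-cluster sum of squares (K-means)
sum(sapply(split(as.data.frame(X), hcl),    # the same quantity for the hclust groups
           function(d) sum(scale(d, scale = FALSE)^2)))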
hierarchical clustering) using average linkage to cluster the 215 glass samples on
the basis of the 7 (standardized) inputs used there. For the case of 6 clusters
from each method, make a table giving counts of cases in a given cluster from
mclust and a given cluster from hclust. Then compute the "Rand index" for
comparing clusterings (look it up on Wikipedia).
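A hedged sketch of this comparison, assuming X is the matrix of the 7 standardized inputs; the Rand index is computed directly from its pair-counting definition rather than from a package.

library(mclust)

mc <- Mclust(X, G = 6)
hc <- cutree(hclust(dist(X), method = "average"), k = 6)

table(mclust = mc$classification, hclust = hc)

rand.index <- function(a, b) {
  n <- length(a)
  same.a <- outer(a, a, "==")
  same.b <- outer(b, b, "==")
  agree <- (same.a == same.b)
  (sum(agree) - n) / (n * (n - 1))   # agreeing pairs over all pairs (diagonal removed)
}
rand.index(mc$classification, hc)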
3. (6HW-17) The 10 countries in the world with the largest populations are
China, India, United States, Indonesia, Brazil, Pakistan, Nigeria, Bangladesh,
Russia, and Mexico. You can find the (great circle) distances between their
capital cities using this online calculator:
http://www.chemical-ecology.net/java/capitals.htm
Use multi-dimensional scaling to make a 2-d representation of these cities intended
to more or less preserve great circle distances. (The pattern at
http:/www.personality-project.org.html might prove helpful to you.)
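A hedged sketch using classical MDS via cmdscale(), assuming D is a hypothetical symmetric 10 × 10 matrix of the great circle distances (with the country or capital names as its dimnames) filled in from the online calculator; MASS::isoMDS would give a non-metric alternative.

mds <- cmdscale(as.dist(D), k = 2)

plot(mds, type = "n", xlab = "", ylab = "", asp = 1)
text(mds, labels = rownames(D))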
a) Find (for the fitted model) the ratio P[x = (1, 0, 1, 0)] / P[x = (0, 0, 0, 0)].
b) Find (for the fitted model) the conditional distribution of (x1, x2) given
that (x3, x4) = (0, 0). (You will need to produce 4 conditional probabilities.)
variable can take only integer values 1 through 10 and is probably not really an interval-
level variable in the first place (being more ordinal in nature). For purposes of the exercise we
will ignore these matters, treat the quality rating as a measured numerical response, and
consider prediction under SEL.
a) Find sets of best (according to LOOCV) predictions for the quality ratings
for each of the following (a sketch of corresponding caret train() calls appears after the list):
k-nn prediction
elastic net prediction
PCR prediction
PLS prediction
MARS prediction (implemented in earth)
regression tree prediction
random forest prediction
boosted trees prediction
Cubist prediction
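A hedged sketch of the caret calls referred to in a), assuming a hypothetical data frame wine whose numeric quality column is the response; the method strings are caret's standard names for these nine procedures, and LOOCV can be very slow on the full dataset.

library(caret)

ctrl <- trainControl(method = "LOOCV")
methods <- c("knn", "glmnet", "pcr", "pls", "earth", "rpart", "rf", "gbm", "cubist")

fits <- lapply(methods, function(m)
  train(quality ~ ., data = wine, method = m, trControl = ctrl, tuneLength = 5))
names(fits) <- methods

sapply(fits, function(f) min(f$results$RMSE, na.rm = TRUE))   # best LOOCV RMSE per method
preds <- sapply(fits, predict, newdata = wine)                # the nine sets of predictions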
1. Use the remainder as a training set and the values of the 9 predictors (on
the remainder) as "features" in an MLR model (including intercept). Use
OLS to fit this to the outputs for the cases in the remainder.
2. Use the remainder as a training set and the values of the 9 predictors (on
the remainder) as "features" and fit a default random forest to the outputs
for the cases in the remainder.
3. Apply the coefficients from 1. to the 9 predictions to make an ensemble
prediction for each case in the fold.
4. Apply the random forest fit in 2. to the 9 predictions to make an ensemble
prediction for each case in the fold.
5. For both the predictions in 3. and 4. add the squared differences between
outputs and predicted outputs across the fold.
6. Total the results of 5. across the 10 folds, divide by N, and take a square
root to get an "RMSPE" for the basic methods and parameters combined
through OLS and through a random forest.
Do the "RMSPE" values in 6. improve on what you have for the best of the
CVRMSPEs for the individual methods?
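A hedged sketch of steps 1. through 6., assuming P is the N × 9 matrix of predictions produced by the tuned methods in a) (one column per method, e.g. the preds matrix from the earlier sketch) and y is the vector of N quality ratings; the fold assignment is made at random here.

library(randomForest)

set.seed(1)
fold <- sample(rep(1:10, length.out = length(y)))
sse.ols <- sse.rf <- 0

for (k in 1:10) {
  in.fold <- (fold == k)
  dat <- data.frame(y = y, P)                                  # response plus the 9 "features"

  ols <- lm(y ~ ., data = dat[!in.fold, ])                     # step 1.
  rf  <- randomForest(y ~ ., data = dat[!in.fold, ])           # step 2.

  sse.ols <- sse.ols + sum((y[in.fold] - predict(ols, dat[in.fold, ]))^2)  # steps 3. and 5.
  sse.rf  <- sse.rf  + sum((y[in.fold] - predict(rf,  dat[in.fold, ]))^2)  # steps 4. and 5.
}

sqrt(sse.ols / length(y))   # step 6.: "RMSPE" with the OLS combination
sqrt(sse.rf  / length(y))   #          "RMSPE" with the random forest combination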
d) Discuss how you would get an "honest" CV assessment of the likely perfor-
mance of the strategy of first fitting predictors using the methods in a) (obtaining
tuning parameters from caret train()) and then combining them via OLS MLR or
a default random forest. Explain why the "RMSPE" values from c) are probably
too optimistic to serve the purpose here.
the AUC criterion,57 and
cross-validation (and OOB) 0-1 loss error rates.
y = I[y ≥ 7]
Call a wine with rating 7 or better a "good" wine and this becomes a problem
of classification of wines into "not good" and "good" ones.
a) Carry out the steps a) through i) in Problem 2 above (there referring to
the Glass-Identification problem) for this wine classification problem.
Consider the problem of combining basic classification methodologies via
stacking/generalized stacking/meta-prediction/super-learning in the White Wines
classification problem immediately above.
b) Use the outputs of your classifiers developed in a) and the original input
variables (the 11 quality measures, giving 9 + 11 "features" in total) as inputs to
a default random forest. (Where they are available, use estimated conditional
probabilities for class 2 rather than the classification values assigned to the
training cases by the classifiers.) What "training error rate" is produced for
0-1 loss? There is a nominal random forest "OOB error" rate associated with
your final "super-learner." Why should you NOT trust either of these numbers
as being indicative of the likely performance of the "tune 9 classifiers and plug
their outputs into a default random forest" prediction methodology?
c) Say very clearly and carefully how (given plenty of computing power) you
would compute an honest assessment of the likely performance of the "super-
learner" described above.
57 You may, for example, use the pROC package to (plot the "ROC curve" and) compute this.