

Lecture Notes on Modern Multivariate
Statistical Learning-Version IV
Stephen B. Vardeman
Analytics Iowa LLC
and
Iowa State University
June 18, 2021

Abstract
This set of notes is the most recent reorganization and update-in-progress
of Modern Multivariate Statistical Learning course material developed
2009-2020 over 7 offerings of PhD-level courses and 4 offerings of
an MS-level course in the Iowa State University Statistics Department, a
short course given in the Statistics Group at Los Alamos National Lab,
and two offered through Statistical Horizons LLC. Early versions of the
courses were based mostly on the topics and organization of The Elements
of Statistical Learning by Hastie, Tibshirani, and Friedman, though very
substantial parts benefited from Izenman's Modern Multivariate Statistical
Techniques, and from Principles and Theory for Data Mining and
Machine Learning by Clarke, Fokoué, and Zhang.

The present version benefits from a thoughtful set of written comments


on an earlier iteration of the notes provided by Ken Ryan and Mark Culp,
incisive observations on the material and suggestions concerning what
I've said about it made by Max Morris and Huaiqing Wu during the MS-level
course we taught together Spring 2014, additional helpful critiques
offered by LANL statisticians in Summer 2016, and material from Bishop's
Pattern Recognition and Machine Learning, Applied Predictive Modeling
by Kuhn and Johnson, and An Introduction to Statistical Learning by
James, Witten, Hastie, and Tibshirani. The work of a number of ISU
PhD and MS advisees including Jing Li, Wen Zhou, Cory Lanker, Andee
Kaplan, and Abhishek Chakraborty has also provided useful additional
content reflected in this version.

These notes have as prerequisites the Statistical Theory, Methods, and


Computing content of the first-year courses in a Statistics MS program,
though presumably much of the material can be understood with less background.

Contents

I Introduction, Generalities, and Some Background Material 8
1 Overview/Context 8
1.1 Notation and Terminology . . . . . . . . . . . . . . . . . . . . . . 8
1.2 What is New Here (Particularly in Prediction)? . . . . . . . . . . 9
1.2.1 Matching Complexity to Training Set Information Content 9
1.2.2 The "Curse of Dimensionality" . . . . . . . . . . . . . . . 10
1.3 Some Initial Generalities About Prediction . . . . . . . . . . . . 12
1.3.1 Representing What is Known: Creating a Training Set for
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Theoretically Optimal (Unrealizable) Predictors . . . . . 13
1.3.3 Nearest Neighbor Rules . . . . . . . . . . . . . . . . . . . 15
1.3.4 General Decomposition of the Expected Prediction Loss
for f^ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.5 A More Detailed Decomposition for Err in SEL Prediction
and Variance-Bias Trade-off . . . . . . . . . . . . . . . 18
1.3.6 Approximating Err and Cross-Validation . . . . . . . . . 20
1.3.7 Choosing a Predictor Based on Cross-Validation . . . . . 23
1.3.8 Penalized Training Error Fitting and Choosing Complexity 24
1.4 Good Features and Prediction . . . . . . . . . . . . . . . . . . . . 25
1.4.1 Classification Models and Optimal Features . . . . . . . . 25
1.4.2 Approximating "Partially Optimal" Numerical Features
for Discrete Parts of Input Vectors . . . . . . . . . . . . . 26
1.4.3 Abstract Feature Spaces (of Functions) and "Kernels" . . 28
1.4.4 Document Features and String Kernels for Text Processing 33
1.4.5 "Feature Engineering" and Data "Pre-processing": More
Perspective and Prediction of Predictor Efficacy . . . . . 36
1.5 Some More Generalities for 2-Class Classification . . . . . . . . . 38
1.5.1 More on the Form of an Optimal 0-1 Loss Classifier for
K = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.5.2 Other Prediction Problems in 2-Class Classification Models 40
1.5.3 "Voting Functions," Losses for Them, and Expected 0-1
Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.6 Density Estimation and Approximately Optimal and Naive Bayes
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7 Plotting to Portray the Effects of Particular Inputs in Prediction 49

2 Some Linear Theory, Linear Algebra, and Principal Components 50
2.1 Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2 The (General) Gram-Schmidt Process and the QR Decomposi-
tion of a rank = p Matrix X . . . . . . . . . . . . . . . . . . . . 52

2.3 The Singular Value Decomposition of X . . . . . . . . . . . . . . 56
2.3.1 The Singular Value Decomposition and General Inner Prod-
uct Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4 Matrices of Centered Columns and Principal Components . . . . 59
2.4.1 "Ordinary" Principal Components . . . . . . . . . . . . . 59
2.4.2 "Kernel" Principal Components . . . . . . . . . . . . . . . 63
2.4.3 Graphical (Spectral) Features . . . . . . . . . . . . . . . . 64

II Supervised Learning I: Basic Prediction Methodology 66
3 (Non-OLS) SEL Linear Predictors 66
3.1 Ridge Regression, the Lasso, and Some Other Shrinking Methods 67
3.1.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.2 The Lasso, Etc. . . . . . . . . . . . . . . . . . . . . . . . . 71
3.1.3 Least Angle Regression (LAR) . . . . . . . . . . . . . . . 76
3.2 Two Methods With Derived Input Variables . . . . . . . . . . . . 79
3.2.1 Principal Components Regression . . . . . . . . . . . . . . 79
3.2.2 Partial Least Squares Regression . . . . . . . . . . . . . . 80

4 Linear SEL Prediction Using Basis Functions 83


4.1 p = 1 Wavelet Bases . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 p = 1 Piecewise Polynomials and Regression Splines . . . . . . . 87
4.3 Basis Functions and p-Dimensional Inputs . . . . . . . . . . . . . 89
4.3.1 Multi-Dimensional Regression Splines (Tensor Product Bases) 89
4.3.2 MARS (Multivariate Adaptive Regression Splines) . . . . 90

5 Smoothing Splines and SEL Prediction 92


5.1 p = 1 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Multi-Dimensional Smoothing Splines . . . . . . . . . . . . . . . 97
5.3 An Abstraction of the Smoothing Spline Material and Penalized
Fitting in $\Re^N$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Graph-Based Penalized Fitting/Smoothing (and Semi-Supervised
Learning) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6 Kernel and Local Regression Smoothing Methods and SEL Prediction 102
6.1 One-dimensional Kernel and Local Regression Smoothers . . . . 102
6.2 Local Regression Smoothing in p Dimensions . . . . . . . . . . . 105

7 High-Dimensional Use of Low-Dimensional Smoothers and SEL Prediction 106
7.1 Structured Regression Functions . . . . . . . . . . . . . . . . . . 106
7.1.1 Additive Models . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.2 Other Structured Regression Forms . . . . . . . . . . . . . 107

7.2 Projection Pursuit Regression . . . . . . . . . . . . . . . . . . . . 108

8 Highly Flexible Non-Linear Parametric Prediction Methods 108


8.1 Neural Network Regression . . . . . . . . . . . . . . . . . . . . . 108
8.2 Neural Network Classification . . . . . . . . . . . . . . . . . . . . 110
8.3 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 111
8.3.1 The Back-Propagation Algorithm . . . . . . . . . . . . . . 111
8.3.2 Formal Regularization of Fitting . . . . . . . . . . . . . . 114
8.4 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . 115
8.5 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 118
8.6 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . . 119

9 Prediction Methods Based on Rectangles: Trees and PRIM 120


9.1 Regression and Classification Trees (CART) . . . . . . . . . . . . 121
9.1.1 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . 121
9.1.2 Classification Trees . . . . . . . . . . . . . . . . . . . . . . 123
9.1.3 Optimal Subtrees . . . . . . . . . . . . . . . . . . . . . . . 124
9.1.4 Measuring the Importance of Inputs for Tree Predictors . 127
9.2 PRIM (Patient Rule Induction Method) . . . . . . . . . . . . . . 127

10 Predictors Built on Bootstrap Samples 129


10.1 Bagging in General . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.2 Random Forests: Special Bagging of Tree Predictors . . . . . . . 131
10.3 Measuring the Importance of Inputs for Bagged Predictors . . . . 133
10.3.1 The Boruta Wrapper/Heuristic for Variable Selection . . 134
10.4 Bumping and "Active Set Selection" . . . . . . . . . . . . . . . . 135

11 "Ensembles" of Predictors 136


11.1 Bayesian Model Averaging for Prediction . . . . . . . . . . . . . 136
11.2 Stacking: SEL ... and 0-1 Loss . . . . . . . . . . . . . . . . . . . 138
11.3 "Generalized Stacking" and "Deep" Structures for Prediction . . 140
11.4 Boosting/Successive Approximation . . . . . . . . . . . . . . . . 143
11.4.1 SEL Boosting . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4.2 General "Gradient Boosting" . . . . . . . . . . . . . . . . 144
11.4.3 Some Issues Related to Boosting Practice . . . . . . . . . 147
11.4.4 AdaBoost.M1 . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.5 Quinlan’s Cubist and "Divide and Conquer" Strategies . . . . . . 152

III Intermission: Perspective and Prediction in Practice 154

IV Supervised Learning II: More on Classification and Additional Theory 156

12 Basic Linear (and a Bit on Quadratic) Methods of Classification 156
12.1 Linear (and a bit on Quadratic) Discriminant Analysis . . . . . . 157
12.1.1 Dimension Reduction in LDA . . . . . . . . . . . . . . . . 159
12.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.3 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . 165

13 Support Vector Machines 166


13.1 The Linearly Separable Case: Maximum Margin Classifiers . . . 166
13.2 The Linearly Non-separable Case: Support Vector Classifiers . . 170
13.3 SV Classifiers and Kernels: Support Vector Machines . . . . . . . 173
13.3.1 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.3.2 A Penalized-Fitting Function-Space Optimization Argu-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.3.3 A Function-Space-Support-Vector-Classifier Geometry Ar-
gument . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.3.4 Some Perspective on SVMs . . . . . . . . . . . . . . . . . 178
13.4 Other Support Vector Methods . . . . . . . . . . . . . . . . . . . 179

14 Prototype and (More on) Nearest Neighbor Methods of Classification 181

15 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction 184
15.1 RKHSs and p = 1 Cubic Smoothing Splines . . . . . . . . . . . . 184
15.2 What is Possible Beginning from Linear Functionals and Linear
Differential Operators for p = 1 . . . . . . . . . . . . . . . . . . . 185
15.3 What Is Common Beginning Directly From a Kernel . . . . . . . 187
15.3.1 Reprise of Some Special Cases . . . . . . . . . . . . . . . 192
15.3.2 Addendum Regarding the Structures of the Spaces Re-
lated to a Kernel . . . . . . . . . . . . . . . . . . . . . . . 193
15.4 Gaussian Process "Priors," Bayes Predictors, and RKHSs . . . . 194

16 More on Understanding and Predicting Predictor Performance 196


16.1 Optimism of the Training Error . . . . . . . . . . . . . . . . . . . 197
16.2 Cp , AIC and BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
16.2.1 Cp and AIC . . . . . . . . . . . . . . . . . . . . . . . . . . 198
16.2.2 BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
16.3 Cross-Validation Estimation of Err . . . . . . . . . . . . . . . . . 200
16.4 Bootstrap Estimation of Err . . . . . . . . . . . . . . . . . . . . . 201

V Unsupervised Learning Methods 201

17 Some Methods of Unsupervised Learning 202
17.1 Association Rules/Market Basket Analysis . . . . . . . . . . . . . 202
17.1.1 The "Apriori Algorithm" and Use of its Output . . . . . . 204
17.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
17.2.1 Partitioning Methods ("Centroid"-Based Methods) . . . . 207
17.2.2 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . 208
17.2.3 (Mixture) Model-Based Methods . . . . . . . . . . . . . . 210
17.2.4 Biclustering . . . . . . . . . . . . . . . . . . . . . . . . . . 211
17.2.5 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 214
17.3 Multi-Dimensional Scaling . . . . . . . . . . . . . . . . . . . . . . 218
17.4 More on Principal Components and Related Ideas . . . . . . . . 220
17.4.1 "Sparse" Principal Components . . . . . . . . . . . . . . . 220
17.4.2 Non-negative Matrix Factorization . . . . . . . . . . . . . 221
17.4.3 Archetypal Analysis . . . . . . . . . . . . . . . . . . . . . 222
17.4.4 Independent Component Analysis . . . . . . . . . . . . . 222
17.4.5 Principal Curves and Surfaces . . . . . . . . . . . . . . . . 225
17.5 (Original) Google PageRanks . . . . . . . . . . . . . . . . . . . . 228

VI Miscellanea 230
18 Graphs as Representing Independence Relationships in Multivariate Distributions 230
18.1 Some Considerations for Directed Graphical Models . . . . . . . 231
18.2 Some Considerations for Undirected Graphical Models . . . . . . 233
18.2.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . 235

19 Special Bayes Methods for Statistical Learning 240


19.1 Relevance Vector Machines . . . . . . . . . . . . . . . . . . . . . 240
19.2 Dirichlet and Data-Derived Priors for Prediction Based on Nor-
mal Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . 242
19.3 Bayes Mixture Analyses for Binary Vectors . . . . . . . . . . . . 244

VII Appendices 245


A Exercises 245
A.1 Section 1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 245
A.2 Section 1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 247
A.3 Section 1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 261
A.4 Section 1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 267
A.5 Section 1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 270
A.6 Section 2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 271
A.7 Section 2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 272
A.8 Section 2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 274
A.9 Section 2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 276

A.10 Section 3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 281
A.11 Section 3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 285
A.12 Section 4.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.13 Section 4.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 289
A.14 Section 4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 290
A.15 Section 5.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 291
A.16 Section 5.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 292
A.17 Section 5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.18 Section 6.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.19 Section 6.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 298
A.20 Section 7.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.21 Section 8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.22 Section 8.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 302
A.23 Section 9.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 303
A.24 Section 10.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 306
A.25 Section 10.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 307
A.26 Section 11.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 308
A.27 Section 11.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 310
A.28 Section 11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 311
A.29 Section 12.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 315
A.30 Section 12.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 317
A.31 Section 13.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 319
A.32 Section 13.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 319
A.33 Section 13.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 320
A.34 Section 14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 323
A.35 Section 15.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 323
A.36 Section 15.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 324
A.37 Section 15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 325
A.38 Section 17.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 326
A.39 Section 17.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 326
A.40 Section 17.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . 328
A.41 Section 18.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . 329
A.42 "General/Comprehensive" Exercises . . . . . . . . . . . . . . . . 329

Part I
Introduction, Generalities, and
Some Background Material
1 Overview/Context
1.1 Notation and Terminology
These notes are about "statistics for 'big data'" (AKA "machine learning" and
"data analytics"). We begin with the standard statistical notation and set-up
where one has data from N cases on p or p + 1 variables, $x_1, x_2, \ldots, x_p$ and
possibly y, portrayed below:

$$\begin{array}{c|ccccc}
 & x_1 & x_2 & \cdots & x_p & y \\ \hline
\text{Case } 1 & x_{11} & x_{12} & \cdots & x_{1p} & y_1 \\
\text{Case } 2 & x_{21} & x_{22} & \cdots & x_{2p} & y_2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\text{Case } N & x_{N1} & x_{N2} & \cdots & x_{Np} & y_N
\end{array}$$

In statistical machine learning, this dataset is typically called the training


dataset and we’ll call it T . Variables are often referred to as features, and
cases are sometimes called instances. We’ll use standard matrix (and linear
models) notation, beginning with
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$$
for the case/row i set of x values (in column vector form unless otherwise indicated) and
$$\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}, \qquad \underset{N \times 1}{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \text{and} \qquad T = (X, Y)$$
for the training data.
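
(For readers who like to see the conventions in code, here is a minimal illustrative sketch, not part of the original development, using Python/NumPy with made-up numbers; all names are arbitrary.)

```python
import numpy as np

# A tiny artificial training set with N = 5 cases and p = 3 input variables.
# Rows of X are the (transposed) input vectors x_i'; Y holds the responses y_i.
N, p = 5, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(N, p))   # N x p matrix of inputs
Y = rng.normal(size=(N, 1))   # N x 1 column of outputs

x_2 = X[1, :].reshape(p, 1)   # case 2's input vector, as a p x 1 column vector
T = (X, Y)                    # the training set, the pair (X, Y)
```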


As in all of statistics, the basic objective is identifying, describing, and
enabling the practical use of simple (low-dimensional/low-order) structure rep-
resented in the $N \times p$ or $N \times (p+1)$ data array. Most "classical" statistical
methods are implicitly aimed at situations where

both N and p are small (data are scarce), and


quantifications of the level of information the data provide about parame-
ters of a probability model are central.

Here we treat cases where at least one of N or p can be large and there is little
fundamental interest in model parameters or exactly how much we know about
them.
The ambivalence toward making statements concerning parameters of any
probability models employed (including those that would describe the amount
of variation in observables the model generates) is a fundamental difference be-
tween a machine learning point of view and that common in basic graduate
statistics courses. This posture is perhaps sensible enough, as careful examina-
tion of a large training set will usually show that standard (tractable) probability
models are highly imperfect descriptions of complex situations.
Standard versions of problems addressed here are:

supervised learning problems1 , where there is a response/output vari-


able or target, y, and the problem is one of finding a function of p inputs
x, f(x), that approximates y. When the form of f depends on the train-
ing set, we'll write $\hat{y} = \hat{f}(x)$. Where y is a measured/continuous variable
the problem is typically called prediction. Where y takes values in a
finite set like $\{1, 2, \ldots, K\}$, the problem is typically called classification
(or sometimes pattern recognition). In these notes, we will on occasion
wish to refer simultaneously to both standard forms of supervised learning
and may then treat the word "prediction" as including both cases.
unsupervised learning problems, where there is no response variable.
The general objective here is then to identify relationships among the p
variables x or commonalities in segments of the N cases, i.e. interpretable
low-order structure in the data. Standard versions of this are clustering,
principal components analysis, and multi-dimensional scaling.

1.2 What is New Here (Particularly in Prediction)?


Reasonable questions here are "What is the big deal?" and "What new issues
arise in ‘statistics for big data’?"
Where N and/or p are large, limitations on computing time or computer
memory can make straightforward implementation of standard methods imprac-
tical or even impossible. Sometimes more clever implementations (for example
employing parallelization or use of specialized hardware) make application of
standard methods feasible. (These are matters that won’t be much considered
in these notes.) At other times, new methods need to be developed.

1.2.1 Matching Complexity to Training Set Information Content


Where N is big and p is small, standard statistical prediction methods (like
multiple linear regression) will produce precisely fit but relatively crude pre-
dictors ... whose forms, while perhaps adequate as first approximations to a
real relationship between x and y (and about all that can be fit based on a
1 In this context the input variables x are "covariates" in standard statistical parlance.

small dataset), fail to really make full use of the available information. There
is the possibility of either increasing "p" by (implicitly or explicitly) building
additional features from existing ones and/or simply using more sophisticated
and flexible forms for prediction (that go beyond, for example, the basic linear
form in the input variables of multiple linear regression). But there is also
the potential to "over-do" and effectively make p too large or the predictor too
flexible. One must somehow match predictor complexity to the real
information content of a (large) training set. It is this need and the
challenge it represents that makes the area interesting and important.

1.2.2 The "Curse of Dimensionality"


If p is at all big, $\Re^p$ is "huge" and our intuition about how many cases would
be required to "fill up" even an intuitively small part of p-space is very poor.
Essentially any dataset with large p is necessarily "sparse." There are many
ways of framing this inescapable sparsity. Some simple ones involve facts about
uniform distributions on the unit ball in $\Re^p$
$$\{x \in \Re^p \mid \|x\| \le 1\}$$
and on a unit cube centered at the origin
$$[-.5, .5]^p$$
For one thing, "most" of these distributions are very near the surface of the
solids. The cube $[-r, r]^p$ capturing (for example) half the volume of the cube
(half the probability mass of the distribution) has
$$r = .5(.5)^{1/p}$$
which converges to $.5$ as p increases. Essentially the same story holds for the
uniform distribution on the unit ball. The radius capturing half the probability
mass has
$$r = (.5)^{1/p}$$
which converges to 1 as p increases. Points uniformly distributed in these
regions are mostly near the surface or boundary of the spaces.
Another interesting calculation concerns how large a sample must be in order
for points generated from the uniform distribution on the ball or cube in an iid
fashion to tend to "pile up" anywhere. Consider the problem of describing the
distance from the origin to the closest of N points drawn iid uniformly from the
p-dimensional unit ball. With

$$R = \text{the distance from the origin to a single random point},$$
R has cdf
$$F(r) = \begin{cases} 0 & r < 0 \\ r^p & 0 \le r \le 1 \\ 1 & r > 1 \end{cases}$$
So if $R_1, R_2, \ldots, R_N$ are iid with this distribution, $M = \min\{R_1, R_2, \ldots, R_N\}$ has cdf
$$F_M(m) = \begin{cases} 0 & m < 0 \\ 1 - (1 - m^p)^N & 0 \le m \le 1 \\ 1 & m > 1 \end{cases}$$
This distribution has, for example, median
$$F_M^{-1}(.5) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/p}$$
For, say, p = 100 and $N = 10^6$, the median of the distribution of M is .87.
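
(The figure .87 is easy to verify numerically; the following small sketch, illustrative only, simply evaluates the median formula above.)

```python
import numpy as np

# Median of M = minimum distance from the origin among N iid uniform points
# in the p-dimensional unit ball: F_M^{-1}(1/2) = (1 - (1/2)^(1/N))^(1/p).
p, N = 100, 10**6
median_M = (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)
print(round(median_M, 2))   # approximately 0.87
```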
In retrospect, it’s not really all that hard to understand that even "large"
sets of points in $\Re^p$ must be sparse. After all, it's perfectly possible for p-vectors
u and v to agree perfectly on all but one coordinate and be far apart in p-space.
There simply are a lot of ways for two p-vectors to differ!
In addition to these kinds of considerations of sparsity, there is the fact
that the potential complexity of functions of p variables explodes exponentially
in p. CFZ point this out and go on to note that for large p, all datasets
exhibit multicollinearity (or its generalization to non-parametric fitting) and its
accompanying problems of reliable fitting and extrapolation. These and related
issues together constitute what is often called the curse of dimensionality.
The curse implies that even "large" N doesn’t save one and somehow make
practical large-p prediction only a trivial application of standard parametric or
non-parametric regression methodology. And when p is large, it is essentially
guaranteed that if one uses a method that is "too" flexible in terms of the rela-
tionships between x and y that it permits, one such relationship will be found,
real/fundamental/reproducible or not. That is, the (common for large p) possibility that a dataset
is (sparse and) not really adequate to support the use of a (flexible) supervised
statistical learning method can easily lead to overfitting. This is the pres-
ence of what appears to be a strong pattern in a (sparse) training set that
generalizes/extrapolates poorly to cases outside the training set.
In light of the foregoing, one standard way of choosing among various "big
data" statistical procedures for a given dataset is to define both 1) a reliable
measure of estimated/predicted performance (like an estimated prediction mean
square error) and 2) a measure of complexity (like an "effective number of fitted
parameters") for a predictor. Then one attempts to optimize (by choice of
complexity) the predicted performance. In light of the overfitting
issue, the method predicting performance almost always employs some form of
"holdout" sample, whereby performance is evaluated using data not employed
in fitting/predictor development.2
2 This approach potentially addresses the detection of both overfitting and "model bias"
(where a fitted form is simply not adequate to represent the relationship between input vari-
ables and a target).

1.3 Some Initial Generalities About Prediction
1.3.1 Representing What is Known: Creating a Training Set for
Prediction
We began exposition with an $N \times (p+1)$ data matrix conceptually already in
hand. It is important to say that in real predictive analytics problems, the
reduction of all information available and potentially relevant to explaining y to
values of p predictor variables3 (that encode relevant "features" of the N cases) is
an essential and highly critical activity. If one defines good features/variables
(ones that effectively and parsimoniously represent the N cases), then sound
statistical methodology has a chance of being practically helpful. Poor initial
choice of features limits how well one can hope to do in prediction.
This is particularly important to bear in mind where information from many
disparate databases or sources is used to create the training set/data matrix T
available for statistical analysis. In this way, in many applications of modern
data analytics the hard work begins substantially before the formal technical
subjects addressed in these notes come into play, and the quality of the work in
those initial steps is critical to ultimate success. All that follows in these
notes takes the particular form of training set adopted by a data
analyst as given, and that choice governs and limits what is possible
in terms of effective prediction.
We should also note that in a typical analytics problem, variables represented
by the columns of a data matrix are in different units and often represent con-
ceptually different kinds of quantities (e.g., one might represent a voltage while
another represents a distance and another represents a temperature). In some
kinds of analyses this is completely natural and causes no logical problems. But
in others (particularly ones based on inner products of data vectors or distances
between them and/or where sizes of multipliers of particular variables in a linear
combination of those variables are important) one gets fundamentally different
results depending upon the scales used.
One surely doesn’t want to be in the position of having ultimate predictions
depend upon whether a distance (represented by a coordinate of x) is expressed
in km or in nm. And the whole notion of the <2 distance between two data
vectors where the …rst coordinate of each is a voltage
q and the second is a tem-
2 2
perature seems less than attractive. (What is (3 kV) + (2 K) supposed to
mean?)
A sensible approach to eliminating logical di¢ culties that arise in using
methods where scaling/units of variables matters, is to standardize predictors
x (and center any quantitative response variable, y) before beginning analysis.
That is, if a raw feature x has in the training set a sample standard deviation4
3 This is at least one common meaning of the term "data mining."
4 While it doesn't really matter which one uses, the "N" divisor in place of the "N - 1"
divisor seems slightly simpler as it makes the columns have $\Re^N$ norm $\sqrt{N}$ as opposed to norm $\sqrt{N-1}$.

$s_x$ and a sample mean $\bar{x}$, one replaces it with a feature
$$x' = \frac{x - \bar{x}}{s_x}$$
(thereby making all features unit-less). Conclusions about standardized input
$x'$ and centered response $y' = y - \bar{y}$ then translate naturally to conclusions about
the raw variables via
$$x = s_x x' + \bar{x} \quad \text{and} \quad y = y' + \bar{y}$$
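
(In code, the standardization and back-transformation look like the following minimal sketch; this is illustrative Python, and the use of the N divisor follows the footnote above.)

```python
import numpy as np

def standardize(x):
    """Center and scale a raw feature vector, returning x', x-bar, and s_x."""
    xbar = x.mean()
    s_x = x.std()            # the "N" divisor, as in the footnote
    return (x - xbar) / s_x, xbar, s_x

x = np.array([3.1, 2.7, 5.0, 4.2, 3.8])     # a raw feature, in arbitrary units
x_prime, xbar, s_x = standardize(x)

# conclusions about x' translate back to the raw scale via x = s_x * x' + x-bar
x_back = s_x * x_prime + xbar
assert np.allclose(x_back, x)
```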

1.3.2 Theoretically Optimal (Unrealizable) Predictors


In the context of supervised learning and the objective of choosing $f(x)$ to
track y, suppose that P is a ((p+1)-dimensional) distribution for $(x, y)$ and
$L(\hat{y}, y) \ge 0$ is a loss function for penalizing prediction/classification $\hat{y}$ when y
holds. Let "E" be the P expectation operator. (Unless specifically noted to
the contrary, all expectations refer to distributions and conditional distributions
derived from P. In general, if we need to remind ourselves what variables are
being treated as random in probability, expectation, conditional probability, or
conditional expectation computations, we will superscript E or P with their
names.) Write $E[\,\cdot\,|x]$ for conditional expectation and $\mathrm{Var}[\,\cdot\,|x]$ for conditional
variance (based on the conditional distribution of $y|x$) derived from P.
As a thought experiment (not yet as anything based on the training data)
consider choosing a functional form f (NOT yet $\hat{f}$) to minimize risk (or "prediction error")
$$E L(f(x), y)$$
In theory (given P) this is "easy." One writes the expectation in iterated fashion,
$$E L(f(x), y) = E\, E[L(f(x), y)\,|\,x]$$
and notes that an optimal $f(x)$ is thus
$$f(x) = \arg\min_a E[L(a, y)\,|\,x] \tag{1}$$
the action/prediction that minimizes conditional (on the value of x) expected (over y) loss.

SEL In the simple case of squared error loss, i.e. where
$$L(\hat{y}, y) = (\hat{y} - y)^2$$
an optimal f in display (1) is then well-known to be
$$f(x) = E[y\,|\,x]$$
the conditional mean of $y|x$.

Classification In a classification context, where y takes values in $G = \{1, 2, \ldots, K\}$
(or, completely equivalently, $G = \{0, 1, \ldots, K-1\}$), one might use the (0-1) loss function
$$L(\hat{y}, y) = I[\hat{y} \ne y]$$
An optimal f corresponding to form (1) is then
$$\begin{aligned} f(x) &= \arg\min_a \sum_{v \ne a} P[y = v\,|\,x] \\ &= \arg\max_a P[y = a\,|\,x] \\ &= \arg\max_a P[y = a]\, p(x\,|\,a) \end{aligned} \tag{2}$$
(where $p(x|y)$ is a density for the class-conditional distribution of $x|y$).


A simple generalization of 0-1 loss is one that for different values of y charges
potentially different losses $l_y \ge 0$ when $\hat{y} \ne y$, that is,
$$L(\hat{y}, y) = l_y I[\hat{y} \ne y]$$
Essentially the same argument as above implies that an optimal f for this possibly asymmetric loss is
$$\begin{aligned} f(x) &= \arg\min_a \sum_{v \ne a} l_v P[y = v\,|\,x] \\ &= \arg\max_a l_a P[y = a\,|\,x] \\ &= \arg\max_a l_a P[y = a]\, p(x\,|\,a) \end{aligned}$$

Another Problem in Classification Models In a classification model as
immediately above, one might have in mind assessment of the set of likelihoods
that y = k based on x. That would call for the making of a K-dimensional
predictor $\hat{\boldsymbol{y}}$ and appropriate definition of a loss. One simple possible loss is a
sum of squared errors
$$L(\hat{\boldsymbol{y}}, y) = \sum_{k=1}^{K} (\hat{y}_k - I[y = k])^2$$
for which it is easy to show that an optimal vector predictor is
$$\boldsymbol{f}(x) = (P[y = 1\,|\,x], P[y = 2\,|\,x], \ldots, P[y = K\,|\,x]) \tag{3}$$
For many purposes (see, e.g., Section 8.2) a more appropriate loss is the "cross-entropy loss"
$$L(\hat{\boldsymbol{y}}, y) = -\sum_{k=1}^{K} I[y = k] \ln \hat{y}_k \tag{4}$$
What is perhaps not immediately obvious is that a Lagrange multiplier argu-
ment shows that subject to the constraint that the $\hat{y}_k$ are positive and sum to
1, the vector predictor (3) is also optimal for cross-entropy loss (4).
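
To make display (2) concrete, the sketch below evaluates the optimal 0-1 loss classifier in a hypothetical two-class problem in which the class priors and the class-conditional densities $p(x|0)$ and $p(x|1)$ are simply assumed known (univariate normal); every number here is invented for illustration and is not part of the original development.

```python
from scipy.stats import norm

# Hypothetical two-class model: priors P[y=0], P[y=1] and normal class-
# conditional densities p(x|0), p(x|1).  None of these numbers come from data.
prior = {0: 0.7, 1: 0.3}
dens = {0: norm(loc=0.0, scale=1.0).pdf, 1: norm(loc=2.0, scale=1.0).pdf}

def f_opt(x):
    """Optimal 0-1 loss classifier: arg max_a P[y=a] p(x|a), as in display (2)."""
    return max(prior, key=lambda a: prior[a] * dens[a](x))

print(f_opt(0.5), f_opt(2.5))   # class 0 for small x, class 1 for large x
```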

1.3.3 Nearest Neighbor Rules
One idea that creates a spectrum of predictor flexibilities (including extremely
high ones) is to operate completely non-parametrically and to think that if N
is "big enough," something like
$$\frac{1}{\#\text{ of } x_i = x} \sum_{i \text{ with } x_i = x} y_i \tag{5}$$
might work as an approximation to $E[y|x]$ in SEL prediction problems and
$$\frac{1}{\#\text{ of } x_i = x} \sum_{i \text{ with } x_i = x} I[y_i = a] \tag{6}$$
might work as an approximation for $P[y = a\,|\,x]$ in a K-class classification model.


But almost always (unless N is huge and the distribution of x is discrete) the
number of $x_i = x$ is 0 or at most 1 (and absolutely no extrapolation beyond the
set of training inputs is possible). So some modification is typically required.
The condition
$$x_i = x$$
might be replaced with
$$x_i \approx x$$
in expression (5) and/or (6).
One form of this is to first define for each x the "k-neighborhood"
$$n_k(x) = \text{the set of } k \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \Re^p$$

A k-nearest neighbor (k-nn) approximation to $E[y|x]$ is then
$$m(x) = \frac{1}{k} \sum_{i \text{ with } x_i \in n_k(x)} y_i$$
suggesting the SEL prediction rule
$$\hat{f}(x) = m(x)$$
Similarly, a k-nearest neighbor approximation to an optimal 0-1 loss classifica-
tion rule in a K-class classification model is
$$\hat{f}(x) = \arg\max_a \sum_{i \text{ with } x_i \in n_k(x)} I[y_i = a]$$
One might hope that upon allowing k to increase with N (provided that P is
not too bizarre–one is counting, for example, on the continuity of E[y|x] in x)
these could be effective predictors. They are surely (for small k) highly flexible
predictors and they and things like them often fail to be effective because of
the curse of dimensionality. (In high dimensions, k-neighborhoods are almost
always huge in terms of their extent. There are simply too many ways that a
pair of training inputs $x_i$ can differ.)
It is worth noting that
$$m_a(x) = \frac{1}{k} \sum_{i \text{ with } x_i \in n_k(x)} I[y_i = a]$$
is a k-nearest neighbor approximation to
$$E\bigl[\,I[y = a]\,\big|\,x\bigr] = P[y = a\,|\,x]$$
in a K-class model, and for some purposes knowing this is more useful than
knowing the 0-1 loss k-nn classification rule.
Ultimately, one should view the k-nn idea as an important, almost decep-
tively simple, and highly useful one. k-nn rules are approximately optimal
predictors (for both SEL and 0-1 loss problems) that span a full spectrum of
complexities/flexibilities specified by the simple parameter k (the neighborhood
size). Whether or not they can be effective in a given application depends upon
the size of p and N and the extent to which there is some useful structure latent
in the distribution of xs in the input space (mitigating the effects of the curse
of dimensionality).
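
(A brute-force illustrative sketch of the k-nn predictors above, using Euclidean distance and entirely synthetic data, follows; it is a sketch, not a recommended implementation.)

```python
import numpy as np

def knn_predict(x, X, Y, k, classify=False):
    """k-nn approximation to E[y|x] (SEL) or to the 0-1 loss optimal classifier."""
    d = np.linalg.norm(X - x, axis=1)          # distances from x to each x_i
    nbrs = np.argsort(d)[:k]                   # indices of the k-neighborhood n_k(x)
    if classify:
        vals, counts = np.unique(Y[nbrs], return_counts=True)
        return vals[np.argmax(counts)]         # plurality vote over the neighborhood
    return Y[nbrs].mean()                      # m(x), the neighborhood average

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))          # synthetic inputs, p = 2
Y = X[:, 0] ** 2 + rng.normal(0, 0.1, 200)     # synthetic responses
print(knn_predict(np.array([0.5, 0.0]), X, Y, k=10))   # roughly 0.25
```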

1.3.4 General Decomposition of the Expected Prediction Loss for $\hat{f}$


Now suppose that the training data $(x_i, y_i)$ for $i = 1, \ldots, N$ are iid according
to P, independent of a single $(x, y)$ that is also P distributed.5 Write $E_T$ for
averaging with respect to $P^N$ (i.e. for averaging out over the training data),
and $E_{(x,y)}$ for averaging with respect to the distribution P of $(x, y)$.
For $\hat{f}$ a predictor based on the training data T (a function of both x and
T), a measure of average effectiveness of $\hat{f}$ is the prediction error6
$$\mathrm{Err} \equiv E_T E_{(x,y)} L\bigl(\hat{f}(x), y\bigr) \tag{7}$$

If $f(x)$ is a theoretically optimal predictor of y under loss L and joint distri-
bution P, training set T is used to select a function (say $g_T$) from a class of
functions $S = \{g\}$ having expected loss $E_{(x,y)} L(g(x), y) < \infty$, and ultimately
one uses as a predictor
$$\hat{f}(x) = g_T(x),$$
the situation is as in the cartoon in Figure 1, where $g^*$ is a minimizer of
$E_{(x,y)} L(g(x), y)$ across $g \in S$.
5 We will typically abuse notation and write $(x, y)$ instead of $(x', y)$ despite the fact that
by convention x is a column vector.


6 This quantity is sometimes called the "test error" and "generalization error" and these

names will also be used in our exposition and problem sets.

Figure 1: Optimal, Restricted Optimal, and Fitted Predictors

The optimal $f(x)$ is potentially (likely) outside of S. The "closest" one
can get to it inside of S is $g^*$, and lacking full knowledge of P one can only
approximate this best element of S by the random (as the choice depends upon
the training set T) $\hat{f} = g_T$ (that is no better than $g^*$ for any training set!).
So, since here
$$\mathrm{Err} = E_T E_{(x,y)} L\bigl(\hat{f}(x), y\bigr) = E_T E_{(x,y)} L(g_T(x), y)$$
we have
$$\mathrm{Err} = E_{(x,y)} L(f(x), y) + \bigl[E_{(x,y)} L(g^*(x), y) - E_{(x,y)} L(f(x), y)\bigr] + \bigl[E_T E_{(x,y)} L(g_T(x), y) - E_{(x,y)} L(g^*(x), y)\bigr] \tag{8}$$
This says that Err for the training-set-dependent predictor is the sum of three
terms. The first is the minimum possible error. The second is the non-negative
difference between the best that is possible using a predictor constrained to be
an element of S and the absolute best that is possible. The third is the non-
negative difference between Err (that involves averaging over the training-set-
directed random choices of elements from S, none of which can have average
loss over $(x, y)$ better than that of $g^*$) and the best that is possible using a
predictor constrained to be an element of S (namely the average loss of $g^*(x)$).
So relationship (8) might be rewritten as
$$\mathrm{Err} = \text{minimum expected loss possible} + \text{modeling penalty} + \text{fitting penalty}$$
Err can be inflated because S is too small (inducing model bias) or because
the sample size and/or fitting method are inadequate to make $g_T$ consistently
approximate $g^*$.

1.3.5 A More Detailed Decomposition for Err in SEL Prediction and
Variance-Bias Trade-off
In the context of squared error loss, a more detailed decomposition of Err pro-
vides additional insight into the difficulty faced in building effective predictors.
Note that a measure of the effectiveness of the predictor $\hat{f}$ at x (under
squared error loss) is what we might call
$$\mathrm{Err}(x) \equiv E_T E\Bigl[\bigl(\hat{f}(x) - y\bigr)^2 \,\Big|\, x\Bigr] \tag{9}$$
For some purposes, other conditional versions of Err might be useful and appropriate. For example
$$\mathrm{Err}_T \equiv E_{(x,y)}\bigl(\hat{f}(x) - y\bigr)^2$$
is another kind of prediction or test error (that is a function of the training
data). (What one would surely like–but surely cannot have–is a guarantee that
$\mathrm{Err}_T$ is small uniformly in T.) Note that in these notations what we have
called
$$\mathrm{Err} \equiv E_x \mathrm{Err}(x) = E_T \mathrm{Err}_T \tag{10}$$
is a number, an expected squared difference between target and prediction.
In any case, a useful decomposition of Err(x) in display (9) is
$$\begin{aligned} \mathrm{Err}(x) &= E_T\bigl(\hat{f}(x) - E[y|x]\bigr)^2 + E\bigl[(y - E[y|x])^2 \,\big|\, x\bigr] \\ &= E_T\bigl(\hat{f}(x) - E_T\hat{f}(x)\bigr)^2 + \bigl(E_T\hat{f}(x) - E[y|x]\bigr)^2 + \mathrm{Var}[y|x] \\ &= \mathrm{Var}_T\hat{f}(x) + \bigl(E_T\hat{f}(x) - E[y|x]\bigr)^2 + \mathrm{Var}[y|x] \end{aligned} \tag{11}$$
The first quantity in this decomposition, $\mathrm{Var}_T\hat{f}(x)$, is the variance of the
prediction at x. The second term, $\bigl(E_T\hat{f}(x) - E[y|x]\bigr)^2$, is a kind of squared
bias of prediction at x. And $\mathrm{Var}[y|x]$ is an unavoidable variance in outputs
at x. Highly flexible prediction forms may give small prediction biases at the
expense of large prediction variances. One may need to balance the two off
against each other when looking for a good predictor.
Now from expressions (10) and (11)
$$\mathrm{Err} = E_x \mathrm{Err}(x) = E_x \mathrm{Var}_T\hat{f}(x) + E_x\bigl(E_T\hat{f}(x) - E[y|x]\bigr)^2 + E_x \mathrm{Var}[y|x] \tag{12}$$
The first term on the right here is the average (according to the marginal of x)
of the prediction variance at x. The second is the average squared prediction
bias. And the third is the average conditional variance of y (and is not under
the control of the analyst choosing $\hat{f}(x)$). Consider a further decomposition of
the second term.
Suppose that T is used to select a function (say $g_T$) from some linear sub-
space, say $S = \{g\}$, of the space of functions h with $E_x(h(x))^2 < \infty$, and that
ultimately one uses as a predictor
$$\hat{f}(x) = g_T(x)$$
Since linear subspaces are convex,
$$\bar{g} \equiv E_T g_T = E_T \hat{f} \in S$$
Further, suppose that
$$g^* \equiv \arg\min_{g \in S} E_x\bigl(g(x) - E[y|x]\bigr)^2$$
is the projection of (the function of x) E[y|x] onto the space S. Then write
$$h(x) = E[y|x] - g^*(x)$$
so that
$$E[y|x] = g^*(x) + h(x)$$
Then, it's a consequence of the facts that $E_x(h(x)g(x)) = 0$ for all $g \in S$
and therefore that $E_x(h(x)\bar{g}(x)) = 0$ and $E_x(h(x)g^*(x)) = 0$, that
$$\begin{aligned} E_x\bigl(E_T\hat{f}(x) - E[y|x]\bigr)^2 &= E_x\bigl(\bar{g}(x) - (g^*(x) + h(x))\bigr)^2 \\ &= E_x\bigl(\bar{g}(x) - g^*(x)\bigr)^2 + E_x\bigl(h(x)\bigr)^2 - 2E_x\bigl((\bar{g}(x) - g^*(x))\,h(x)\bigr) \\ &= E_x\bigl(E_T\hat{f}(x) - g^*(x)\bigr)^2 + E_x\bigl(E[y|x] - g^*(x)\bigr)^2 \end{aligned} \tag{13}$$

The first term on the right in the last line of display (13) is an average squared
fitting bias, measuring how well the average (over T) predictor function approx-
imates the element of S that best approximates the conditional mean function.
This is a measure of how appropriately the training data are used to pick out
elements of S. The second term on the right is an average squared model bias,
measuring how well it is possible to approximate the conditional mean function
E[y|x] by an element of S. This is controlled by the size of S, or effectively the
flexibility allowed in the form of $\hat{f}$. Average squared prediction bias can thus
be large because the form fit is not flexible enough, or because a poor fitting
method is employed.
Then using expressions (12) and (13)
$$\mathrm{Err} = E_x \mathrm{Var}[y|x] + E_x\bigl(E[y|x] - g^*(x)\bigr)^2 + E_x\bigl(E_T\hat{f}(x) - g^*(x)\bigr)^2 + E_x \mathrm{Var}_T\hat{f}(x)$$
So this SEL decomposition of Err is related to the general one in display (8) in
that
$$\text{minimum expected loss possible} = \text{expected (across } x\text{) response variance} = E_x \mathrm{Var}[y|x],$$
$$\text{modeling penalty} = \text{expected (across } x\text{) squared model bias} = E_x\bigl(E[y|x] - g^*(x)\bigr)^2,$$
and
$$\text{fitting penalty} = \text{expected (across } x\text{) squared fitting bias} + \text{expected (across } x\text{) prediction variance} = E_x\bigl(E_T\hat{f}(x) - g^*(x)\bigr)^2 + E_x \mathrm{Var}_T\hat{f}(x)$$

The facts that

1. what is under the control of a data analyst, namely the modeling and
fitting penalties, has elements of both bias and variance and

2. complex predictors tend to have low bias and high variance in comparison
to simple ones

lead to the necessity of balancing these elements in predictor development and
the so-called variance-bias trade-off. Once more, in qualitative terms, it is

the so-called variance-bias trade-o¤. Once more, in qualitative terms, it is
the matching of predictor complexity to real information content of
a training set that is at issue here.

1.3.6 Approximating Err and Cross-Validation


In rough terms, standard methods of constructing predictors all have associated
"complexity parameters" (like k for k-nearest neighbor methods, numbers and
types of features/basis functions or "ridge parameters" used in regression meth-
ods, and penalty weights/band-widths/neighborhood sizes applied in smoothing
methods) that are at the choice of a user. Depending upon the choice of com-
plexity, one gets more or less flexibility in the form $\hat{f}$. If a choice of complexity
doesn't allow enough flexibility in the form of a predictor, underfit occurs and
there is large bias in prediction. On the other hand, if the choice allows too much
flexibility, bias may be reduced, but the price typically paid is large variance of
prediction and overfit. It is a theme that runs through this material that
complexity must be chosen in a way that balances variance and bias
for the particular combination of N and p and general circumstance
one faces. That choice of predictor complexity of course depends upon reliable
means of assessing (the unknown theoretical test error) Err in display (7).

The most obvious/elementary means of approximating Err is the so-called
"training error"
$$\mathrm{err} = \frac{1}{N}\sum_{i=1}^{N} L\bigl(\hat{f}(x_i), y_i\bigr) \tag{14}$$
The problem is that err is no good estimator of Err (or any other sensible quan-
tification of predictor performance). It typically decreases with increased com-
plexity (without an increase for large complexity), and fails to reliably indicate
performance outside the training sample. The situation is like that portrayed
in Figure 2.

Figure 2: Cartoon Portraying Err and err as Functions of Predictor Complexity

The fundamental point here is that one cannot both "fit" and "test" on the
same dataset and arrive at a reliable assessment of predictor efficacy. Behaving
in such manner will almost always suggest use of a predictor that is too complex
and has a relatively large "test error" Err.
The existing practical options for evaluating likely performance of a predictor
(and guiding choice of appropriate complexity) then include the following.

1. One might employ some function of err that is a better indicator of likely
predictor performance, like Mallows' $C_p$, "AIC," and "BIC."
2. In genuinely large N contexts, one might hold back some random sample of
the training data to serve as a "test set," fit to produce $\hat{f}$ on the remainder,
and use
$$\frac{1}{\text{size of the test set}} \sum_{i \,\in\, \text{the test set}} L\bigl(\hat{f}(x_i), y_i\bigr)$$
to indicate likely predictor performance.

3. One might employ sample re-use methods to estimate Err and guide choice
of complexity. Cross-validation and bootstrap ideas are used here.

We’ll say more about these possibilities later, but here describe the most
important of them, so-called cross-validation. K-fold cross-validation consists
of

1. randomly breaking the training set into K disjoint roughly equal-sized
pieces ("folds"), say $T_1, T_2, \ldots, T_K$,
2. training on each of the reduced training sets $T - T_k$ (that we will call
corresponding "remainders") to produce K predictors $\hat{f}^k$,
3. letting $k(i)$ be the index of the fold $T_k$ containing training case i, and
computing the cross-validation error
$$CV\bigl(\hat{f}\bigr) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(\hat{f}^{k(i)}(x_i), y_i\bigr) \tag{15}$$
that one hopes approximates Err.

(This is roughly the same as fitting on each remainder $T - T_k$ and correspond-


ingly testing on fold $T_k$, and then averaging. When N is a multiple of K, these
are exactly the same.) Assuming that one has randomized the order of the
cases in a training set, Figure 3 portrays the K folds and how cross-validation
proceeds.

Figure 3: Schematic for K-fold cross-validation.
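
(As a concrete and purely illustrative rendering of display (15), the sketch below computes a K-fold cross-validation error under squared error loss, with ordinary least squares standing in as the fitting method and all data synthetic.)

```python
import numpy as np

def cv_error(X, Y, fit, predict, K=5, seed=0):
    """K-fold cross-validation error as in display (15), under squared error loss."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(Y)) % K          # random assignment of cases to K folds
    losses = np.empty(len(Y))
    for k in range(K):
        train, test = fold != k, fold == k      # remainder T - T_k and fold T_k
        f_hat_k = fit(X[train], Y[train])       # train on the remainder
        losses[test] = (predict(f_hat_k, X[test]) - Y[test]) ** 2
    return losses.mean()

# illustration: ordinary least squares (with an intercept) as the fitting method
fit = lambda X, Y: np.linalg.lstsq(np.c_[np.ones(len(Y)), X], Y, rcond=None)[0]
predict = lambda b, X: np.c_[np.ones(len(X)), X] @ b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, 100)
print(cv_error(X, Y, fit, predict, K=5))        # an estimate of Err
```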

The choice K = N is called "leave one out (LOO) cross-validation" and in


this case there are sometimes slick computational ways of evaluating $CV(\hat{f})$.
This has been true for ordinary least squares SEL prediction for some time and
recent work of Zou and Wang has provided results for classification problems as
well. As discussed more completely in Section 16.3, cross-validation actually
estimates Err for a training set of size approximately $N(1 - K^{-1})$, so there is
potential bias7 that typically decreases with increasing K, making LOO cross-
validation attractive from this point of view.
7 Note that for the use of cross-validation error to identify appropriate complexity, bias is

a problem only if it is not constant across choices of predictors and their complexities.

Notice that unless K = N, even for fixed training set T, $CV(\hat{f})$ is random,
owing to its dependence upon random assignment of training cases to folds. It
is thus highly attractive in cases where K < N is used, to replace $CV(\hat{f})$ with
an average cross-validation error (say $\overline{CV}(\hat{f})$) derived from a large number of
repeated splittings of T into K folds. The caret package in R (and, presum-
ably, similar packages in other systems) facilitates this repeated cross-validation
for a variety of prediction methods and effectively replaces $CV(\hat{f})$ with its ex-
pected value across randomizations. The fact that this averaging (and related
computational burden) is not needed for LOO cross-validation is another reason
to find it attractive.
LOO cross-validation has been portrayed in the statistical folklore as suf-
fering from a large variance. The argument has been that because its $\hat{f}^k$ are
all built on nearly the same training sets and produce similar predictions, the
averaging done in computing $CV(\hat{f})$ might be relatively ineffective in reduc-
ing variance. This logic has been thought to motivate bias-variance trade-off
considerations for representation of Err, making K = 5 and K = 10 popular
choices in practice. But the recent work of Zou and Wang has strongly called
into question the truth of the folklore and makes a convincing case for LOO
cross-validation when it is feasible.

1.3.7 Choosing a Predictor Based on Cross-Validation


One popular rule of thumb for choosing between predictors of differing complex-
ities on the basis of a single K-fold cross-validation for each (with K < N) has
been this. For the complexity producing the smallest realized cross-validation
error, one computes a "standard error" for the prediction error. That is, for each
fold $T_k$, one computes a kth "test error" (call it $CV_k(\hat{f})$) for $\hat{f}^k$ (referred to in
step 2.) obtained by fitting on remainder $T - T_k$, evaluating on $T_k$. Then for
$SD_K$ the sample standard deviation of $CV_1(\hat{f}), CV_2(\hat{f}), \ldots, CV_K(\hat{f})$, the
"standard error" of interest is $SD_K/\sqrt{K}$. One then selects for use the least
complex predictor with its own corresponding cross-validation error no larger
than
$$CV\bigl(\hat{f}\bigr) + SD_K/\sqrt{K}$$

This is sometimes called the "one standard error rule of thumb" and is presum-
ably motivated by recognition of the uncertainty involved in cross-validation
(deriving from the randomness of 1) the selection of the training set and 2) the
partitioning of it into folds) and the desire to avoid overfitting. But (in light
of the dependence of the $CV_k(\hat{f})$) the validity of the supposed standard error
is at best quite approximate, and then the appropriateness of a "one standard
error rule" is not at all obvious.
The most obvious, aggressive, and logically defensible way of using $CV(\hat{f})$

(or $\overline{CV}(\hat{f})$) to choose a predictor is to simply use the $\hat{f}$ minimizing the function
$CV(\,\cdot\,)$ (or $\overline{CV}(\,\cdot\,)$). We will call this way of operating a "pick-the-(cross-
validation error)-winner rule."
It is an important and somewhat subtle point that if
$$\tilde{f} = \arg\min_{\hat{f}} CV\bigl(\hat{f}\bigr)$$
the minimum cross-validation error $CV(\tilde{f})$ (or $\overline{CV}(\tilde{f})$) is not a valid cross-
validation error for a pick-the-winner rule!8 The issue is that while $CV(\hat{f})$ (or
$\overline{CV}(\hat{f})$) can legitimately guide the choice of $\hat{f}$, its use is then actually part of a
larger program of "predictor development" than that represented by any single
argument of $CV(\,\cdot\,)$ (or $\overline{CV}(\,\cdot\,)$). That being the case, in order to assess the
likely performance of $\tilde{f}$ via cross-validation, inside each remainder $T - T_k$
one must
1. split into K folds,
2. fit on the K remainders,
3. predict on the folds and make a cross-validation error,
4. pick a winner for the function in 3., say $\tilde{f}^k$, and
5. then predict on $T_k$ using $\tilde{f}^k$.
It is the values $\tilde{f}^{k(i)}(x_i)$ that are used in form (15) to predict the performance
of a predictor derived from optimizing a cross-validation error across a set of
predictors.
The basic principle at work here (and always) in making valid cross-validation
errors is that whatever one will ultimately do in the entire training set
to make a predictor must be redone (in its entirety!) in every re-
mainder and applied to the corresponding fold.
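
In code form this principle amounts to nesting the entire selection process inside the outer cross-validation. The following sketch is illustrative only; the candidate predictors (k-nn rules with several neighborhood sizes), the helper names, and all data are invented for the purpose of the illustration.

```python
import numpy as np

def kfold_sse(X, Y, fitter, K, seed):
    """Plain K-fold CV error (display (15)) under SEL for one fitting method."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(Y)) % K
    losses = np.empty(len(Y))
    for k in range(K):
        train, test = fold != k, fold == k
        f_hat = fitter(X[train], Y[train])
        losses[test] = (f_hat(X[test]) - Y[test]) ** 2
    return losses.mean()

def nested_cv_error(X, Y, fitters, K=5, seed=0):
    """Valid CV error for a pick-the-(CV error)-winner rule: the whole selection
    among `fitters` is redone inside every remainder T - T_k (steps 1.-5.)."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(Y)) % K
    losses = np.empty(len(Y))
    for k in range(K):
        train, test = fold != k, fold == k
        # steps 1.-4.: an inner CV run entirely within the remainder picks a winner there
        inner = [kfold_sse(X[train], Y[train], f, K, seed + 1) for f in fitters]
        winner = fitters[int(np.argmin(inner))]
        # step 5.: refit the winner on the remainder and predict on the held-out fold
        f_tilde_k = winner(X[train], Y[train])
        losses[test] = (f_tilde_k(X[test]) - Y[test]) ** 2
    return losses.mean()

# hypothetical candidate predictors: k-nn fitters with different neighborhood sizes
def knn_fitter(k):
    def fit(Xtr, Ytr):
        def f_hat(Xte):
            d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
            return Ytr[np.argsort(d, axis=1)[:, :k]].mean(axis=1)
        return f_hat
    return fit

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (150, 2))
Y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 150)
print(nested_cv_error(X, Y, [knn_fitter(k) for k in (1, 5, 15, 40)]))
```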

1.3.8 Penalized Training Error Fitting and Choosing Complexity


A way of creating (and ultimately using cross-validation to choose a good value
for) an explicit numerical complexity measure in supervised learning is through
the notion of penalization of training error. That is, suppose that in the frame-
work of Section 1.3.4 one can define for every element of the class of functions
$S = \{g\}$ a complexity penalty $J[g] \ge 0$ and for every $\lambda \ge 0$ define a measure
of undesirability for g reflecting both fit to the training data and complexity by
$$\mathrm{err} + \lambda J[g] = \frac{1}{N}\sum_{i=1}^{N} L(g(x_i), y_i) + \lambda J[g] \tag{16}$$
8 Intuition suggests that it will typically be optimistic as representing Err for the pick-the-

winner predictor.

Call the function optimizing this objective (over choices of g) for a given $\lambda$ by
the name $\hat{f}_\lambda$. The smaller is $\lambda$, the more complex will be $\hat{f}_\lambda$.
As a simple example, consider p = 1 SEL prediction on $\Re$ with standardized
input x. With $S = \{\beta_1 x + \beta_2 x^2 + \beta_3 x^3 \mid \beta_1, \beta_2, \beta_3 \text{ are all real}\}$, using $J[g] =
\beta_2^2 + \beta_3^2$ penalizes lack of linearity in a fitted cubic. Small $\lambda$ produces essentially
least squares fitting of a cubic and large $\lambda$ produces least squares fitting of a
line.
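
(A minimal sketch of this penalized fitting for the cubic example follows; the data are synthetic, and the closed-form solution is just the normal-equation version of minimizing criterion (16) under squared error loss for this particular S and J.)

```python
import numpy as np

def fit_penalized_cubic(x, y, lam):
    """Minimize (1/N) sum (y_i - (b1 x_i + b2 x_i^2 + b3 x_i^3))^2 + lam (b2^2 + b3^2)."""
    B = np.column_stack([x, x**2, x**3])
    D = np.diag([0.0, 1.0, 1.0])                # only b2 and b3 are penalized
    N = len(y)
    return np.linalg.solve(B.T @ B / N + lam * D, B.T @ y / N)

rng = np.random.default_rng(4)
x = rng.normal(size=200)                        # standardized input
y = np.sin(x) + rng.normal(0, 0.2, 200)         # synthetic (roughly centered) response
print(fit_penalized_cubic(x, y, lam=0.0))       # essentially least squares cubic
print(fit_penalized_cubic(x, y, lam=100.0))     # b2, b3 shrunk: close to a fitted line
```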
Applying this penalized fitting to each remainder $T - T_k$ to produce K
predictors $\hat{f}_\lambda^k$, one can as in display (15) derive a cross-validation error corre-
sponding to $\lambda$ as
$$CV(\lambda) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(\hat{f}_\lambda^{k(i)}(x_i), y_i\bigr)$$
To produce a pick-the-winner rule in this context, one minimizes this (or an
average cross-validation error $\overline{CV}(\lambda)$ if K < N is employed) by choice of $\lambda$,
producing the optimizer, say, $\lambda_{\mathrm{opt}}$ (a function of the training set), and ultimately
employs $\lambda_{\mathrm{opt}}$ and the criterion (16) with the whole training set (T) to produce
the pick-the-winner predictor $\tilde{f} = \hat{f}_{\lambda_{\mathrm{opt}}}$ for application.

1.4 Good Features and Prediction


There are sometimes more or less standard/obvious ways for taking a small
number (p) of "original" features and making (often many) additional ones.
Powers of original variables $x_j$ make sense where polynomial predictors for a
quantitative y are natural. Where a quantitative y can be expected to have
periodic character, sin and/or cos functions can be useful, etc.
Exactly how to think about such data preprocessing and feature-making
is not always completely obvious. It is the intention here to raise several
conceptual and practical issues that potentially arise in feature engineering in
a supervised learning problem.

1.4.1 Classification Models and Optimal Features


Consider first a K-class classification model, where y takes values in $G = \{0, 1, \ldots, K-1\}$.
P then has K conditional distributions for $x|y$, that we will assume
are specified by densities
$$p(x\,|\,0),\ p(x\,|\,1),\ \ldots,\ p(x\,|\,K-1)$$
(There is no loss of generality here. These could be densities with respect to the
simple arithmetic average of the K class-conditional distributions.) There is
important statistical theory concerning minimal sufficiency that promises that
regardless of the original dimensionality of x (namely, p) there is a (K-1)-
dimensional feature that carries all available information about y encoded in
x.

For K = 2 the 1-dimensional likelihood ratio statistic
$$L(x) = \frac{p(x\,|\,1)}{p(x\,|\,0)} \tag{17}$$
is "minimal sufficient." If one knew the value of $L(x)$ one would know all x
has to say about y. An optimal single feature is $L(x)$. In a practical problem,
the closer that one can come to engineering features "like" $L(x)$, the more
efficiently/parsimoniously one represents the input vector x. Of course, any
monotone transform of $L(x)$ is equally as good as $L(x)$.
For K > 2, roughly speaking the K − 1 ratios p(x|k)/p(x|0) (taken together) form a minimal sufficient statistic for the model. This potentially isn't quite true because of possible problems where p(x|0) = 0. But it is true that with s(x) = Σ_{k=0}^{K−1} p(x|k) the vector

$$\left(\frac{p(x|1)}{s(x)},\; \frac{p(x|2)}{s(x)},\; \ldots,\; \frac{p(x|K-1)}{s(x)}\right) \qquad (18)$$

(and many variants of it) is (are) minimal sufficient. To the extent that one can engineer features approximating these K − 1 ratios⁹, one can parsimoniously represent the input vector.

⁹These are the K − 1 conditional probabilities P[y = 1|x], ..., P[y = K − 1|x] for the case where each P[y = k] = 1/K.

1.4.2 Approximating "Partially Optimal" Numerical Features for Discrete Parts of Input Vectors

When one or more coordinates xj of an input vector x are categorical, ordinal,
or numerical-but-discrete it can be useful to try to represent the information
they together provide about y in terms of a (low-dimensional) feature taking
values in ℜ^q for a relatively small q. Numerical features are simply more
directly handled by standard prediction methodologies than categorical, ordinal,
or even discrete numerical ones. Here we consider low-dimensional "partially
optimal" numerical features based on vectors of categorical, ordinal, and/or
discrete numerical inputs and empirical approximations to them.
Suppose that a sub-vector of x, say x̃ = (x_{j₁}, x_{j₂}, ..., x_{j_D}), has entries with respectively only finite numbers M₁, M₂, ..., M_D of possible values, so that the sub-vector has M = M₁M₂⋯M_D possible values. One standard way of representing such an x̃ is through the use of M − 1 dummy (0-1) variables, one for every possible value of x̃ except an arbitrarily chosen "last" one. That deals with the possibility that parts of x̃ are ordinal or categorical with more than 2 possible values in terms of making arithmetic operations applied to their representations sensible. But it also explodes the number of features representing x̃ from D to M − 1, motivating contemplation of another approach.
Consider then the case of classification models where y takes values in a finite set G = {0, 1, ..., K − 1}. There are M·K possible values of (x̃, y). Then (based on all or a fixed subset of the full training set) with N_{x̃,y} the number of training cases with x̃ᵢ = x̃ and yᵢ = y, let N_{·,y} = Σ_{x̃} N_{x̃,y} be the number of training cases with yᵢ = y and N_{x̃,·} = Σ_y N_{x̃,y} be the number of training cases with x̃ᵢ = x̃. The vector function of x̃

$$\hat{P}(y|\tilde{x}) = \frac{1}{N_{\tilde{x},\cdot}}\left(N_{\tilde{x},1},\; N_{\tilde{x},2},\; \ldots,\; N_{\tilde{x},K-1}\right) \qquad (19)$$

serves as an approximation to the numerical (K − 1)-dimensional feature with entries P[y = k|x̃]. And with ŝ(x̃) = Σ_{k=0}^{K−1} (N_{x̃,k}/N_{·,k}), the vector function of x̃

$$\hat{L}(\tilde{x}) = \frac{1}{\hat{s}(\tilde{x})}\left(\frac{N_{\tilde{x},1}}{N_{\cdot,1}},\; \frac{N_{\tilde{x},2}}{N_{\cdot,2}},\; \ldots,\; \frac{N_{\tilde{x},K-1}}{N_{\cdot,K-1}}\right) \qquad (20)$$

serves as an approximation to the numerical (K − 1)-dimensional feature with entries p(x̃|k)/Σ_{k=0}^{K−1} p(x̃|k).
We noted in the previous section that the vector function with entries p(x̃|k)/Σ_{k=0}^{K−1} p(x̃|k) is minimal sufficient in the classification model. And the vector with entries P[y = k|x̃] is an optimal predictor under cross-entropy loss and a function of it is an optimal 0-1 loss classifier. The versions of these in displays (19) and (20) based on x̃ (rather than the full input vector x) thus might then be considered "partially optimal," representing the best one could do supplied only with the discrete part, x̃, of x. And then P̂(y|x̃) and/or L̂(x̃) are approximate partially optimal numerical features of low dimension. How useful they will be in practice will depend in part upon how large are the values N_{x̃,y}, which in turn depends upon how large the (whole or partial) training set is in comparison to M. As the values of D and M employed increase, one should expect the effectiveness of the "partially optimal" features to increase and the fidelity of the approximations to them to decrease. Some trade-off between these effects will be necessary, and a sensible way to try to employ this idea in practice is to build sets of these features with a spectrum of values of M and look for one that is overall most effective.
Now drop the assumption that y has only K values and consider what in this direction can be done in SEL prediction problems. We have noted repeatedly that here the theoretically optimal predictor of y is f(x) = E[y|x], in some sense an "unrealizable optimal feature" for prediction. By the same token, if one had access to only x̃, an optimal feature for prediction would be E[y|x̃]. That suggests thinking of this conditional mean function as a "partially optimal" 1-dimensional feature for encoding the information in x̃ in the full prediction problem.

Simple approximation to the function E[y|x̃] based on (all or part of) a training set is straightforward and can be an effective way to make a 1-dimensional numerical feature to represent the D-dimensional x̃. With N_{x̃} the number of training cases with x̃ᵢ = x̃ (based on all or a subset of the full training set), the
corresponding empirical mean output

$$\bar{y}(\tilde{x}) = \frac{1}{N_{\tilde{x}}}\sum_{i \text{ with } \tilde{x}_i = \tilde{x}} y_i$$

across the M possible values of x̃ defines an approximate partially optimal feature for SEL prediction. How useful this will be in practice will depend in part upon how large the values N_{x̃} are, which in turn depends upon how large the (whole or partial) training set is in comparison to M. As values of D and M employed increase, one should expect the effectiveness of the feature E[y|x̃] to increase and the fidelity of ȳ(x̃) as an approximation to it to decrease. Again, some trade-off between these effects seems necessary, and building and comparing the performance of sets of these features with a spectrum of values of M seems sensible.
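As a small illustration, the following R sketch computes the approximate "partially optimal" SEL feature ȳ(x̃) for a single categorical input; the data and names here are hypothetical.

```r
# Empirical conditional mean feature ybar(x~) for a discrete sub-vector x~,
# here a single categorical variable for simplicity (toy data).
set.seed(2)
N  <- 500
xd <- sample(letters[1:6], N, replace = TRUE)   # the discrete part of x
y  <- rnorm(N, mean = as.integer(factor(xd)))   # response depending on xd

ybar    <- tapply(y, xd, mean)                  # ybar(x~) over the M levels
feature <- ybar[xd]                             # the engineered numeric feature

# For a new case with discrete value "c", the feature value is ybar["c"].
```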
We have here repeatedly used phrases like "all or part of a training set." It is not clear when it will be best to use an entire training set to make these empirical approximations P̂(y|x̃), L̂(x̃), or ȳ(x̃), and when it will be best to reserve only a part of the training set to make them and then use the balance for predictor-building based on these approximately partially optimal features (and other features as appropriate). ISU use of a variant of the approximations L̂(x̃) built on a part of a training set reserved exclusively for feature engineering led to an international first place in the 2014 Prudsys AG Data Mining Cup. The team's intuition was that both using a training case in making the function L̂(x̃) and then subsequently using the case for fitting a classifier was likely to cause overfit and poor performance on test cases.

Ultimately, questions of what fraction (if any) of a training set to reserve for feature-making, which discrete sub-vector or sub-vectors to use in the development of approximate partially optimal features, and all questions of subsequent predictor fitting should in practice be answered via cross-validation (that does every detail of making predictions on K remainders and tests on the corresponding K folds). Cross-validation of all of this (including potential data splitting, choice of M, and feature making) is needed to empirically gauge likely performance on new cases. See remarks at the end of Section 1.4.5 for a bit more on this issue.

1.4.3 Abstract Feature Spaces (of Functions) and "Kernels"


There are surely situations where what P encodes about a relationship between
x and y is very complicated and "non-linear" (whatever that might mean in
this context). Standard (and really, almost all tractable) mathematics of pre-
diction often relies on "linear" operations: additions of vectors, multiplication
of vectors by scalars, inner products (and associated norms and distances), etc.
"Ordinary" creation of features can be thought of as a way to map a feature
space ℜ^p (non-linearly) to a higher-dimensional (Euclidean and therefore linear) feature space ℜ^q. But sometimes that is ineffective because a q large enough to in theory allow for good prediction based on linear operations is so large as to make an appropriate transform from ℜ^p to ℜ^q impossible to identify and/or use.

28
A very clever and practically powerful development in machine learning has been the realization that for some purposes, it is not necessary to map from ℜ^p to a Euclidean space, but that mapping to a linear space of functions may be helpful. That is, creation of new numerical features based on input vector x can be thought of as a transformation

T : ℜ^p → ℜ^q

where relationships in ℜ^q or predictors mapping from ℜ^q and producing a ŷ are then thought of as defining ones for x's in ℜ^p by simply applying T to x's of interest. This line of reasoning doesn't depend at all upon T mapping to a Euclidean space. If A is an abstract feature space of functions (that is an inner product space¹⁰) one might think of mapping

T : ℜ^p → A

and using linear operations and relationships in A to make relationships and predictors based on a's in A, and then defining corresponding ones for x's in ℜ^p by simply applying T to x's of interest. After all, in some sense functions are really just high-dimensional vectors, and if transforming ℜ^p → ℜ^q with p < q is often useful, so also might be transforming ℜ^p → A.
This line of argument has especially been taken advantage of through the use of so-called "kernel functions." (Be careful. There are many different usages of the word "kernel" in the machine learning world.) Suppose that a symmetric function K(x, z) with domain some part of ℜ^p × ℜ^p is non-negative definite in the sense that for any training set T the (symmetric) N × N so-called "Gram matrix"

$$\boldsymbol{K} = \left(K(x_i, x_j)\right)_{i=1,\ldots,N;\; j=1,\ldots,N} \qquad (21)$$

is non-negative definite. Then the space of functions that are linear combinations of "slices" of K(x, z), i.e. functions of x of the form

$$\sum_{j=1}^{M} c_j K(x, z_j)$$

for M > 0, real numbers c₁, c₂, ..., c_M, and elements z₁, z₂, ..., z_M of ℜ^p, form a linear space (call it A). It is possible to coherently define a very convenient inner product on that space starting from the basic relationship

$$\langle K(\cdot, z_1),\, K(\cdot, z_2)\rangle_{\mathcal{A}} \equiv K(z_1, z_2) \qquad (22)$$

and using the bilinearity of any inner product to see that then of necessity

$$\left\langle \sum_{j=1}^{M} c_{1j}K(\cdot, z_j),\; \sum_{j=1}^{M} c_{2j}K(\cdot, z_j)\right\rangle_{\mathcal{A}} = \sum_{j=1}^{M}\sum_{j'=1}^{M} c_{1j}c_{2j'}K(z_j, z_{j'}) = \boldsymbol{c}_1'\boldsymbol{K}\boldsymbol{c}_2$$

for c₁′ = (c₁₁, ..., c₁_M), c₂′ = (c₂₁, ..., c₂_M), and M × M matrix K with entries K(zᵢ, zⱼ). This has the important special case that for c = c₁ = c₂

$$\left\|\sum_{j=1}^{M} c_j K(\cdot, z_j)\right\|_{\mathcal{A}}^2 = \left\langle \sum_{j=1}^{M} c_j K(\cdot, z_j),\; \sum_{j=1}^{M} c_j K(\cdot, z_j)\right\rangle_{\mathcal{A}} = \boldsymbol{c}'\boldsymbol{K}\boldsymbol{c}$$

¹⁰See Section 2.1 for more concerning the meaning of this language.

Of course, since K defines the inner product in A it also defines the distance between Σ_{j=1}^{M} c₁ⱼK(·, zⱼ) and Σ_{j=1}^{M} c₂ⱼK(·, zⱼ):

$$d_{\mathcal{A}}\left(\sum_{j=1}^{M} c_{1j}K(\cdot, z_j),\; \sum_{j=1}^{M} c_{2j}K(\cdot, z_j)\right) = \sqrt{(\boldsymbol{c}_1 - \boldsymbol{c}_2)'\boldsymbol{K}(\boldsymbol{c}_1 - \boldsymbol{c}_2)}$$

(with c₁, c₂, and K as before).


Relationship (22) is the origin of the language that K serves as a repro-
ducing kernel. It both de…nes the linear space of functions of interest and
provides the inner product for the space. Under some conditions, the space
A (whose elements are functions <p ! <) can be extended to include limits
of …nite linear combinations of slices of the kernel function K ( ; ) and the re-
sulting construct is termed a Reproducing Kernel (Hilbert) Space (RKHS) of
functions.
In any event, having identi…ed an inner product space associated with a
kernel, the abstract transform T : <p ! A is de…ned by
T (x) ( ) = K (x; )
(remember here that T (x) ( ) is a function of " "). The inner product in A of
two images of elements of <p is
hT (x) ; T (z)iA = K (x; z)
and for a training set with inputs x1 ; x2 ; : : : ; xN the span of fT (xi )gi=1;:::;N is
a linear subspace of A.
Probably the most used kernel function in machine learning is the "Gaussian kernel"

$$K(x, z) = \exp\left(-\|x - z\|^2\right)$$

that produces abstract features

$$T(x)(\cdot) = \exp\left(-\|x - \cdot\|^2\right)$$

that are radially symmetric p-variate Normal density functions located at x. The function space consists of linear combinations of such functions (and limits of them) and the abstract inner product of T(x) and T(z) is exp(−‖x − z‖²).
One can even give up requiring that the domain of a kernel function K(x, z) is a subset of ℜ^p × ℜ^p, replacing it with arbitrary X × X and requiring only that the Gram matrix be non-negative definite for any set of {xᵢ}_{i=1}^n, xᵢ ∈ X. It is in this context that the "string kernels" of "text processing" briefly discussed in Section 1.4.4 can be called "kernels."
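A quick numerical check of the non-negative definiteness of the Gram matrix (21) for the Gaussian kernel is easy to script; the R sketch below uses arbitrary simulated inputs and a scale parameter gamma that is our addition (the display above corresponds to gamma = 1).

```r
# Gram matrix of display (21) for the Gaussian kernel, with a numerical
# check that its eigenvalues are non-negative (toy inputs in R^2).
set.seed(3)
X <- matrix(rnorm(20), ncol = 2)                 # 10 training inputs in R^2
gauss_kernel <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

N <- nrow(X)
K <- matrix(0, N, N)
for (i in 1:N) for (j in 1:N) K[i, j] <- gauss_kernel(X[i, ], X[j, ])

min(eigen(K, symmetric = TRUE)$values)   # >= 0 up to rounding error
```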

30
Kernel Mechanics  A direct way of producing a kernel function is through a Euclidean inner product of vectors of "features." That is, if φ : X → ℜ^m (so that component j of φ, φⱼ, creates the univariate real feature φⱼ(x)) then for ⟨·, ·⟩ the usual Euclidean inner product (dot product),

$$K(x, z) = \langle \varphi(x), \varphi(z)\rangle \qquad (23)$$

is a kernel function. (This basic idea will be used in Section 2.4.2.)


Section 6.2 of the book Pattern Recognition and Machine Learning by Bishop notes that it is very easy to make new kernel functions from known ones. In particular, for c > 0, K₁(·, ·) and K₂(·, ·) kernel functions on X × X, h(·) : X → ℜ arbitrary, q(·) a polynomial with non-negative coefficients, φ : X → ℜ^m, K₃(·, ·) a kernel on ℜ^m × ℜ^m, and M a non-negative definite matrix, all of the following are kernel functions:

1. K(x, z) = cK₁(x, z) on X × X,

2. K(x, z) = h(x)K₁(x, z)h(z) on X × X,

3. K(x, z) = q(K₁(x, z)) on X × X,

4. K(x, z) = exp(K₁(x, z)) on X × X,

5. K(x, z) = K₁(x, z) + K₂(x, z) on X × X,

6. K(x, z) = K₁(x, z)K₂(x, z) on X × X,

7. K(x, z) = K₃(φ(x), φ(z)) on X × X, and

8. K(x, z) = x′Mz on ℜ^m × ℜ^m.

(Fact 7 generalizes the basic insight of display (23).) Further, if X ⊂ X_A × X_B and K_A(·, ·) is a kernel on X_A × X_A and K_B(·, ·) is a kernel on X_B × X_B, then the following are both kernel functions:

9. K((x_A, x_B), (z_A, z_B)) = K_A(x_A, z_A) + K_B(x_B, z_B) on X × X, and

10. K((x_A, x_B), (z_A, z_B)) = K_A(x_A, z_A)K_B(x_B, z_B) on X × X.

An example of a kernel on a somewhat abstract (but finite) space is this. For a finite set B consider X = 2^B, the set of all subsets of B. A kernel on X × X can then be defined by

$$K(B_1, B_2) = 2^{|B_1\cap B_2|} \quad\text{for } B_1\subset B \text{ and } B_2\subset B$$
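As with the Gaussian kernel, a small R sketch can verify numerically that this subset kernel produces a non-negative definite Gram matrix; the particular subsets of B = {1,...,5} used below are hypothetical.

```r
# Gram matrix for the subset kernel K(B1, B2) = 2^|B1 intersect B2|,
# checked for non-negative definiteness on a few arbitrary subsets.
B_subsets <- list(c(1, 2), c(2, 3, 4), integer(0), c(1, 2, 3, 4, 5), c(5))
K_subset  <- function(B1, B2) 2^length(intersect(B1, B2))

n <- length(B_subsets)
K <- outer(1:n, 1:n,
           Vectorize(function(i, j) K_subset(B_subsets[[i]], B_subsets[[j]])))
min(eigen(K, symmetric = TRUE)$values)   # non-negative, as the theory promises
```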

There are several probabilistic and statistical arguments that can lead to forms for kernel functions. For example, a useful fact from probability theory (Bochner's Theorem) says that characteristic functions for p-dimensional distributions are non-negative definite complex-valued functions of s ∈ ℜ^p. So if ψ(s) is a real-valued characteristic function, then

K(x, z) = ψ(x − z)

is a kernel function on ℜ^p × ℜ^p. Related to this line of thinking are lists of standard characteristic functions (that in turn produce kernel functions) and theorems about conditions sufficient to guarantee that a real-valued function is a characteristic function. For example, each of the following is a real characteristic function for a univariate random variable (that can lead to a kernel on ℜ¹ × ℜ¹):

1. ψ(t) = cos(at) for some a > 0,

2. ψ(t) = sin(at)/(at) for some a > 0,

3. ψ(t) = exp(−at²) for some a > 0, and

4. ψ(t) = exp(−a|t|) for some a > 0.

And one theorem about sufficient conditions for a real-valued function ψ on ℜ¹ to be a characteristic function says that if ψ is symmetric (ψ(−t) = ψ(t)), ψ(0) = 1, and ψ is decreasing and convex on [0, ∞), then ψ is the characteristic function of some distribution on ℜ¹. (See Chung's A Course in Probability Theory, page 191.)
Bishop points out two constructions motivated by statistical modeling that yield kernels that have been used in the machine learning literature. One is this. For a parametric model on (a potentially completely abstract) X, consider densities p(x|θ) that when treated as functions of θ are likelihood functions (for various possible observed x). Then for a distribution G for θ ∈ Θ,

$$K(x, z) = \int p(x|\theta)\, p(z|\theta)\, dG(\theta)$$

is a kernel. This is the inner product in the space of square integrable functions on the probability space with measure G of the two likelihood functions. In this space, the distance between the functions (of θ) p(x|θ) and p(z|θ) is

$$\sqrt{\int \left(p(x|\theta) - p(z|\theta)\right)^2 dG(\theta)}$$

and what is going on here is the implicit use of (infinite-dimensional) features that are likelihood functions for the "observations" x. Once one starts down this path, other possibilities come to mind. One is to replace likelihoods with loglikelihoods and consider the issue of "centering" and even "standardization." That is, one might define a feature (a function of θ) corresponding to x as

$$\varphi_x(\theta) = \ln p(x|\theta) \quad\text{or}\quad \varphi'_x(\theta) = \ln p(x|\theta) - \int \ln p(x|\theta)\, dG(\theta)$$

$$\text{or even}\quad \varphi''_x(\theta) = \frac{\ln p(x|\theta) - \int \ln p(x|\theta)\, dG(\theta)}{\sqrt{\int\left(\ln p(x|\theta) - \int \ln p(x|\theta)\, dG(\theta)\right)^2 dG(\theta)}}$$
Then obviously, the corresponding kernel function is

$$K(x,z) = \int \varphi_x(\theta)\,\varphi_z(\theta)\, dG(\theta) \quad\text{or}\quad K'(x,z) = \int \varphi'_x(\theta)\,\varphi'_z(\theta)\, dG(\theta)$$

$$\text{or}\quad K''(x,z) = \int \varphi''_x(\theta)\,\varphi''_z(\theta)\, dG(\theta)$$

(Of these three possibilities, centering alone is probably the most natural from a statistical point of view. It is the "shape" of a loglikelihood that is important in statistical context, not its absolute level. Two loglikelihoods that differ by a constant are equivalent for most statistical purposes. Centering perfectly lines up two loglikelihoods that differ by a constant.)
In a regular statistical model for x taking values in X with Euclidean parameter vector θ = (θ₁, θ₂, ..., θ_k), the k × k Fisher information matrix, say I(θ), is non-negative definite. Then with score function

$$\nabla \ln p(x|\theta) = \left(\frac{\partial}{\partial\theta_1}\ln p(x|\theta),\; \frac{\partial}{\partial\theta_2}\ln p(x|\theta),\; \ldots,\; \frac{\partial}{\partial\theta_k}\ln p(x|\theta)\right)'$$

(for any fixed θ) the function

$$K(x, z) = \left(\nabla\ln p(x|\theta)\right)'\left(I(\theta)\right)^{-1}\nabla\ln p(z|\theta)$$

has been called the "Fisher kernel" in the machine learning literature. (It follows from Bishop's 7. and 8. that this is indeed a kernel function.) Note that K(x, x) is essentially the score test statistic for a point null hypothesis about θ. The implicit feature vector here is the k-dimensional score function (evaluated at some fixed θ, a basis for testing about θ), and rather than Euclidean norm, the norm ‖u‖ ≡ √(u′(I(θ))⁻¹u) is implicitly in force for judging the size of differences in feature vectors.

1.4.4 Document Features and String Kernels for Text Processing


An important application of various kinds of both supervised and unsupervised learning methods is that of text processing. The object is to quantify structure and commonalities in text documents. Patterns in characters and character strings and words are used to characterize documents, group them into clusters, and classify them into types. We here say a bit about some simple methods that have been applied.

Suppose that N documents in a collection (or corpus) are under study. One needs to define "features" for these, or at least some kind of "kernel" functions for computing the inner products required for producing principal components in an implicit feature space (and subsequently clustering or deriving classifiers, and so on).
If one treats documents as simply sets of words (ignoring spaces and punctuation and any kind of order of words) one simple set of features for documents d₁, d₂, ..., d_N is a set of counts of word frequencies. That is, for a set of p words appearing in at least one document, one might take

x_ij = the number of occurrences of word j in document i

and operate on a representation of the documents in terms of an N × p data matrix X. These raw counts x_ij are often transformed before processing. One popular idea is the use of a "tf-idf" (term frequency-inverse document frequency) weighting of elements of X. This replaces x_ij with

$$t_{ij} = x_{ij}\ln\left(\frac{N}{\sum_{i=1}^{N} I[x_{ij} > 0]}\right)$$

or variants thereof. (This up-weights non-zero counts of words that occur in few documents. The logarithm is there to prevent this up-weighting from overwhelming all other aspects of the counts.) One might also decide that document length is a feature that is not really of primary interest and determine to normalize vectors xᵢ (or tᵢ) in one way or another. That is, one might begin with values

$$\frac{x_{ij}}{\sum_{j=1}^{p} x_{ij}} \quad\text{or}\quad \frac{x_{ij}}{\|x_i\|}$$

rather than values x_ij. This latter consideration is, of course, not relevant if the documents in the corpus all have roughly the same length.
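The tf-idf weighting and length normalization just described are simple to compute; the R sketch below starts from a small hypothetical count matrix X.

```r
# tf-idf weighting t_ij = x_ij * ln(N / #{documents containing word j}),
# followed by optional Euclidean length normalization (toy counts).
X <- rbind(c(2, 0, 1, 0),
           c(0, 3, 0, 0),
           c(1, 1, 0, 4))                      # 3 documents, 4 words
N <- nrow(X)
doc_freq <- colSums(X > 0)                     # documents containing word j
tfidf <- sweep(X, 2, log(N / doc_freq), "*")

tfidf_norm <- tfidf / sqrt(rowSums(tfidf^2))   # normalize each document vector
```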
Processing methods that are based only on variants of the word counts x_ij are usually said to be based on the "Bag-of-Words." They obviously ignore potentially important word order. (The instructions "turn right then left" and "turn left then right" are obviously quite different instructions.) One could then consider ordered pairs or n-tuples of words.

So, with some "alphabet" A (that might consist of English words, Roman letters, amino acids in protein sequencing, base pairs in DNA sequencing, etc.) consider strings of elements of the alphabet, say

s = b₁b₂⋯b_{|s|} where each bᵢ ∈ A

A document (... or protein sequence ... or DNA sequence) might be idealized as such a string of elements of A. An n-gram in this context is simply a string of n elements of A, say

u = b₁b₂⋯b_n where each bᵢ ∈ A

Frequencies of unigrams (1-grams) in documents are (depending upon the alphabet) "bag-of-words" statistics for words or letters or amino acids, etc. Use of features of documents that are counts of occurrences of all possible n-grams appears to often be problematic, because unless |A| and n are fairly small, p = |A|^n will be huge and then X huge and sparse (for ordinary N and |s|). And in many contexts, sequence/order structure is not so "local" as to be effectively expressed by only frequencies of n-grams for small n.
One idea that seems to be currently popular is to define a set of interesting strings, say U = {uᵢ}_{i=1}^p, and look for their occurrence anywhere in a document, with the understanding that they may be realized as substrings of longer strings. That is, when looking for string u (of length n) in a document s, we count every different substring of s (say s′ = s_{i₁}s_{i₂}⋯s_{iₙ}) for which

s′ = u

But we discount those substrings of s matching u according to length as follows. For some λ > 0 (the choice λ = .5 seems pretty common) give matching substring s′ = s_{i₁}s_{i₂}⋯s_{iₙ} weight

$$\lambda^{i_n - i_1 + 1} = \lambda^{|s'|}$$

so that document i (represented by string sᵢ) gets value of feature j

$$x_{ij} = \sum_{s_{il_1}s_{il_2}\cdots s_{il_n} = u_j} \lambda^{l_n - l_1 + 1} \qquad (24)$$

It further seems that it's common to normalize the rows of X by the usual Euclidean norm, producing in place of x_ij the value

$$\frac{x_{ij}}{\|x_i\|} \qquad (25)$$
This notion of using features (24) or normalized features (25) looks attractive, but potentially computationally prohibitive, particularly since the "interesting set" of strings U is often taken to be A^n. One doesn't want to have to compute all features (24) directly and then operate with the very large matrix X. But just as we were reminded in Section 2.4.2, it is only XX′ that is required to find principal components of the features (or to define SVM classifiers or any other classifiers or clustering algorithms based on principal components). So if there is a way to efficiently compute or approximate inner products for rows of X defined by form (24), namely

$$\langle x_i, x_{i'}\rangle = \sum_{u\in A^n}\left(\sum_{s_{il_1}s_{il_2}\cdots s_{il_n} = u}\lambda^{l_n-l_1+1}\right)\left(\sum_{s_{i'm_1}s_{i'm_2}\cdots s_{i'm_n} = u}\lambda^{m_n-m_1+1}\right)$$

$$= \sum_{u\in A^n}\;\sum_{s_{il_1}s_{il_2}\cdots s_{il_n} = u}\;\sum_{s_{i'm_1}s_{i'm_2}\cdots s_{i'm_n} = u}\lambda^{l_n-l_1+m_n-m_1+2}$$

it might be possible to employ this idea. And if the inner products ⟨xᵢ, x_{i'}⟩ can be computed efficiently, then so can the inner products

$$\left\langle \frac{1}{\|x_i\|}x_i,\; \frac{1}{\|x_{i'}\|}x_{i'}\right\rangle = \frac{\langle x_i, x_{i'}\rangle}{\sqrt{\langle x_i, x_i\rangle\,\langle x_{i'}, x_{i'}\rangle}}$$

needed to employ XX′ for the normalized features (25). For what it is worth,
it is in vogue to call the function of documents s and t defined by

$$K(s, t) = \sum_{u\in A^n}\;\sum_{s_{l_1}s_{l_2}\cdots s_{l_n} = u}\;\sum_{t_{m_1}t_{m_2}\cdots t_{m_n} = u}\lambda^{l_n-l_1+m_n-m_1+2}$$

the String Subsequence Kernel and then call the matrix XX′ = (⟨xᵢ, xⱼ⟩) = (K(sᵢ, sⱼ)) the Gram matrix for that "kernel." The good news is that there are fairly simple recursive methods for computing K(s, t) exactly in O(n|s||t|) time and that there are approximations that are even faster (see the 2002 Journal of Machine Learning Research paper of Lodhi et al.). That makes the implicit use of features (24) or normalized features (25) possible in many text processing problems.

1.4.5 "Feature Engineering" and Data "Pre-processing": More Per-


spective and Prediction of Predictor E¢ cacy
Feature engineering amounts to replacing every observation vector x with a fixed function/transform thereof, T(x). It should then be completely obvious that feature engineering cannot produce training data that are intrinsically "more informative" than the original ones. In fact, if the function T(·) is not one-to-one, a transformed dataset is potentially less informative than the original in the absolute sense of its potential usefulness.¹¹ What then is the point of feature engineering? It is to put data into a form compatible with simple existing methods of processing inputs into outputs or to provide additional predictors beyond what standard methods produce when applied directly. It is a common form of sloppy thought or expression to say that feature engineering makes data more informative. Rather, it can make them more compatible with standard prediction methodologies than the original training set¹² or extend the flexibility of those standard prediction methodologies.

As a toy example, consider a 2-class classification problem with x ∈ ℜ² where every training case with ‖xᵢ − (2, 2)′‖ < 1 has yᵢ = 0 and every training case with ‖xᵢ − (2, 2)′‖ ≥ 1 has yᵢ = 1. Then a classifier with (0-1 loss) err = 0 is

$$\hat{f}(x) = I\left[(x_1 - 2)^2 + (x_2 - 2)^2 \geq 1\right]$$

which is, for example, not expressible in terms of two regions in ℜ² with linear boundaries. However, if one defines the nonlinear transform T : ℜ² → ℜ⁵ by

$$T(x) = \left(x_1,\, x_2,\, x_1^2,\, x_2^2,\, x_1x_2\right)'$$

¹¹The theory of statistical sufficiency is concerned with which non-one-to-one transforms do not cause loss of information.

¹²For example, reducing a signal to a set of Fourier coefficients does not increase information about the signal. But it does replace the signal with a set of variables that are potentially more convenient than the original signal itself in terms of existing signal processing methodology.

then a very small amount of algebra shows that the classifier can be written in terms of a linear combination of coordinates of T(x) as

$$(-4, -4, 1, 1, 0)\, T(x) \geq -7$$

That is, thought of as defined in terms of T(x) ∈ ℜ⁵ (in terms of the input transformed to the higher-dimension space ℜ⁵) the classifier is defined by a very simple linear (inner product) operation.
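The algebra behind the toy example can be checked numerically; the R sketch below uses arbitrary points in the plane and confirms that the circular rule and the linear rule in the transformed space agree.

```r
# Check that I[(x1-2)^2 + (x2-2)^2 >= 1] agrees with the linear rule
# (-4,-4,1,1,0)'T(x) >= -7 in the transformed space R^5 (toy points).
T_map <- function(x) c(x[1], x[2], x[1]^2, x[2]^2, x[1] * x[2])

set.seed(6)
xs <- matrix(runif(200, 0, 4), ncol = 2)                 # 100 arbitrary points
circle_rule <- (xs[, 1] - 2)^2 + (xs[, 2] - 2)^2 >= 1
linear_rule <- apply(xs, 1, function(x) sum(c(-4, -4, 1, 1, 0) * T_map(x)) >= -7)
all(circle_rule == linear_rule)                          # should be TRUE
```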
The toy example is instructive because it has characteristics of a strategy that is commonly effective in practice. That is one where a nonlinear transform is employed to map training cases into a linear space in which simple operations are used to define a predictor. (It should be noted that in the event that x takes values in a linear space like ℜ², a linear transform has no potential to provide the kind of advantage seen in the hypothetical example. That is because a linear transform can only map the training set to a linear subspace of dimension no more than that spanned by the original training set.)

Notice that the thinking here substantially blurs any perceived line between "feature engineering" and "predictor fitting." They are both really parts of a single process and one cannot be treated as inconsequential to the production of the test error, Err (nor ignored in attempts to represent it empirically through cross-validation).

It is also important to think clearly about what goes into the making of a transformed feature T(x). The intent of the notation T(x) is that the form of the function T(·) does not depend upon the training set. But sometimes data "pre-processing" effectively violates this understanding, making the form of the function training-set-dependent. One might use notation like T(T, ·) to represent this, and this issue must be carefully handled in cross-validation.
That is, if one is contemplating use of a predictor built upon a training set (T(T, x₁), y₁), (T(T, x₂), y₂), ..., (T(T, x_N), y_N) and hopes to use K-fold cross-validation to reliably predict predictor performance, fitting on remainder k must be done using not values (T(T, xᵢ), yᵢ) for cases in remainder k, but rather values (T(T − T_k, xᵢ), yᵢ). For example, as mentioned in Section 1.3.6, when building predictors based on standardized inputs, standardization must be done afresh for each new remainder! If the training set will be used to choose a parameter of a kernel for use in defining abstract features associated with input vectors, the same kind of choice must be made one remainder at a time, etc. Failure to do so breaks the cross-validation paradigm and the basic maxim that whatever is ultimately going to be done to make predictions must be done in each individual remainder, i.e. must be done K times. Typically, failure to follow this maxim will produce unduly optimistic (and substantially wrong) supposed "cross-validation errors."
This matter seems particularly important to recognize in cases (like those
where a training set will be used to make approximate likelihood ratios per
Section 1.4.2) where the responses in the training set or remainder (not the
inputs only) are involved in the making of new features. The issue also raises
the question of exactly how best to use a training set (or remainder) to both 1)

choose T as a function of the training set (or remainder) and then 2) build a
predictor. Two possibilities are to 1) use the entire training set (or remainder) in
both steps, or to 2) randomly split the training set (or remainder) into two parts,
the first for use in choosing the form of T and the other for use in subsequently
building the prediction algorithm. Which of these (or some other version of
them) is likely to be most effective is not clear. What is clear is that care
must be taken to "separately do in each remainder in a cross-validation all that
will be ultimately done with the full training set" if one is to produce reliable
cross-validation errors.

1.5 Some More Generalities for 2-Class Classification


In Section 1.3.2 we identified a theoretically optimal (0-1 loss) K-class classifier as

$$f(x) = \arg\max_k P[y = k|x]$$

By far, the most important version of this is the K = 2 case. And for this case, there are some very important additional general insights that we proceed to discuss.

1.5.1 More on the Form of an Optimal 0-1 Loss Classifier for K = 2


For K = 2, for various purposes different ones of the (arbitrary and completely equivalent) codings for the possible values of y

{0, 1},  {1, 2},  and  {−1, 1}

prove useful. For the time being, employ the first and abbreviate P[y = 1] as π (so that P[y = 0] = 1 − π), and write p(x|1) and p(x|0) for the two class-conditional densities for x. Then

$$\begin{aligned} P[y=1|x] &= \frac{\pi\, p(x|1)}{\pi\, p(x|1) + (1-\pi)\, p(x|0)} \quad\text{and} \qquad (26)\\ P[y=0|x] &= \frac{(1-\pi)\, p(x|0)}{\pi\, p(x|1) + (1-\pi)\, p(x|0)}\end{aligned}$$

An optimal classifier is then

$$\begin{aligned} f(x) &= I\left[P[y=1|x] > .5\right] \qquad (27)\\ &= I\left[P[y=1|x] > P[y=0|x]\right]\\ &= I\left[\frac{p(x|1)}{p(x|0)} > \frac{1-\pi}{\pi}\right]\\ &= I\left[L(x) > \frac{1-\pi}{\pi}\right] \qquad (28)\end{aligned}$$

and one decides in favor of y = 1 when P[y = 1|x] is large, or equivalently the likelihood ratio L(x) defined in (17) is large. Notice that this latter insight

makes connection to classical statistical theory and identifies the optimal classifier as a Neyman-Pearson test of the simple hypotheses H₀ : y = 0 versus Hₐ : y = 1 with "cut-point" the ratio (1 − π)/π.

As a slight generalization of this development, note that for l₀ ≥ 0 and l₁ ≥ 0 and an asymmetric loss

L(ŷ, y) = l_y I[ŷ ≠ y]

an optimal classifier is

$$f(x) = I\left[L(x) > \frac{(1-\pi)\, l_0}{\pi\, l_1}\right]$$

In fact, for a completely general choice of four losses L(ŷ, y) in a 2-class classification model, it is easy enough to argue that for

$$R = \left|\frac{L(1,0) - L(0,0)}{L(1,0) - L(0,0) - L(1,1) + L(0,1)}\right|$$

an optimal classifier is

f(x) = I[P[y = 1|x] > R]

which for R ∈ (0, 1) is

$$f(x) = I\left[L(x) > \frac{(1-\pi)R}{\pi(1-R)}\right]$$

Shifting P[y = 1]: Effects on P[y = 1|x] and the Form of an Optimal 0-1 Loss Classifier  An important issue in classification models is the effect of changes in π on both P[y = 1|x] and (optimal classifier) f(x). There are situations, for example, in which π is very extreme (one class is rare)¹³ and it is then common practice to build a predictor using a training set made with relative frequency of y = 1 that is π*, a value that is much more moderate (nearer to .5) than π. The obvious question is how to translate results for the synthetic value π* to results for the real value π.

Relationship (26) implies that

$$P[y=1|x] = \frac{L(x)}{L(x) + \dfrac{1-\pi}{\pi}}$$

and that

$$L(x) = \frac{1-\pi}{\pi}\cdot\frac{P[y=1|x]}{1 - P[y=1|x]}$$

So, for the time being subscripting P with either π or π* depending upon which marginal probability of y = 1 is operating (in models with the same class-conditional densities p(x|1) and p(x|0)),

$$P_\pi[y=1|x] = \frac{\dfrac{1-\pi^*}{\pi^*}\cdot\dfrac{P_{\pi^*}[y=1|x]}{1-P_{\pi^*}[y=1|x]}}{\dfrac{1-\pi^*}{\pi^*}\cdot\dfrac{P_{\pi^*}[y=1|x]}{1-P_{\pi^*}[y=1|x]} + \dfrac{1-\pi}{\pi}} \qquad (29)$$

¹³The terminology of "extreme class imbalance" is commonly used.

from which it is obvious how to translate an estimate of P_{π*}[y = 1|x] made from a synthetically balanced training set to one for the real situation described by π. Further, an optimal classifier (27) or (28) is

$$I\left[\frac{P_{\pi^*}[y=1|x]}{1-P_{\pi^*}[y=1|x]} > \frac{(1-\pi)\,\pi^*}{\pi\,(1-\pi^*)}\right]$$

and it is obvious how to translate an estimate of P_{π*}[y = 1|x] made from a synthetically balanced training set to an approximately optimal classification for the real situation described by π.

For example, considering the k-nearest neighbor set-up of Section 1.3.3 using a training set made with relative frequency of y = 1 that is π* when the real probability that y = 1 is π, the right use of a neighborhood of x containing n₁(x) cases xᵢ with y = 1 and n₀(x) = k − n₁(x) with y = 0 is to classify according to

$$I\left[n_1(x)\,\pi\,(1-\pi^*) > n_0(x)\,(1-\pi)\,\pi^*\right]$$

which is the appropriate modification of the simple k-nearest neighbor rule made to account for the difference between π and π*.
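Relationship (29) is easy to apply in code; the R sketch below translates a class probability estimated under a synthetically balanced training fraction π* back to a real prevalence π, with illustrative numbers that are hypothetical.

```r
# Translate P*[y=1|x] estimated under class-1 training fraction pi_star
# to the real situation with class-1 probability pi, via relationship (29).
translate_prob <- function(p_star, pi_star, pi) {
  A <- ((1 - pi_star) / pi_star) * p_star / (1 - p_star)  # the likelihood ratio L(x)
  A / (A + (1 - pi) / pi)
}

# A predicted 0.60 under a balanced (pi_star = 0.5) training set, when the
# real class-1 prevalence is pi = 0.01, translates to roughly 0.015:
translate_prob(0.60, pi_star = 0.5, pi = 0.01)
```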

1.5.2 Other Prediction Problems in 2-Class Classification Models


There are other (besides 0-1 loss) standard prediction problems sometimes considered in the 2-class classification model. (These often show up as alternatives to use of classification error rate as test criteria in prediction contests.)

One such problem concerns class probability prediction. In a context where y is in {0, 1} but ŷ is allowed to be any real number in [0, 1], the so-called "log loss" (that is the 2-class version of the cross-entropy loss of Section 1.3.2 as well as the negative Bernoulli log-likelihood)

$$L(\hat{y}, y) = -y\ln\hat{y} - (1-y)\ln(1-\hat{y})$$

is sometimes employed. For this loss, a theoretically optimal predictor is

f(x) = E[y|x] = P[y = 1|x]

For reasons that will shortly become clear (in Section 1.5.3), it is sometimes convenient to use not 0-1 coding but rather −1/1 coding in 2-class classification models, so that y is in {−1, 1}. Suppose that ŷ is allowed to be any real number; then three other (initially odd-looking) losses are sometimes considered, namely

$$L_1(\hat{y}, y) = \ln(1 + \exp(-y\hat{y}))/\ln(2),$$
$$L_2(\hat{y}, y) = \exp(-y\hat{y}),\ \text{and}$$
$$L_3(\hat{y}, y) = (1 - y\hat{y})_+$$

For these losses, theoretically optimal predictors are respectively

$$f_1(x) = \ln\left(\frac{P[y=1|x]}{P[y=-1|x]}\right) = \ln L(x),\qquad f_2(x) = \frac{1}{2}\ln\left(\frac{P[y=1|x]}{P[y=-1|x]}\right) = \frac{1}{2}\ln L(x),$$

$$\text{and}\qquad f_3(x) = \mathrm{sign}\left(P[y=1|x] - P[y=-1|x]\right)$$

The "AUC" Criterion Another problem related to 2-class classi…cation uses


(1 minus) an "Area Under the Curve" (AUC) as a loss. One chooses a function
O (x) taking values in [ 1; 1] to order values of x (large O (x) indicating large
likelihood that y = 1). For independent x with the (P ) distribution of xjy = 0
and x with the (P ) distribution of xjy = 1 the theoretical "AUC" for O (to
be maximized) is
P [O (x) < O (x )] (30)
Arguments below (based on "receiver operating characteristic curves" and Neyman-
Pearson theory) establish that an optimal O (x) is the likelihood ratio L (x) de-
…ned in (17) or any monotone increasing transform, of it including P [y = 1jx]).

AUC Technical Details  Conceptually, an empirical "ROC" curve for a test set is this. For M test cases with M₀ actual yᵢ = 0 cases and M₁ = M − M₀ actual yᵢ = 1 cases, one plots M₀ points

$$\left(\frac{j}{M_0},\; \hat{p}_{1j}\right) \quad\text{for } j = 1, 2, \ldots, M_0 - 1, M_0 \qquad (31)$$

where if the test cases are arranged left to right as judged least-to-most likely to have yᵢ = 1,

p̂₁ⱼ = the fraction of yᵢ = 1 cases to the right of the jth left-most yᵢ = 0 case

If one then makes a step function from the plotted points (constant at the vertical of a plotted point over the interval of length 1/M₀ to its left) and then computes the area under that "curve" one obtains an "AUC" (a figure of merit often used in predictive analytics contests). If the ordering of cases comes from O, this area is

$$AUC = \frac{1}{M_0}\sum_{j=1}^{M_0}\hat{p}_{1j} = \frac{1}{M_0}\sum_{i \text{ s.t. } y_i = 0}\left(\frac{1}{M_1}\sum_{j \text{ s.t. } y_j = 1} I\left[O(x_i) < O(x_j)\right]\right) \qquad (32)$$
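The pairwise form of display (32) makes the empirical AUC a one-line computation; the R sketch below uses hypothetical scores O and labels y (ties are ignored, as in (32)).

```r
# Empirical AUC of display (32): the fraction of (y=0, y=1) pairs in which
# the y=1 case receives the larger ordering-function value (toy scores).
set.seed(8)
y <- rbinom(60, 1, 0.4)
O <- rnorm(60, mean = y)                 # scores tending to be larger when y = 1

auc <- mean(outer(O[y == 0], O[y == 1], "<"))
auc
```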

Let G₀ and G₁ be respectively the y = 0 and y = 1 class conditional cdfs of O(x). Then corresponding to the empirical AUC is the theoretical ("integrated power") value

$$IP = \int \left(1 - G_1(t)\right) dG_0(t) \qquad (33)$$
IP is exactly the criterion (30) and in the event that G₀(t) is continuous and increasing (and thus has an inverse) this is

$$IP = \int_0^1 \left(1 - G_1\left(G_0^{-1}(u)\right)\right) du$$

Notice too that if for each t one builds from O a classifier of the form

a_t(x) = I[O(x) > t]

the integrand in display (33) is the power of the test/classifier as a function of t and IP is an average (according to the y = 0 class conditional distribution of O(x)), an "integrated power."
Let α(t) be the Type I error rate of the test a_t(x), i.e.

$$\alpha(t) = E_0\, a_t(x) = P[O(x) > t\,|\,y = 0]$$

and β(t) be the Type II error rate of a_t(x),

$$\beta(t) = 1 - E_1\, a_t(x) = P[O(x) \leq t\,|\,y = 1]$$

Another representation of IP is then this. As t runs from −∞ to ∞ the points

$$\left(\alpha(t),\; 1 - \beta(t)\right) \qquad (34)$$

trace out a (theoretical Receiver Operating Characteristic) curve in [0, 1]² (the theoretical version of the step function defined by points in display (31) made from ordered test cases in order to compute the empirical AUC). The ordinary integral over [0, 1] of the function defined by that parametric curve is IP, and therefore the "higher" that parametric curve, the larger is the (theoretical) IP.

But consider the convex body in [0, 1]² defined by all pairs (α, 1 − β) corresponding to possible classifiers/tests (we may need to allow randomization here). (This is a reflection of the set of all points (α, β) comprising the 0-1 loss risk set of all possible classifiers/tests.) The upper boundary of that convex body (that corresponds to the lower boundary of the risk set) comes from Bayes classifiers/tests. It is guaranteed to lie "above" (at least as high as) the parametric curve defined in display (34). But the form of optimal (Bayes and Neyman-Pearson) tests/classifiers is well-known. We have already said that an optimal classifier in the present context is as in display (28) and this brings us to the conclusion: any O that is a monotone increasing transformation of the likelihood ratio L(x) (is equivalent to the likelihood ratio) will optimize IP.

1.5.3 "Voting Functions," Losses for Them, and Expected 0-1 Loss
The fact that empirical search for a good 2-class classifier is essentially search for a good approximation to the likelihood ratio function L(x) raises another kind of consideration for 2-class problems. That is the possibility of focusing on the building of a good "voting function" g(x) to underlie a classifier.

For the time being, it's now convenient to employ the −1/1 coding of class labels (use G = {−1, 1}) and to without much loss of generality consider classifiers defined for an arbitrary voting function g(x) by

f(x) = sign(g(x))

(except for the possibility that g(x) = 0, that typically has 0 probability for both classes). Then an optimal voting function for 0-1 loss is

$$g^{\mathrm{opt}}(x) = \frac{p(x|1)\,P[y=1]}{p(x|-1)\,P[y=-1]} \qquad (35)$$

(or, equivalently for purposes of classification by sign(g(x)), any monotone transform of this ratio that is positive exactly when the ratio exceeds 1, such as its logarithm, so that sign(g^opt(x)) reproduces the optimal classifier (28)). With this notation, a classifier f(x) = sign(g(x)) produces 0-1 loss neatly written as

L(ŷ, y) = I[yg(x) < 0]

(a loss of 1 is incurred when y and g(x) have opposite signs). So the 0-1 loss expected loss/error rate has the useful representation

$$E\, I\left[y\,g(x) < 0\right] \qquad (36)$$

We have seen that a function g optimizing the average value (36) is g^opt(x) defined in (35). But the indicator function I[u < 0] involved in (36) is discontinuous (and thus non-differentiable), and for some purposes it would be more convenient to work with a continuous (even differentiable) one in making an empirical choice of voting function.

If I[u < 0] ≤ h(u), it is obvious that

$$E\, I\left[y\,g(x) < 0\right] \leq E\, h\left(y\,g(x)\right) \qquad (37)$$

So the right hand side of display (37) functions as an upper bound for the 0-1 loss error rate and an approximate (data-based) minimizer of that right hand side used as a voting function can be expected to control 0-1 loss error rate.
Several different continuous choices of "loss" h(u) can be viewed as motivating popular methods of (voting function and) classifier development. These include:

1. h₁(u) = ln(1 + exp(−u))/ln(2), a function related to a Bernoulli negative log-likelihood term when yg(x) is substituted for u,

2. h₂(u) = exp(−u) (the "exponential loss") associated with the AdaBoost.M1 algorithm, and

3. h₃(u) = (1 − u)₊ (the "hinge loss") associated with "support vector machines."

For reference, the indicator function I[u < 0] and the functions h₁(u), h₂(u), and h₃(u) are plotted together in Figure 4.

Figure 4: "Losses" I[u < 0] in black, h₁(u) in red, h₂(u) in blue, and h₃(u) in green.

One reason why this line of argument proves effective is that not only does bound (37) hold, but minimizers of Eh(yg(x)) over choice of function g for

standard choices of h with h(u) ≥ I[u < 0] are directly related to the likelihood ratio. This can be seen using the results concerning optimal predictors in 2-class classification models from Section 1.5.2. That is,

$$\begin{aligned} E\,h_1(y\,g(x)) = E\,L_1(g(x), y) &\ \text{has optimizer}\ g_1^{\mathrm{opt}}(x) = \ln\left(\frac{P[y=1|x]}{P[y=-1|x]}\right),\\ E\,h_2(y\,g(x)) = E\,L_2(g(x), y) &\ \text{has optimizer}\ g_2^{\mathrm{opt}}(x) = \frac{1}{2}\ln\left(\frac{P[y=1|x]}{P[y=-1|x]}\right), \text{ and}\\ E\,h_3(y\,g(x)) = E\,L_3(g(x), y) &\ \text{has optimizer}\ g_3^{\mathrm{opt}}(x) = \mathrm{sign}\left(\frac{P[y=1|x]}{P[y=-1|x]} - 1\right)\end{aligned}$$

The first two functions are monotone transformations of the likelihood ratio and when used as voting functions produce a (0-1 loss) optimal classifier. The third is the optimal classifier itself.

So empirical search for optimizers of (an empirical version of) the risk Eh(yg(x)) can produce good classifiers. This has the fascinating effect of making SEL prediction and classification look very much alike. Ultimately, in development of a predictor, one is searching among some class of functions, S, for a real-valued g making an appropriate empirical approximation of a risk measure small.

1.6 Density Estimation and Approximately Optimal and Naive Bayes Classification

As another preliminary, we make a few comments on the problem of density estimation. One might phrase the problem of describing structure for x in terms of estimating a pdf for the variable. And a naive way of approximating a theoretically optimal classifier might be to directly estimate both class probabilities and class conditional densities and use them in place of their estimands in the optimal form (2) to produce

$$\hat{f}(x) = \arg\max_k \widehat{P[y=k]}\;\widehat{p(x|k)} \qquad (38)$$

So we consider the problem:

given x₁, x₂, ..., x_N iid with (unknown) pdf q(x), how to estimate q?

Initially suppose that p = 1. For g(·) some fixed pdf (like, for example, the standard normal pdf), invent a location-scale family of densities on ℜ by defining (for "bandwidth" λ > 0)

$$h(\cdot\,|\,\mu, \lambda) = \frac{1}{\lambda}\,g\!\left(\frac{\cdot - \mu}{\lambda}\right)$$

One may think of a corresponding "kernel" (this is a potentially different usage of the word "kernel" than that in Section 1.4.3 and no non-negative definiteness of the function is needed or assumed)

$$K_\lambda(x, z) \equiv \frac{1}{\lambda}\,g\!\left(\frac{x - z}{\lambda}\right)$$

The Parzen estimate of q(x₀) is then

$$\hat{q}(x_0) = \frac{1}{N}\sum_{i=1}^{N} h(x_0\,|\,x_i, \lambda) = \frac{1}{N}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

an average of kernel values.
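The p = 1 Parzen estimate is a very short piece of code; the R sketch below codes it directly from the formula with a standard normal g and a hypothetical training sample.

```r
# p = 1 Parzen density estimate with standard normal g and bandwidth lambda,
# evaluated on a grid (toy training values).
set.seed(9)
x_train <- c(rnorm(60, -1, 0.5), rnorm(40, 2, 1))   # N = 100 training values

parzen <- function(x0, x, lambda) mean(dnorm((x0 - x) / lambda) / lambda)

grid  <- seq(-4, 5, length.out = 200)
q_hat <- sapply(grid, parzen, x = x_train, lambda = 0.4)
# plot(grid, q_hat, type = "l")   # try other lambdas to see the bandwidth effect
```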


A standard choice of univariate density g ( ) is ( ), the standard normal
pdf. A way to think about the density estimate that results from using a
normal kernel is as representing the distribution of "a random choice from the
training set perturbed by a mean 0 normal error with standard deviation equal
to the bandwidth." If the bandwidth is extremely small, the density estimate
will essentially consist of "spikes" at the xi in the training set. If it is extremely
large, the density estimate will essentially consist of a normal density centered
around the mean of the xi . Useful bandwidths will be neither extremely small
nor extremely large.
Figure 5 provides a p = 1 pdf q (x) (in black), a sample of size N = 100
from the distribution and three Parzen estimates of q made with g ( ) = ( )
and bandwidths = :2 (red); :4 (blue), and :5 (green).
The natural generalization of this to p dimensions is to use a MVN_p density as a kernel (scaled according to λ). One should expect that unless N is huge, this methodology will be reliable only for fairly small p (say 3 at most) as a means of estimating a general p-dimensional pdf. Figure 6 provides two representations of a bivariate density. Figure 7 then shows several samples of size N = 100 from the density and corresponding bivariate kernel density estimates made with software-default choices of multivariate bandwidth covariance matrices.

Figure 5: A p = 1 density, a corresponding sample of N = 100 values x, and three density estimates based on different bandwidths.

Figure 6: Two representations of a particular 2-d pdf (a mixture of two bivariate normal densities).
In the event that N is big, dimension p is low, and K is small, one might at least consider estimating the densities p(x|k) (with, say, kernel density estimates p̂(x|k)), using relative frequencies of values of y (say P̂[y = k]) to estimate the probabilities P[y = k], and employing a classifier like the one in display (38). Figure 8 shows, for a sample of size N = 100 from both the bivariate density of Figure 6 and a uniform density on [−3, 3]², density estimates and their ratio (an approximate likelihood ratio). Since the denominator density is uniform, the actual likelihood ratio is equivalent to the density portrayed in Figure 6 and it seems like (for this sample at least) an approximation to a Bayes classifier based on density estimates might work reasonably well in this particular problem.

It is worth considering the form that such estimated-density-approximately-Bayes classifiers take in the case where symmetric Gaussian kernels are used.

Figure 7: 6 samples of size N = 100 from the bivariate density of Figure 6
and density estimates made using the kde2d function in the MASS package with
default choice of "bandwidth" covariance matrix.

Figure 8: Two N = 100 density estimates and their ratio for classifications between Uniform [−3, 3]² and the distribution of Figure 6.

That is, consider the case where one uses as a multivariate density estimate

$$\hat{q}(x) = \frac{1}{N}\sum_{i=1}^{N} \phi\!\left(x\,|\,x_i,\, \lambda^2 I\right)$$

(where φ(·|μ, Σ) is the MVN_p density with mean vector μ and covariance matrix Σ). A bit of algebra shows that with this kind of multivariate estimate of class-conditional densities (based on the parts of the training set with y = k) (and using training set relative frequencies to estimate class probabilities) the approximately Bayes classifier is

$$\hat{f}(x) = \arg\max_k \sum_{i \text{ s.t. } y_i = k} \exp\left(-\frac{1}{2\lambda^2}\|x - x_i\|^2\right)$$

This is a plausible kind of form, classifying to class k when x is "close to" relatively many training inputs from class k. The bandwidth λ might be chosen based on cross-validation of classifier performance.
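The displayed classifier amounts to comparing class-wise sums of Gaussian kernel values; the R sketch below implements it for hypothetical 2-class data in ℜ² with an arbitrarily chosen bandwidth.

```r
# Estimated-density approximately-Bayes classifier with symmetric Gaussian
# kernels of common bandwidth lambda (toy 2-class data in R^2).
set.seed(10)
n <- 100
X <- rbind(matrix(rnorm(2 * n, 0, 1), ncol = 2),
           matrix(rnorm(2 * n, 2, 1), ncol = 2))
y <- rep(0:1, each = n)

classify <- function(x0, X, y, lambda = 0.5) {
  classes <- sort(unique(y))
  scores <- sapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    d2 <- rowSums((Xk - matrix(x0, nrow(Xk), 2, byrow = TRUE))^2)
    sum(exp(-d2 / (2 * lambda^2)))             # class-k sum of kernel values
  })
  classes[which.max(scores)]
}

classify(c(1, 1), X, y)   # classify a new input
```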
The statistical folklore is that this kind of classifier can work poorly in high dimensions because of the imprecisions (large variances) of the density estimators. The "estimated density" approximations to the optimal rule are based on what are usually low-bias-but-high-variance estimators. As such, the corresponding classifiers are very flexible, but can perform poorly for small training sets. Less flexible classification methods will often perform much better in practical problems (although those methods may be incapable of approximating the optimal rule for all cases, even if N is huge).

There is a variant of form (38) that is thought to sometimes be effective even when p is not small (and p-dimensional density estimation is hopeless). The basic idea is to estimate 1-dimensional marginals of the p(x|k)s and use their products in place of the p̂(x|k)s. That is, if for each k the density p(x|k) : ℜ^p → ℜ⁺ has marginal densities p₁(x₁|k), p₂(x₂|k), ..., p_p(x_p|k) (each mapping ℜ → ℜ⁺), while it may not be feasible to estimate p(x|k), it could be possible to effectively estimate p₁(x₁|k), p₂(x₂|k), ..., p_p(x_p|k). If this is the case, the classifier

$$\hat{f}(x) = \arg\max_k \widehat{P[y=k]}\prod_{j=1}^{p} \widehat{p_j(x_j|k)}$$

might be employed. (That is, one might treat elements x_j of x as if they were independent for every k, and multiply together kernel estimates of marginal densities.) This has been called a "naive Bayes" classifier.
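A minimal version of this naive Bayes idea can be coded with R's base density() for the marginal kernel estimates; the data below are hypothetical and the function name is ours.

```r
# Naive Bayes with kernel-estimated one-dimensional marginals and relative
# frequencies as class-probability estimates (toy 2-class data in R^2).
set.seed(11)
n <- 200
X <- rbind(matrix(rnorm(2 * n, 0, 1), ncol = 2),
           matrix(rnorm(2 * n, 2, 1), ncol = 2))
y <- rep(0:1, each = n)

naive_bayes_classify <- function(x0, X, y) {
  classes <- sort(unique(y))
  scores <- sapply(classes, function(k) {
    prior <- mean(y == k)
    margs <- sapply(1:ncol(X), function(j) {
      d <- density(X[y == k, j])                  # kernel estimate of p_j(x_j|k)
      approx(d$x, d$y, xout = x0[j], rule = 2)$y  # evaluate it at x0[j]
    })
    prior * prod(margs)
  })
  classes[which.max(scores)]
}

naive_bayes_classify(c(1, 1), X, y)
```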
The method seems to have a reputation for often being useful. But there
will certainly be situations where it doesn’t work very well because of failure
to account for strong dependencies between input variables. Figure 9 shows
the common marginal for x1 and x2 corresponding to the distribution of Figure
6. Figure 10 then shows the original bivariate density and the distribution of
independence with the same marginals. The product density is clearly quite
different from the original and estimation of the marginals alone can at best
only reproduce the product form.

Figure 9: The common marginal pdf for both x1 and x2 for the bivariate distribution of Figure 6.

Figure 10: Original bivariate density from Figure 6 and a product density based
on the marginal(s) (as pictured in Figure 9).

1.7 Plotting to Portray the Effects of Particular Inputs in Prediction

An issue briefly discussed in HTF Ch 10 is the making and plotting of functions of a few of the coordinates of x in an attempt to understand the nature of the influence of these in a predictor. (What they say is really perfectly general, not at all special to the particular form of predictors discussed in that chapter.) If, for example, I want to understand the influence the first two coordinates of x have on a form f(x), I might think of somehow averaging out the remaining coordinates of x. One theoretical implementation of this idea would be

$$\bar{f}_{12}(x_1, x_2) = E_{(x_3, x_4, \ldots, x_p)}\, f(x_1, x_2, x_3, x_4, \ldots, x_p)$$

i.e. averaging according to the marginal distribution of the excluded input variables. An empirical version of this is

$$\frac{1}{N}\sum_{i=1}^{N} f(x_1, x_2, x_{3i}, x_{4i}, \ldots, x_{pi})$$

This might be plotted (e.g. in contour plot fashion) and the plot called a partial dependence plot for the variables x₁ and x₂. HTF's language is that this function details the dependence of the predictor on (x₁, x₂) "after accounting for the average effects of the other variables." This thinking amounts to a version of the kind of thing one does in ordinary factorial linear models, where main effects are defined in terms of average (across all levels of all other factors) means for individual levels of a factor, two-factor interactions are defined in terms of average (again across all levels of all other factors) means for pairs of levels of two factors, etc.
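The empirical partial dependence function is simple to compute for any fitted predictor; in the R sketch below, f is a stand-in for whatever prediction function was actually trained (it is assumed to accept a matrix of inputs and return a vector of predictions), and X is the training input matrix.

```r
# Empirical partial dependence of a fitted predictor f on (x1, x2):
# average f(x1, x2, x3i, ..., xpi) over training rows i, on a grid.
partial_dependence_12 <- function(f, X, grid1, grid2) {
  outer(grid1, grid2, Vectorize(function(a, b) {
    Xtmp <- X
    Xtmp[, 1] <- a
    Xtmp[, 2] <- b
    mean(f(Xtmp))        # average over the empirical distribution of x3,...,xp
  }))
}

# Usage (hypothetical f and X_train):
# pd <- partial_dependence_12(f, X_train, seq(0, 1, 0.1), seq(0, 1, 0.1))
# contour(seq(0, 1, 0.1), seq(0, 1, 0.1), pd)
```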
Something different raised by HTF is consideration of

$$\tilde{f}_{12}(x_1, x_2) = E\left[f(x)\,|\,x_1, x_2\right]$$

(This is, by the way, the function of (x₁, x₂) closest to f(x) in L₂(P).) This is obtained by averaging not against the marginal of the excluded variables, but against the conditional distribution of the excluded variables given x₁ and x₂. No workable empirical version of f̃₁₂(x₁, x₂) can typically be defined. And it should be clear that this is not the same as f̄₁₂(x₁, x₂). HTF say this in some sense describes the effects of (x₁, x₂) on the prediction "ignoring the impact of the other variables." (In fact, if f(x) is a good approximation for E[y|x], this conditioning produces essentially E[y|x₁, x₂] and f̃₁₂ is just a predictor of y based on (x₁, x₂).)
The difference between f̄₁₂ and f̃₁₂ is easily seen through resort to a simple example. If, for example, f is additive of the form

f(x) = h₁(x₁, x₂) + h₂(x₃, ..., x_p)

then

$$\bar{f}_{12}(x_1, x_2) = h_1(x_1, x_2) + E\,h_2(x_3, \ldots, x_p) = h_1(x_1, x_2) + \text{constant}$$

while

$$\tilde{f}_{12}(x_1, x_2) = h_1(x_1, x_2) + E\left[h_2(x_3, \ldots, x_p)\,|\,x_1, x_2\right]$$

and the f̃₁₂ "correction" to h₁(x₁, x₂) is not necessarily constant in (x₁, x₂).

The upshot of all this is that partial dependence plots are potentially helpful, but that one needs to remember that they are produced by averaging according to the marginal of the set of variables not under consideration.

2 Some Linear Theory, Linear Algebra, and Principal Components

Methods of modern multivariate statistical learning often involve more background in the theory of linear spaces and linear algebra than is assumed or used in a basic linear models course. So here we provide some of that and then apply it to the unsupervised learning problem of "principal components analysis."

2.1 Inner Product Spaces


Most of applied mathematics in general and statistical machine learning in particular is built on the notions of "linear combinations" of various objects and "inner products" of these (that in turn lead to coherent notions of their "sizes" and of "distances" between them). Here we briefly review what is necessary for a theory of such objects and operations to make sense.

First, a vector (or linear) space V consists of objects v, w, ... such that if v ∈ V and a ∈ ℜ, then the object av makes sense and belongs to V, and for v and w in V the object v + w also makes sense and belongs to V. The archetypal vector spaces are the Euclidean spaces ℜ^p where elements are "ordinary" p-dimensional vectors. But other kinds of vector spaces are useful in statistical machine learning as well, including function spaces. Take for example the set of functions on [0, 1] that have finite integrals of their squares. (This space is sometimes known as L₂([0, 1]).) More or less obviously, if g : [0, 1] → ℜ with ∫₀¹(g(x))² dx < ∞ and a ∈ ℜ, then ag(x) makes sense, maps [0, 1] to ℜ and has ∫₀¹(ag(x))² dx = a²∫₀¹(g(x))² dx < ∞. Further, if g : [0, 1] → ℜ with ∫₀¹(g(x))² dx < ∞ and h : [0, 1] → ℜ with ∫₀¹(h(x))² dx < ∞, then the function g(x) + h(x) makes sense, maps [0, 1] to ℜ and has finite integral of its square.
The notion of an inner product (of pairs of elements of a vector space V) is that of a symmetric (bi-)linear positive definite function ⟨v, w⟩ mapping V × V → ℜ. That is, ⟨v, w⟩ is an inner product on the vector space V if it satisfies

1. ⟨w, v⟩ = ⟨v, w⟩ ∀ v, w ∈ V (symmetry),

2. ⟨av, w⟩ = a⟨v, w⟩ ∀ v, w ∈ V and a ∈ ℜ, and ⟨v + u, w⟩ = ⟨v, w⟩ + ⟨u, w⟩ ∀ u, v, and w ∈ V (bilinearity), and

3. ⟨v, v⟩ ≥ 0 ∀ v ∈ V and ⟨v, v⟩ = 0 if and only if v = 0 (positive definiteness).

Of course Euclidean p-space is a vector space with inner product defined as the "dot-product" of p-dimensional vectors v and w, namely

$$\langle v, w\rangle = v'w = \sum_{j=1}^{p} v_j w_j$$

It is possible to argue that in the case of the L₂([0, 1]) function space, the integral of the product of two elements provides a valid inner product, that is

$$\langle g, h\rangle \equiv \int_0^1 g(x)\,h(x)\, dx$$

satisfies 1. through 3.
An inner product on a vector space V leads immediately to notions of size and distance in the space. The norm (i.e. the "size" or "length") of an element of V can be coherently defined as

$$\|v\| \equiv \sqrt{\langle v, v\rangle}$$

Then the distance between two elements of V can be taken to be the size of the difference between them. That is, the distance between v and w belonging to V (say d(v, w)) derived from the inner product is

d(v, w) = ‖v − w‖

This satisfies all the properties necessary to qualify as a "metric" or "distance function," including the important triangle inequality.

In Euclidean p-space, the norm is the geometrical length of a p-vector (the root of the sum of the p squared entries of the vector) and the associated distance is ordinary Euclidean distance. In the case of the L₂([0, 1]) function space, the norm/size of an element g is

$$\|g\| = \sqrt{\int_0^1 (g(x))^2\, dx}$$

and the distance between elements g and h is

$$d(g, h) = \sqrt{\int_0^1 (g(x) - h(x))^2\, dx}$$

Many other useful notions commonly understood in Euclidean spaces gen-


eralize directly to more abstract vector spaces and inner product spaces. v
and w 2 V are perpendicular or orthogonal when hv; wi = 0. Subspaces of
V can be generated as all linear combinations of a set of elements of V and
are commonly referred to as the "span" of the set of elements. A basis for a
subspace of V is a set of linearly independent vectors (no linear combination
of them is the 0 vector) that span the subspace. "Orthonormal" bases (whose
elements are perpendicular and each of norm 1) for V (or for subspaces of V )
are particularly attractive, as they provide very simple representations for "pro-
jections" of v 2 V onto the span of any set of them, as a linear combination of
basis vectors where coe¢ cients are the inner products with the corresponding
basis vectors. In the context of machine learning, projections of a vector v are
very usefully thought of as "low-dimensional" approximations to v (in terms
of a "few" basis vectors). (The dimension of a subspace of V is, just as in
ordinary Euclidean spaces, the number of vectors in a basis for it.) Geometry
of Euclidean cases (where subspaces are geometrical hyperplanes containing the
origin and geometrical hyperplanes are subspaces potentially shifted from the
origin by addition of a vector not in the subspace) is helpful in interpreting
statistical machine learning constructs in more abstract inner product spaces.

2.2 The (General) Gram-Schmidt Process and the QR


Decomposition of a rank = p Matrix X
We continue to use the notation
0 0 1 0 1
x1 y1
B x02 C B y2 C
B C B C
X = B . C and potentially Y = B .. C
N p @ .. A N 1 @ . A
0
xN yN
and recall that it is standard linear models fare that ordinary least squares
projects Y onto C (X), the column space of X, in order to produce the vector

52
of …tted values 0 1
yb1
B yb2 C
B C
Yb = B .. C
N 1 @ . A
ybN
For many purposes it would be convenient if the columns of a full rank
(rank = p) matrix X were orthogonal. In fact, it would be useful to replace
the N p matrix X with an N p matrix Z with orthogonal columns and having
the property that for each l if X l and Z l are N l consisting of the …rst l columns
of respectively X and Z, then C (Z l ) = C (X l ). Such a matrix can in fact be
constructed using the so-called Gram-Schmidt process. This process generalizes
beyond the present application to <N to general inner product spaces, and in
recognition of that important fact we’ll …rst describe it in general terms and
then consider its implications for a (rank = p) matrix X.
Consider p vectors x1 ; x2 ; : : : ; xp (that could be N -vectors where xj is the
jth column of X)14 . The Gram-Schmidt process proceeds as follows:

1. Set
1=2 1
z 1 = x1 and q 1 = hz 1 ; z 1 i z1 = z1
kz 1 k

2. Having constructed fz 1 ; z 2 ; : : : ; z l 1 g, let


l 1
X l 1
X
hxl ; z j i 1
z l = xl z j = xl xl ; q j q j and q l = zl
j=1
hz j ; zj i j=1
kz l k

Figure 11 illustrates this construction for a simple case of p = 2. z l is


the part of xl that "sticks out of the subspace spanned by z 1 ; z 2 ; : : : ; z l 1 " the
di¤erence between xl and the perpendicular projection of that vector onto the
subspace. q l is the normalized version of z l , the unit vector pointing in the
same direction as z l .
It is easy enough to see that hz l ; z j i = 0 for all j < l (building up the
orthogonality of z 1 ; z 2 ; : : : ; z l 1 by induction), since

hz l ; z j i = hxl ; z j i hxl ; z j i

as at most one term of the sum in step 2. above is non-zero. Further, assume
that the span of fz 1 ; z 2 ; : : : ; z l 1 g is the same as the span of fx1 ; x2 ; : : : ; xl 1 g.
z l is in the span of fx1 ; x2 ; : : : ; xl g so that the span of fz 1 ; z 2 ; : : : ; z l g is a
subset of the span of fx1 ; x2 ; : : : ; xl g. And since any element of the span of
fx1 ; x2 ; : : : ; xl g can be written as a linear combination of an element of the span
of fz 1 ; z 2 ; : : : ; z l 1 g (span of fx1 ; x2 ; : : : ; xl 1 g) and xl we also have that the
1 4 Notice that this is in potential con‡ict with earlier notation that made x the p-vector
i
of inputs for the ith case in the training data. We will simply have to read the following in
context and keep in mind the local convention.

53
Figure 11: A p = 2 illustration of the Gram-Schmidt construction of an ortho-
normal basis for the subspace space spanned by x1 and x2 .

span of fx1 ; x2 ; : : : ; xl g is a subset of the span of fz 1 ; z 2 ; : : : ; z l g. That is that


fz 1 ; z 2 ; : : : ; z l g and fx1 ; x2 ; : : : ; xl g have the same span and the set of vectors

q1 ; q2 ; : : : ; ql

form an orthonormal basis for the span of fx1 ; x2 ; : : : ; xl g.


Since the z j are perpendicular, for any vector w,

Xl Xl
hw; z j i
zj = w; q j q j (39)
j=1
hz j ; z j i j=1

is the projection of w onto the span of fz 1 ; z 2 ; D : : : ; z l g (of fx1 ; x2 ; : : : ; xl g). (ToE


Pl Pl
see this, consider minimization of the quantity w j=1 cj z j ; w j=1 cj z j =
Pl 2 Pl 1
w j=1 cj z j by choice of the constants cj .) In particular, j=1 xl ; q j q j
in step 2. of the Gram-Schmidt process is the projection of xl onto the span of
fx1 ; x2 ; : : : ; xl 1 g.
Consider now the case where Euclidean N -vectors x1 ; x2 ; : : : ; xp are the
columns of a data matrix X. Spans are column spaces of matrices. Indeed the
N p matrix Z has orthogonal columns and the property that C (Z l ) = C (X l ).
And the set of vectors fq 1 ; q 2 ; : : : ; q l g is an orthonormal basis for this column

54
space. So the projection of a vector of outputs Y onto C (X l ) is

Xl Xl
hY ; z j i
zj = Y ; qj qj
j=1
hz j ; z j i j=1

This means that for a full p-variable regression,

hY ; z p i
hz p ; z p i

is the regression coe¢ cient for z p and (since only z p involves it) the last vari-
able in X, xp . So, in constructing a vector of …tted values, …tted regression
coe¢ cients in multiple regression can be interpreted as weights to be applied to
that part of the input vector that remains after projecting the predictor onto
the space spanned by all the others.
The construction of the orthogonal variables z j can be represented in matrix
form as
X = Z
N p N pp p

where is upper triangular with

kj = the value in the kth row and jth column of


8
< 1 if j = k
= hz k ; xj i
: if j > k
hz k ; z k i

De…ning
1=2 1=2
D = diag hz 1 ; z 1 i ; : : : ; hz p ; z p i = diag (kz 1 k ; : : : ; kz p k)

and letting
1
Q = ZD and R = D
one may write
X = QR (40)
that is the so-called QR decomposition of X.
Note that the notation used here is consistent, in that for q j the jth column
1=2
of Q, q j = (hz j ; z j i) z j as was used in de…ning the Gram-Schmidt process.
In display (40), Q is N p with

Q0 Q = D 1
Z 0 ZD 1
=D 1
diag (hz 1 ; z 1 i ; : : : ; hz p ; z p i) D 1
=I

consistent with the fact that Q has for columns perpendicular unit vectors that
form a basis for C (X). R is upper triangular and that says that only the …rst
l of these unit vectors are needed to create xl .

55
The decomposition is computationally useful in that the projection of a
response vector Y onto C (X) is
p
X
Yb = Y ; q j q j = QQ0 Y (41)
j=1

and
b ols = R 1
Q0 Y
(The fact that R is upper triangular implies that there are e¢ cient ways to
compute its inverse.)

2.3 The Singular Value Decomposition of X


If the N p matrix X has rank r then it has a so-called singular value decom-
position as
X = U D V0
N p N rr rr p

where U has orthonormal columns (left singular vectors) spanning C (X), V


has orthonormal columns (right singular vectors) spanning C X 0 (the row
space of X), and D = diag (d1 ; d2 ; : : : ; dr ) for

d1 d2 dr > 0

the dj are the "singular values" of X.15


An interesting property of the singular value decomposition is this. If U l and
V l are matrices consisting of the …rst l r columns of U and V respectively,
then
X l = U l diag (d1 ; d2 ; : : : ; dl ) V 0l
is the best (in the sense of squared distance from X in <N p ) rank = l approx-
imation to X. (Note that application of this kind of argument to covariance
matrices provides low-rank approximations to complicated covariance matrices.)
Since the columns of U are an orthonormal basis for C (X), the projection
of an output vector Y onto C (X) is
r
X
ols
Yb = hY ; uj i uj = U U 0 Y (42)
j=1

In the full rank (rank = p) X case, this is of course, completely parallel to


representation (41) and is a consequence of the fact that the columns of both U
and Q (from the QR decomposition of X) form orthonormal bases for C (X).
In general, the two bases are not the same.
1 5 For a real non-negative de…nite square matrix (a covariance matrix), the singular value
decomposition is the eigen decomposition, U = V , columns of these matrices are unit eigen-
vectors, and the SVD singular values are the corresponding eigenvalues.

56
Now using the singular value decomposition of a full rank (rank p) X,

X 0 X = V D 0 U 0 U DV 0
= V D2 V 0 (43)

which is the eigen (or spectral) decomposition of the symmetric and positive
de…nite X 0 X. (The eigenvalues are the squares of the SVD singular values.)
The vector
z 1 Xv 1
is the product Xw with the largest squared length in <N subject to the con-
straint that kwk = 1: A second representation of z 1 is
0 1
1
B 0 C
B C
z 1 = Xv 1 = U DV 0 v 1 = U D B . C = d1 u1
@ .. A
0

and we see that this largest squared length is d21 and the vector points in the
direction of u1 . In general,
0 1
hx1 ; v j i
B .. C
z j = Xv j = @ . A = dj uj (44)
hxN ; v j i

is the vector of the form Xw with the largest squared length in <N subject to
the constraints that kwk = 1 and hw; z l i = 0 for all l < j. The squared length
is d2j and the vector points in the direction of uj .

2.3.1 The Singular Value Decomposition and General Inner Product


Spaces
It is potentially useful to consider the relevance of the SVD for matrices to
geometry in abstract inner product spaces (e.g. because of the machine learning
practice of adopting features that are not elements of a Euclidean space, but
rather functions). So, suppose that N vectors w1 ; w2 ; : : : ; wN span a subspace
of the inner product space A of dimension r and that e1 ; e2 ; : : : ; er form an
orthonormal basis for that subspace. Consider then the matrix

X = hwi ; ej iA i=1;2;:::;N (45)


N r j=1;2;:::;r

X represents the vectors w1 ; w2 ; : : : ; wN in the sense that its rows give coe¢ -
cients to be applied to the elements of the orthonormal basis (the es) in order
to make linear combinations that are the ws.

57
Now, as above, consider the SVD of X and some related elements of A.
Begin with elements of A related to the right singular vectors v j 2 <r . Corre-
sponding to them are vectors
r
X
aj = vjl el
l=1

(the real entries of v j supplying coe¢ cients for the es in order to make up aj
as a linear combination of the basis vectors). Notice that
r
X
haj ; aj 0 iA = vjl vj 0 l = I [j = j 0 ]
l=1

and so a1 ; a2 ; : : : ; ar form a second orthonormal basis for A.


Now
N
X 2
2
hwi ; a1 iA = (hwi ; a1 iA )i=1;2;:::;N
i=1

r
! 2
X
= v1l hwi ; el iA
l=1 i=1;2;:::;N
2
= kXv 1 k
has the maximum value of
N
* r
+2 * r
+ ! 2
X X X
wi ; cl el = wi ; cl el
i=1 l=1 A l=1 A i=1;2;:::;N

r
! 2
X
= cl hwi ; el iA
l=1 i=1;2;:::;N
2
= kXck
Pr
possible for c a unit vector in <r and thus l=1 cl el a unit vector in A. That is,
a1 is a unit vector in A pointing in a direction such that the projections of the
wi onto the 1-dimensional subspace of multiples of it have the largest possible
sum of squared norms. In general, aj is a unit vector in A perpendicular to all
of a1 ; a2 ; : : : ; aj 1 with maximum sum of squared norms for the projections of
the wi onto the 1-dimensional subspace of multiples of it.
In a case where a transform T maps <p to an inner product space A, and
one is interested in the subspace of A of dimension r N spanned by the image
of the set of training input vectors
fT (x1 ) ; T (x2 ) ; : : : ; T (xN )g ;
with wi = T (xi ), the foregoing then translates the SVD of the matrix (45)
into abstract inner product space geometrical insights concerning transformed
training vectors.

58
2.4 Matrices of Centered Columns and Principal Compo-
nents
In the event that all the columns of X have been centered (each 10 xj = 0 for
xj the jth column of X), there is additional terminology and insight associated
with singular value decompositions as describing the structure of X. Note
that centering is often sensible in unsupervised learning contexts because the
object is to understand the internal structure of the data cases xi 2 <p , not
the location of the data cloud (that is easily represented by the sample mean
vector). So accordingly, we …rst translate the data cloud to the origin.
Principal components ideas are then based on the singular value decom-
position of X
X = U D V0
N p N rr rr p

(and related spectral/eigen decompositions of X 0 X and XX 0 ).

2.4.1 "Ordinary" Principal Components


The columns of V (namely v 1 ; v 2 ; : : : ; v r ) are called the principal component
directions in <p of the xi , and the elements of the vectors z j 2 <N from display
(44), namely the inner products hxi ; v j i, are called the principal components
of the xi . (The ith element of z j , hxi ; v j i, is the value of the jth principal
component for case i, or the corresponding principal component score. The
entries of the p 1 vector v j are sometimes called the component weights or
loadings for the jth component. A 0 loading means that the corresponding
column of X is ignored in the creation of z j .) Notice that hxi ; v j i v j is the
projection of xi onto the 1-dimensional space spanned by v j .
Figure 12 provides a summary of the language just introduced. (The N p
matrix of inner products hxi ; v j i is U D.)

Figure 12: Summary of principal components language.

Figure 13 shows scatterplots of a raw (red) and corresponding standardized


(blue) p = 2 dataset. The red arrow points in the direction of the raw data

59
…rst right singular vector (i.e. points "at" the raw data). The blue arrow is in
the …rst principal component direction of the standardized data (pointing
in the direction of their greatest variation).

Figure 13: Example of a small p = 2 dataset (red dots) and standardized version
(blue dots) and (multiples of) the …rst right singular vector of the dataset and
the …rst principal direction of the standardized dataset.

It is worth thinking a bit more about the form of the product


l
X = U l diag (d1 ; d2 ; : : : ; dl ) V 0l

that we’ve already said is the best rank l approximation to X. In fact it is


0 0 1
v1
X l l
X B v 02 C
B C
X l= dj uj v 0j = z j v 0j = (z 1 ; z 2 ; : : : ; z l ) B . C
j=1 j=1
@ .. A
v 0l
Pl
and its ith row is j=1 hxi ; v j i v 0j , which (since the v j are orthonormal) is the
transpose of the projection of xi onto C (V l ). That is,
0 1 0 1 0 1
hx1 ; v 1 i hx1 ; v 2 i hx1 ; v l i
B .. C 0 B .. C 0 B .. C 0
X l=@ . A v1 + @ . A v2 + +@ . A vl
hxN ; v 1 i hxN ; v 2 i hxN ; v l i
= z 1 v 01 + z 2 v 02 + + z l v 0l
= Xv 1 v 01 + Xv 2 v 02 + + Xv l v 0l

60
a sum of rank 1 summands, producing for X l a matrix with each xi in X
replaced by the transpose of its projection onto C (V l ).
Since z j = dj uj , z j v 0j = dj uj v 0j . Then since the uj s and v j s are unit
vectors, the sum of squared entries of both z j and z j v 0j is d2j . These are non-
increasing in j. So the z j and z j v 0j decrease in "size" with j, and directions
v 1 ; v 2 ; : : : ; v r are successively "less important" in describing variation in the
xi and in reconstructing X. This agrees with common interpretation of cases
where a few singular values are much bigger than the others. There "simple
structure" in the data is that observations can be more or less reconstructed as
linear combinations of a few orthonormal vectors.
Figure 14 portrays a hypothetical p = 3 dataset. Shown are the N = 9 data
points, the rank = 1 approximation (black balls on the line de…ned by the …rst
PC direction) and the rank = 2 approximation (black stars on the plane).

Figure 14: Principal components approximations to a p = 3 dataset.

Izenman, in his discussion of "polynomial principal components" points out


that in some circumstances the existence of a few very small singular values can
also identify important simple structure in a dataset. Suppose, for example,
that all singular values except dp 0 are of appreciable size. One simple feature
of the dataset is then that all hxi ; v p i 0, i.e. there is one linear combination
of the p coordinates xj that is essentially constant (namely hx; v p i). The
data fall nearly on a (p 1)-dimensional hyperplane in <p . In cases where
the p coordinates xj are not functionally independent (for example consisting
of centered versions of 1) all values, 2) all squares of values, and 3) all cross
products of values of a smaller number of functionally independent variables), a

61
single "nearly 0" singular value identi…es a quadratic function of the functionally
independent variables that must be essentially constant, a potentially useful
insight about the dataset.
To summarize interpretation of principal components of a centered dataset,
one can say the following:

Principal components analysis amounts to the development of an


alternative coordinate system in which to represent a p-dimensional
dataset. One e¤ectively …nds a rotation of the original coordi-
nate system to a new one where axes are de…ned by the p-vectors
v 1 ; v 2 ; : : : ; v r in which variation of the data in the directions v j
decreases with increasing j (as much as possible with each incre-
ment of j). The N -vectors uj are unit vectors and their multiples
z j = dj uj are the vectors of coordinates of the N data vectors in the
new/rotated coordinate system. (And the dj are the magnitudes of
these vectors of new coordinates in <N .)

X 0 X and XX 0 and Principal Components The singular value decom-


position of X means that both X 0 X and XX 0 have useful representations in
terms of singular vectors and singular values. Consider …rst X 0 X (that is most
of the sample covariance matrix). As noted in display (43), the SVD of X
means that
X 0 X = V D2 V 0
and it’s then clear that the columns of V are eigenvectors of X 0 X and the
squares of the diagonal elements of D are the corresponding eigenvalues. An
eigen analysis of X 0 X then directly yields the principal component directions
of the data, and through the further computation of the inner products in (44),
the principal components z j (and hence the singular vectors uj ) are available.
Note that
1 0
XX
N
is the (N -divisor) sample covariance matrix16 for the p input variables x1 ; x2 ; : : : ; xp .
The principal component directions of X in <p , namely v 1 ; v 2 ; : : : ; v r , are also
unit eigenvectors of the sample covariance matrix. The squared lengths of
the principal components z j in <N divided by N are the (N -divisor) sample
variances of entries of the z j , and their values are

1 0 1 d2j
z j z j = dj u0j uj dj =
N N N
The SVD of X also implies that

XX 0 = U DV 0 V DU 0 = U D 2 U 0
1 6 Notice that when X has standardized columns (i.e. each column of X, x , has
j
1
h1; xj i = 0 and hxj ; xj i = N ), the matrix N X 0 X is the sample correlation matrix for the p
input variables x1 ; x2 ; : : : ; xp .

62
and it’s then clear that the columns of U are eigenvectors of XX 0 and the
squares of the diagonal elements of D are the corresponding eigenvalues. U D
then produces the N r matrix of principal components of the data. The
principal component directions are unavailable (even indirectly) based only on
this second eigen analysis.

2.4.2 "Kernel" Principal Components


Consider …rst the possibility of using a nonlinear function : <p ! <M to map
data vectors x to (a usually higher-dimensional) vector of features (x). Of
course, this creates a new N M data/feature matrix
0 0 1
(x1 )
B 0 (x2 ) C
B C
=B .. C
@ . A
0
(xN )
with entries of belonging to <. After centering via

e = 1 1
J = I J (46)
N N

for J an N N matrix of 1s, one can make a SVD of e , producing singular


values and both sets of singular vectors for the new feature matrix.
Now, thinking as in Section 1.4.3, suppose K is a kernel function and one
maps data vectors x to elements K (x; ) in the abstract (function) feature space
A. One can think of …nding "principal components" for the transformed train-
ing set in this feature space. First, the function
N
1X
K( ) K (xi ; )
N i=1

is a well-de…ned linear combination of the images of the training set in A and


therefore a sensible "center" of the transformed training set. The functions
K (xi ; ) K( ) (47)
are then sensible centered abstract feature values for the training set. Next,
corresponding to the matrix of inner products for a centered set of N points
in a Euclidean space is the N N matrix of inner products of these centered
feature values in the abstract space A,
C K (xi ; ) K ( ) ; K (xj ; ) K( ) A i=1;:::;N
(48)
j=1;:::;N

Then using the basic reproducing kernel fact that hK (x; ) ; K (z; )iA = K (x; z)
and the notation K for the Gram matrix (21), it is easy enough to …nd the
representation
1 1 1
C=K JK KJ + 2 J KJ (49)
N N N

63
for the symmetric non-negative de…nite C. Finally, an eigen analysis will
produce principal components (N vectors of length N of scores) for the training
data expressed in the abstract feature space.
To realize the entries in these eigen vectors of kernel principal component
scores as inner products of the N functions (47) with "principal component
directions" in the abstract feature space, A, one may return to Section 2.3.1
and begin with any orthonormal basis E1 ( ) ; E2 ( ) ; : : : ; EN ( ) for the span of the
functions (47) (coming, for example, from use of the Gram-Schmidt process).
Then the general inner product space argument beginning with an N N matrix
with entries K (xi ; ) K ( ) ; Ej ( ) A produces N basis functions V1 ( ) ; V2 ( ) ;
: : : ; VN ( ) whose A inner products with functions (47) are (up to a sign for each
Vj ( )) the entries of the eigen vectors of C. In cases with small p it may be of
interest to examine these abstract principal component direction functions via
some plotting.

2.4.3 Graphical (Spectral) Features


Another variant of principal components ideas concerns "graphical spectral fea-
tures" of a dataset built on thinking of data cases as corresponding to vertices
on a graph. This material has emphases in common with the local version of
multi-dimensional scaling treated in Section 17.3, and can sometimes provide a
way to separate "unconventional" but distinct structures of data points in <p .
The basic motivation is to not necessarily look for "convex" groups of points in
p-space, but rather for "roughly connected"/"contiguous" sets of points of any
shape in p-space.
Begin with N vectors x1 ; x2 ; : : : ; xN in <p . Consider weights wij = w (kxi xj k)
for a decreasing function w : [0; 1) ! [0; 1] and use them to de…ne similar-
ities/adjacencies sij . (For example, we might use w (d) = exp d2 =c for
some c > 0.) Similarities can be exactly sij = wij , but can be even more
"locally" de…ned as follows. For …xed k consider the symmetric set of index
pairs
the number of j 0 with wij 0 > wij is less than k
Nk = (i; j) j
or the number of i0 with wi0 j > wij is less than k
(an index pair is in the set if one of the items is in the k-nearest neighbor
neighborhood of the other). One might then de…ne sij = wij I [(i; j) 2 Nk ].
In any case, we’ll call the matrix
S = (sij ) i=1;:::;N
j=1;:::;N

the adjacency matrix, and use the notation


N
X
gi = sij
j=1

and
G = diag (g1 ; g2 ; : : : ; gN )

64
It is common to think of the points x1 ; x2 ; : : : ; xN in <p as nodes/vertices on
a graph, with edges between nodes weighted by similarities sij , and the gi so-
called node degrees, i.e. sums of weights of the edges connected to nodes i.
In such thinking, sij = 0 indicates that there is no "edge" between case i and
case j.
The matrix
L=G S
is called the (unnormalized) graph Laplacian, and one standardized (with
respect to the node degrees) version of this is
e =G
L 1
L=I G 1
S
and a second standardized version is
1=2 1=2 1=2 1=2
L =G LG =I G SG (50)
Note that for any vector u,
N
X N X
X N
0
u Lu = gi u2i ui uj sij
i=1 i=1 j=1
0 1
N N N X N N X
N
1 @X X X X
= sij u2i + sij u2j A ui uj sij
2 i=1 j=1 j=1 i=1 i=1 j=1
N N
1 XX 2
= sij (ui uj ) (51)
2 i=1 j=1

so that the N N symmetric L is nonnegative de…nite. Consider the spec-


tral/eigen decomposition of L and focus on the small eigenvalues. For v 1 ; : : : ; v m
eigenvectors corresponding to the 2nd through (m + 1)st smallest non-zero eigen-
values (since L1 = 0 there is an uninteresting 0 eigenvalue), let
V = (v 1 ; : : : ; v m )
These are "graphical spectral features" and one might think of cases with simi-
lar rows of V as "alike." As we noted in the discussion in Section 2.4.1, small
eigenvalues are associated with linear combinations of columns of L that are
close to 0.
Why should this work to identify connected structures in a training set? For
v l a column of V that is a eigenvector of L corresponding to a small eigenvalue
l , by virtue of relationship (51)

N N
1 XX 2
l = v 0l Lv l = sij (vli vlj ) 0 (52)
2 i=1 j=1

and points xi and xj with large adjacencies must have similar corresponding
coordinates of the eigenvectors. HTF (at the bottom of their page 545) essen-
tially argue that the number of "0 or nearly 0" eigenvalues of L is indicative

65
of the number of connected structures in the original N data vectors. A series
of points could be (in sequence) close to successive elements of the sequence
but have very small adjacencies for points separated in the sequence. "Struc-
tures" by this methodology need NOT be "clumps" of points, but could also be
serpentine "chains" of points in <p .
A second version of this is easily built on the symmetric normalized
Laplacian (50), L . Its eigenvalues are nonnegative and it has a 0 eigen-
value. Let 1 m be the 2nd through (m + 1)st smallest eigenvalues
and v 1 ; : : : ; v m be corresponding eigenvectors. Then for l such a small non-
negative eigenvalue,
N N 2
1=2 1=2 1 XX v vlj
l = vl 0 L vl = vl 0 G LG vl = sij pli p 0
2 i=1 j=1 gi gj
(53)
and points xi and xj with large adjacencies must have similar corresponding
coordinates of the vector G 1=2 v l . So one might treat vectors G 1=2 v l (or
perhaps normalized versions of them) as a second version of m graphical features.
It is also easy to see that
P G 1S
is a stochastic matrix and thus specifying an N -state stationary Markov Chain.
e = I P identi…es groups
It is plausible that the standardized graph Laplacian L
of states such that transition by such a chain between the groups is relatively
infrequent (the MCMC more typically moves within groups).

Part II
Supervised Learning I: Basic
Prediction Methodology
3 (Non-OLS) SEL Linear Predictors
There is more to say about the development of a linear predictor

fb(x) = x0 ^

for an appropriate ^ 2 <p than what is said in books and courses on ordinary
linear models (where ordinary least squares is used to …t the linear form to all
p input variables or to some subset of M of them). We continue the basic
notation of Section 2, where the (supervised learning) problem is prediction,
and there is a vector of continuous outputs, Y , of interest.

66
3.1 Ridge Regression, the Lasso, and Some Other Shrink-
ing Methods
An alternative to seeking to …nd a suitable level of complexity in a linear pre-
diction rule through subset selection and least squares …tting of a linear form
to the selected variables, is to employ a shrinkage method based on a penalized
version of least squares to choose a vector ^ 2 <p to employ in a linear predic-
tion rule. Here we consider several such methods, all of which have parameters
ols
that function as complexity measures and allow ^ to range between 0 and ^
depending upon complexity.
The implementation of these methods is not equivariant to the scaling used
to express the input variables xj . So that we can talk about properties of the
methods that are associated with a well-de…ned scaling, we assume here that
the output variable has been centered (i.e. that hY ; 1i = 0) and that the
columns of X have been standardized (and if originally X had a constant
column, it has been removed).

3.1.1 Ridge Regression


ridge
For a > 0 the ridge regression coe¢ cient vector b 2 <p is

b ridge = arg min (Y X ) (Y


0
X )+ 0
(54)
2<p

ols
Here is a penalty/complexity parameter that controls how much b is shrunken
towards 0. The unconstrained minimization problem expressed in (54) has an
equivalent constrained minimization description as

b ridge = arg min (Y


0
X ) (Y X ) (55)
t
with k k2 t

ridge 2
for an appropriate t > 0. (Corresponding to used in form (54), is t = b
used in display (55). Conversely, corresponding to t used in form (55), one
may use a value of in display (54) producing the same error sum of squares.)
Figure 15 is a representation of the constrained version of the ridge optimization
problem for p = 2. Pictured are a contour plot for the quadratic error sum of
ols
squares (Y X ) (Y X ) function of , the constraint region for , b
0
ridge
and b t .
The unconstrained form (54) calls upon one to minimize
0 0
(Y X ) (Y X )+

and some vector calculus leads directly to

b ridge = X 0 X + I 1
X 0Y

67
Figure 15: Cartoon Representing the Constrained Version of Ridge Optimiza-
tion for p = 2

So then, using the singular value decomposition of X (with rank = r),


ridge ridge
Yb = Xb
1
= U DV 0 V DU 0 U DV 0 + I V DU 0 Y
1
= U D V 0 V DU 0 U DV 0 + I V DU 0 Y
1
= U D D2 + I DU 0 Y
r
!
X d2j
= 2 hY ; uj i uj (56)
j=1
dj +

Comparing to equation (42) and recognizing that

d2j+1 d2j
0< <1
d2j+1 + d2j +

we see that the coe¢ cients of the orthonormal basis vectors uj employed to
ridge ols
get Yb are shrunken version of the coe¢ cients applied to get Yb . The
most severe shrinking is enforced in the directions of the smallest principal
components of X (the uj least important in making up low rank approximations
to X). Since from representation (56)

r
!2
ridge 2 X d2j
Yb
2
= hY ; uj i
j=1
d2j +

the "size" of the ridge prediction vector for the N centered responses is decreas-
ing in .

68
Notice also from representation (56) that
r
!
ridge X 1
Yb = hY ; Xv j i Xv j
j=1
d2j +
r
!
X 1
=X hY ; Xv j i v j
j=1
d2j +

so that !
r
X
b ridge 1
= hY ; Xv j i v j
j=1
d2j +

and !2
r
X
ridge 2 1
b = hY ; Xv j i
2

j=1
d2j +

which is also clearly decreasing in . An upshot of these facts about "shrinking"


is that one can think of (the penalty parameter) as a complexity parameter
that de…nes paths in <N and <p from OLS predictions and coe¢ cients to degen-
erate (0) ones, passing through a spectrum of plausible (ridge) linear predictors.
There is an interesting "grouping e¤ect" associated with ridge regression.
This is that highly correlated inputs, say xj and xj 0 , (being already standardized
so they have sample standard deviation 1 across the training set) will have
ridge regression coe¢ cients of essentially the same magnitude. This can be
understood as follows. Without loss of generality, assume that xj and xj 0 are
highly positively correlated (so that they are essentially the same variable). For
any regression coe¢ cients j and j 0 and number (including j = ( j + j 0 ))
the contribution of xj and xj 0 to y^ (and thus the error sum of squares) is

j xj + j 0 xj 0 ( j + j 0 ) xj + (1 )( j + j 0 ) xj 0

But the contribution of ( j + j0 ) and (1 )( j + j0 ) to the sum of squared


regression coe¢ cients is

2 2 2 2 2 2 2
( j + j0 ) + (1 ) ( j + j0 ) = + (1 ) ( j + j0 )

which is minimum at = 1=2, where the coe¢ cients for xj and xj 0 are the
same.

69
The function
1
df ( ) = tr X X 0 X + I X0
1
= tr U D D 2 + I DU 0
0 ! 1
X r
d2j
= tr @ 2+ uj u0j A
j=1
dj
0 ! 1
X r
d2j
= tr @ 2+ u0j uj A
j=1
dj

r
!
X d2j
=
j=1
d2j +

is called the "e¤ective degrees of freedom" associated with the ridge regression.
In regard to this choice of nomenclature, note that if = 0 ridge regression
is ordinary least squares and this is r, the usual degrees of freedom associated
with projection onto C (X), i.e. trace of the projection matrix onto this column
space.
As ! 1, the e¤ective degrees of freedom goes to 0 as (the centered)
ridge
Yb goes to 0 (corresponding to a constant predictor). Notice also (for future
ridge ridge 1
reference) that since Yb = Xb = X X 0X + I X 0 Y = M Y for
1
M = X X 0X + I X 0 , if one assumes that
2
CovY = I

(conditioned on the xi in the training data, the outputs are uncorrelated and
have constant variance 2 ) then
N
1 X
e¤ective degrees of freedom = tr (M ) = 2
Cov (^
yi ; yi ) (57)
i=1

This follows since Yb = M Y and CovY = 2 I imply that


0 1 0 1
Yb M
Cov @ A= 2@ A I M 0 jI
Y I
2 MM0 M
=
M0 I

and the terms Cov(^ yi ; yi ) are the diagonal elements of the upper right block
of this covariance matrix. This suggests that tr(M ) is a plausible general
de…nition for e¤ective degrees of freedom for any linear …tting method Yb =
M Y , and that more generally, the last form in form (57) might be used in

70
situations where Yb is other than a linear form in Y . Further (reasonably
enough) the last form is a measure of how strongly the outputs in the training
set can be expected to be related to their predictions.
Further, in the linear case with Yb = M Y ,
N
X @ y^i
e¤ective degrees of freedom = tr (M ) =
i=1
@yi

and we see that the e¤ective degrees of freedom is some total measure of how
sensitive predictions are at the training inputs xi to the corresponding training
values yi . This raises at least the possibility that in nonlinear cases, an approx-
imate/estimated value of the general e¤ective degrees of freedom (57) might be
the random variable
XN
@ y^i
i=1
@yi Y

3.1.2 The Lasso, Etc.


The "lasso" (least absolute selection and shrinkage operator) and some other
¯ ¯ ¯ ¯ ¯
relatives of ridge regression are the result of P
generalizing the
Poptimization criteria
0 2 p 2 p q
(54) and (55) by replacing = k k = j=1 j with j=1 j j j for a q > 0.
That produces
8 9
q
< Xp =
b = arg min (Y X ) (Y X ) + 0
j jj
q
(58)
2<p : ;
j=1

generalizing form (54) and

bq = arg min (Y
0
X ) (Y X ) (59)
t P p q
with j=1 j j j t

generalizing form (55). The so called "lasso" is the q = 1 case of form (58)
and form (59) and in general, these have been called the "bridge regression"
problems. That is, for t > 0

b lasso = argPmin (Y X ) (Y
0
X ) (60)
t p
with j=1 j j j t

Because of the shape of the constraint region


8 9
< Xp =
2 <p j j jj t
: ;
j=1

71
Figure 16: Cartoon Representing the Constrained Version of Lasso Optimization
for p = 2

lasso
(in particular its sharp corners at coordinate axes) some coordinates of b t
ols
are often 0, and the lasso automatically provides simultaneous shrinking of b
toward 0 and rational subset selection. (The same is true of cases of form (59)
with q < 1.)
Figure 16 is a representation of the constrained version of the lasso opti-
mization problem for p = 2. Pictured are a contour plot for the quadratic error
0
sum of squares (Y X ) (Y X ) function of , the constraint region for
ols lasso
, b and b t .
For comparison purposes, Figure 17 provides representations of p = 2 bridge
regression constraint regions for t = 1. For q < 1 the regions not only have
"corners," but are not convex.

Figure 17: p = 2 "bridge" constraint regions for t = 1.

It is not obvious how to produce a useful version of formula (57), i.e.


N
1 X
e¤ective degrees of freedom = 2
Cov (^
yi ; yi )
i=1

for the lasso. But Zhou, Hastie, and Tibshirani in 2007 (AOS ) argued that

72
lasso
this is the mean number of non-zero components of b : Obviously then, the
random variable
lasso
\
df ( ) = the number of non-zero components of b
is an unbiased estimator of the e¤ective degrees of freedom.
There are a number of modi…cations of the ridge/lasso idea. One is the
"elastic net" idea. This is a compromise between the ridge and lasso methods.
For an 2 (0; 1) and some t > 0, this is de…ned by

b enet = arg min (Y X ) (Y


0
X )
;t Pp
j=1 ((1 )
with )j 2
j j+ j t

(The constraint is a compromise between the ridge and lasso constraints.) For
comparison purposes, Figure 18 provides some representations of p = 2 elastic
net constraint regions for t = 3 (made using some code of Prof. Huaiqing Wu)
that clearly show the compromise nature of the elastic net. The constraint
regions have "corners" like the lasso regions but are otherwise more rounded
than the lasso regions.

Figure 18: Some p = 2 elastic net constraint regions for t = 3.

The equivalent unconstrained optimization speci…cation of elastic net …tted


coe¢ cient vectors is for 1 > 0 and 2 > 0
8 9
enet
< Xp Xp =
b = arg min (Y X )
0
(Y X ) + 1 j j j + 2
2
1; 2 j
2<p : ;
j=1 j=1

73
Several sources (including a 2005 JRSSB paper of Zhou and Hastie) suggest
that a modi…cation of the elastic net idea, namely

(1 + 2)
b enet (61)
1; 2

performs better than the original version.


enet
For b 1 ; 2 with r non-zero components and X made up of the corresponding
columns of X, estimated e¤ective degrees of freedom for the unmodi…ed form
of the elastic net are
r
!
1 X d2
j
df \ 0
( 1 ; 2 ) = tr X X X + 2 I 0
X = (62)
j=1
d2j + 2

(for dj s the singular values of X ). The modi…ed form (61) has estimated
e¤ective degrees of freedom (1 + 2 ) times this value (62).
Breiman proposed a di¤erent shrinkage methodology he called the nonneg-
ative garotte that attempts to …nd "optimal" reweightings of the elements of
b ols . That is, for > 0 Breiman considered the vector optimization problem
de…ned by
8 9
< ols 0 ols Xp =
c = arg min Y Xdiag (c) b Y Xdiag (c) b + cj
c2<p with cj 0; j=1;:::;p : ;
j=1

and the corresponding …tted coe¢ cient vector


0 bols 1
c 1 1
b nng ols B .. C
= diag (c ) b =@ . A
c p bpols
HTF provide explicit formulas for …tted coe¢ cients for the special case of
X with orthonormal columns. (The table below is mostly their Table 4.3.)

Method of Fitting Fitted Coe¢ cient for xj


OLS bols
j h i
Best Subset (of Size M ) bols I rank bols M
j j

bols 1
Ridge Regression j
1+
Lasso and (1 + 2)
b enet sign bjols bols
1; 2 j
2 +
sign bjols bols
1 1
Elastic Net 1+ j
2
0 1 2 +

Nonnegative Garotte bols B


@1
C
j 2A
2 bjols
+

74
These formulas show that best subset regression provides a kind of "hard thresh-
olding" of the least squares coe¢ cients (setting all but the M largest to 0) and
ridge regression provides (the same) shrinking of all coe¢ cients toward 0. Both
the lasso and the nonnegative garotte provide a kind of "soft thresholding"
of the coe¢ cients (typically "zeroing out" some small ones). The elastic net
provides both the ridge type shrinkage of all the coe¢ cients and the lasso soft
thresholding. Note that in this "orthonormal columns" case, modi…cation of
the elastic net coe¢ cient vector as in formula (61) simply reduces it to a cor-
responding lasso coe¢ cient vector. (When the predictors are not orthogonal,
i.e. uncorrelated, one can expect the modi…ed elastic net to be something other
than a lasso.)
For comparison purposes, Figure 19 provides plots of the functions (in the
previous table) of OLS coe¢ cients giving ridge (blue), lasso (red), and nonneg-
ative garotte (green) coe¢ cients for the "orthonormal predictors" case. (Solid
lines are = 1 plots and dotted ones are for = 3.)

Figure 19: Plots of shrunken coe¢ cients for the "orthonomal inputs xj " case.
Ridge is (blue), lasso is (red), and nonnegative garotte is (green).

By now, a wide variety of lasso-like penalized least squares methods have


been suggested, tailored to a variety of special circumstances (and are discussed,
for example, by Hastie, Tibshirani and Wainwright). Notable are so-called
"group lasso," "sparse group lasso," and "fused lasso" methods. To give the
‡avor of what has been proposed, we’ll illustrate the (2-) group lasso. If for
some reason the coordinates of x 2 <p break naturally into 2 groups (say the
…rst l and last p l coordinates of x). For a > 0, a "group lasso" coe¢ cient
vector is of the form
8 0v v 19
< u l u X p =
group lasso uX u
b = arg min (Y X ) (Y X ) + @t
0 2 t 2A
j + j
2<p : ;
j=1 j=l+1

Of course, there can be more than 2 groups, and in the event that each group

75
is of size 1 this reduces to the simple lasso.
Looking at the geometry of the kind of constraint regions that are associated
with this methodology, it’s plausible (and correct) that it tends to "zero-out"
coe¢ cients in groups associated with the penalty. Figure 20 provides a rep-
resentation of a p = 3 constraint region associated with a grouped lasso where
coordinates 1 and 2 of x are grouped separate from coordinate 3. The corre-
sponding lasso region is shown for comparison purposes.

Figure 20: A p = 3 constraint region associated with a grouped lasso where


coordinates 1 and 2 of x are grouped separate from coordinate 3.

The development of the lasso and related predictors has been built on mini-
mization of a penalized version of the error sum of squares, N err for SEL. All of
the theory and representations here are special to this case. But as a practical
matter, as long as one has an e¤ective/appropriate optimization algorithm there
is nothing to prevent consideration of other losses. Possibilities include at least

1. using a negative Bernoulli loglikelihood as a loss and considering penalized


logistic regression (either as simply a means of …tting P [y = 1jx], or for
purposes of producing a good voting function for classi…cation), or
2. using a penalized exponential or hinge loss as in Section 1.5.3 for purposes
of producing a good voting function for classi…cation, or
3. using a penalized negative AUC loss for producing a good ordering func-
tion O.

The …rst of these is an option in the famous glmnet package in R.

3.1.3 Least Angle Regression (LAR)


Another class of shrinkage methods is de…ned algorithmically, rather than di-
rectly algebraically, or in terms of solutions to optimization problems. This
includes the LAR (least angle regression). A description of the whole set of

76
LAR
LAR regression parameters b (for the case of X with each column cen-
tered and with norm 1 and centered Y ) follows. (This is some kind of
amalgam of the descriptions of Izenman, CFZ, and the presentation in the 2003
paper Least Angle Regression by Efron, Hastie, Johnstone, and Tibshirani.)
Note that for Y^ a vector of predictions, the vector

c^ X0 Y Y^

has elements that are proportional to the correlations between the columns of
X (the xj ) and the residual vector R = Y Y^ . We’ll let

C^ = max j^
cj j and sj = sign (^
cj )
j

Notice also that if X l is some matrix made up of l linearly independent columns


of X, then if b is an l-vector and W = X l b , b can be recovered from W as
1
X 0l X l X 0l W . (This latter means that if we de…ne a path for Y^ vectors in
C (X) and know for each Y^ which linearly independent set of columns of X
is used in its creation, we can recover the corresponding path b takes through
<p .)

1. Begin with Y^ 0 = 0; b 0 = 0;and R0 = Y Y^ = Y and …nd

j1 = arg max jhxj ; Y ij


j

(the index of the predictor xj most strongly correlated with y) and add
j1 to an (initially empty) "active set" of indices, A.
2. Move Y^ from Y^ 0 in the direction of the projection of Y onto the space
spanned by xj1 (namely hxj1 ; Y i xj1 ) until there is another index j2 6= j1
with D E D E
cj2 j = xj2 ; Y Y^ = xj1 ; Y Y^ = j^
j^ cj1 j

At that point, call the current vector of predictions Y^ 1 and the corre-
sponding current parameter vector b 1 and add index j2 to the active set
A. As it turns out, for
( ! !)+
C^0 c^0j C^0 + c^0j
1 = min ;
j6=j1 1 hxj ; xj1 i 1 + hxj ; xj1 i

(where the "+" indicates that only positive values are included in the
minimization) Y^ 1 = Y^ 0 + sj1 1 xj1 = sj1 1 xj1 and b 1 is a vector of all 0s
except for sj1 1 in the j1 position. Let R1 = Y Y^ 1 .

3. At stage l with A of size l; b l 1 (with only l 1 non-zero entries),


Y l 1 ; Rl 1 = Y Y l 1 ; and c^l 1 = X Rl 1 in hand, move from Y^ l 1 to-
^ ^ 0

ward the projection of Y onto the sub-space of <N spanned by fxj1 ; : : : ; xjl g.

77
This is (as it turns out) in the direction of a unit ul vector "making equal
angles less than 90 degrees with all xj with j 2 A" until there is an index
jl+1 2
= A with

j^
cl 1;j+1 j = j^
cl 1;j1 j ( = j^
cl 1;j2 j = = j^
cl 1;jl j )

At that point, with Y^ l the current vector of predictions, let b l (with only l
non-zero entries) be the corresponding coe¢ cient vector, take Rl = Y Y^ l
and c^l = X 0 Rl . It can be argued that with
( ! !)+
C^l 1 c^l 1;j C^l 1 + c^l 1;j
l = min ;
j 2A
= 1 hxj ; ul i 1 + hxj ; ul i

Y^ l = Y^ l 1 + sjl+1 l ul . Add the index jl+1 to the set of active indices,


A, and repeat.

This continues until there are r = rank (X) indices in A, and at that point Y^
ols ols
moves from Y^ r 1 to Y^ and b moves from b r 1 to b (the version of an
OLS coe¢ cient vector with non-zero elements only in positions with indices in
A). This de…nes a piecewise linear path for Y^ (and therefore b ) that could,
for example, be parameterized by Y^ or Y Y^ .
There are several issues raised by the description above. For one, the stan-
dard exposition of this method seems to be that the direction vector ul is pre-
scribed by letting W l = (sj1 xj1 ; : : : ; sjl xjl ) and taking

1 1
ul = 1
W l W 0l W l 1
W l W 0l W l 1

1 1
It’s clear that W 0l ul = W l W 0l W l 1 1, so that each of sj1 xj1 ; : : : ; sjl xj1
has the same inner product with ul . What is not immediately clear (but is
argued in Efron, Hastie, Johnstone, and Tibshirani) is why one knows that this
prescription agrees with a prescription of a unit vector giving the direction from
Y^ l 1 to the projection of Y onto the sub-space of <N spanned by fxj1 ; : : : ; xjl g,
namely (for P l the projection matrix onto that subspace)
1
P lY Y^ l 1
P lY Y^ l 1

Further, the arguments that establish that j^ cl 1;j1 j = j^


cl 1;j2 j = =
j^
cl 1;jl j are not so obvious, nor are those that show that l has the form in
3. above. And …nally, HTF actually state their LAR algorithm directly in
terms of a path for b , saying at stage l one moves from b l 1 in the direction
of a vector with all 0s except at those indices in A where there are the joint
least squares coe¢ cients based on the predictor columns fxj1 ; : : : ; xjl g. The

78
correspondence between the two points of view is probably correct, but is again
not absolutely obvious.
ols
At any rate, the LAR algorithm traces out a path in <p from 0 to b . One
might think of the point one has reached along that path (perhaps parameterized
by Y^ ) as being a complexity parameter governing how ‡exible a …t this
algorithm has allowed, and be in the business of choosing it (by cross-validation
or some other method) in exactly the same way one might, for example, choose
a ridge parameter.
What is not at all obvious but true, is that a very slight modi…cation of
this LAR algorithm produces the whole set of lasso coe¢ cients (60) as its path.
One simply needs to enforce the requirement that if a non-zero coe¢ cient hits
0, its index is removed from the active set and a new direction of movement
is set based on one less input
Pvariable. At any point along the modi…ed LAR
p
path, one can compute t = j=1 j j j, and think of the modi…ed-LAR path as
parameterized by t. (While it’s not completely obvious, this turns out to be
monotone non-decreasing in "progress along the path," or Y^ ).
A useful graphical representation of the lasso path is one in which all coe¢ -
cients ^tj
lasso
are plotted against t on the same set of axes. Something similar is
often done for the LAR coe¢ cients (where the plotting is against some measure
of progress along the path de…ned by the algorithm).

3.2 Two Methods With Derived Input Variables


Another possible approach to …nding an appropriate level of complexity in a
…tted linear prediction rule is to consider regression on some number M < p of
predictors derived from the original inputs xj . Two such methods are those of
Principal Components Regression and Partial Least Squares. Here we continue
to assume that the columns of X have been standardized and Y has
been centered.

3.2.1 Principal Components Regression


The idea here is to replace the p columns of predictors in X with the …rst few
(M ) principal components of X (from the singular value decomposition of X)

z j = Xv j = dj uj

Correspondingly, the vector of …tted predictions for the training data is

XM
p cr hY ; z j i
Yb = zj
j=1
hz j ; z j i
M
X
= hY ; uj i uj (63)
j=1

79
Comparing this to displays (42) and (56) we see that ridge regression shrinks
the coe¢ cients of the principal components uj according to their importance
in making up X, while principal components regression "zeros out" those least
important in making up X. Further, since the uj constitute an orthonormal
basis for C (X), for rank (X) = r,

2 M
X r
X 2
p cr ols
Yb hY ; uj i = Yb
2 2
= hY ; uj i (64)
j=1 j=1

p cr
Notice too, that Yb can be written in terms of the original inputs as
M
X
p cr 1
Yb = hY ; uj i Xv j
j=1
dj
0 1
X M
1
=X@ hY ; uj i v j A
j=1
dj
0 1
X M
1
=X@ 2 hY ; Xv j i v j
A
j=1
d j

so that
XM
b p cr = 1
hY ; Xv j i v j (65)
d2
j=1 j
ols p cr
and b is the M = r = rank (X) version of b . As the v j are orthonormal,
as in relationship (64) above

b p cr b ols

ols ols
and principal components regression shrinks both Yb toward 0 in <N and b
toward 0 in <p .

3.2.2 Partial Least Squares Regression


The shrinking methods mentioned thus far have taken no account of Y in de-
termining directions or amounts of shrinkage. Partial least squares speci…cally
employs Y . In what follows, we continue to suppose that the columns of X
have been standardized and that Y has been centered.
The logic of partial least squares is this. Suppose that
p
X
z1 = hY ; xj i xj
j=1

= XX 0 Y

80
It is possible to argue that for w1 = X 0 Y = X 0 Y ; Xw1 = z 1 = X 0 Y is a
linear combination of the columns of X maximizing

jhY ; Xwij

(which is essentially the absolute sample covariance between the variables y and
x0 w) subject to the constraint that kwk = 1.17 This follows because
2
hY ; Xwi = w0 X 0 Y Y 0 Xw

and a maximizer of this quadratic form subject to the constraint is the eigen-
vector of X 0 Y Y 0 X corresponding to its single non-zero eigenvalue. It’s then
easy to verify that w1 is such an eigenvector corresponding to the non-zero
eigenvalue Y 0 XX 0 Y .
Then de…ne X 1 by orthogonalizing the columns of X with respect to z 1 .
That is, de…ne the jth column of X 1 by

hxj ; z 1 i
x1j = xj z1
hz 1 ; z 1 i

and take
p
X
z2 = Y ; x1j x1j
j=1

= X 1 X 10 Y

For w2 = X 10 Y = X 10 Y ; X 1 w2 = z 2 = X 10 Y is the linear combination of


the columns of X 1 maximizing

Y ; X 1w

subject to the constraint that kwk = 1.


Then for l > 1, de…ne X l by orthogonalizing the columns of X l 1
with
respect to z l . That is, de…ne the jth column of X l by

xlj 1 ; z l
xlj = xlj 1
zl
hz l ; z l i

and let
p
X
z l+1 = Y ; xlj xlj
j=1

= X l X l0 Y
1 7 Note that upon replacing jhY ; Xwij with jhXw; Xwij one has the kind of optimization

problem solved by the …rst principal component of X.

81
Partial least squares regression uses the …rst M of these variables z j as input
variables.
The PLS predictors z j are orthogonal by construction. Using the …rst M
of these as regressors, one has the vector of …tted output values

XM
pls hY ; z j i
Yb = zj
j=1
hz j ; z j i

Since the PLS predictors are (albeit recursively-computed data-dependent) lin-


pls
ear combinations of columns of X, it is possible to …nd a p-vector b (namely M
1 pls
X 0X X 0 Yb ) such that
pls pls
Yb = X b M

and thus produce the corresponding linear prediction rule


pls
fb(x) = x0 b M (66)

It is tempting to think that in form (66), the number of components, M ,


should function as a complexity parameter. But then again there is the follow-
ing. When the xj are orthogonal, it’s fairly easy to see that z 1 is a multiple of
ols
Yb . That is, in this circumstance,

X0 X = N I

so that
ols 1 1 1
Yb = X X 0 X X0 Y = XX 0 Y = z 1
N N
i.e.
ols
z 1 = N Yb
so that
pls ols
Yb 1 = Yb
pls pls pls ols
and thus b 1 = b 2 = = b p = b . All steps of partial least squares
after the …rst are simply providing a basis for the orthogonal complement of the
ols
1-dimensional subspace of C (X) generated by Yb (without improving …tting
at all). That is, here changing M doesn’t change ‡exibility of the …t at all.
(Presumably, when the xj are nearly orthogonal, something similar happens.)
This observation about PLS in cases where predictors are orthogonal has
another related implication. That is that there will be no naive form for e¤ective
degrees of freedom for PLS. Since with z j the jth principal component of X
and, say,
Z M = (z 1 ; z 2 ; : : : ; z M )

82
we have
0 1 0
p cr
Yb = ZM ZM ZM ZM Y

principal components regression on M components has e¤ective degrees of free-


dom M . But the fact that the "Z M " matrix corresponding to PLS depends
upon Y makes PLS nonlinear in Y . And the "orthogonal X" argument shows
that a PLS predictor with M = 1 can have e¤ective degrees of freedom as large
as rank (X).

PLS, PCR, and OLS Partial least squares is a kind of compromise between
principal components regression and ordinary least squares. To see this, note
that maximizing
jhY ; Xwij
subject to the constraint that kwk = 1 is equivalent to maximizing the absolute
sample covariance between Y and Xw i.e.
sample standard sample standard sample correlation
deviation of y deviation of x0 w between y and x0 w
or equivalently
2
sample variance sample correlation
(67)
of x0 w between y and x0 w
subject to the constraint. Now if only the …rst term (the sample variance of
x0 w) were involved in product (67), a …rst principal component direction would
be an optimizing w1 , and z 1 = X 0 Y Xw1 a multiple of the …rst principal
component of X. On the other hand, if only the second term were involved,
^ols = ^ols would be an optimizing w1 , and z 1 = Y^ ols X 0 Y = ^ols a multi-
ple of the vector of ordinary least squares …tted values. The use of the product
of two terms can be expected to produce a compromise between these two.
Note further that this logic applied at later steps in the PLS algorithm then
produces for z l a compromise between a …rst principal component of X l 1
and a suitably constrained multiple of the vector of least squares …tted values
based on the matrix of inputs X l 1 . The matrices X l have columns that
are the projections of the corresponding columns of X onto the orthogonal
complement in C (X) of the span of fz 1 ; z 2 ; : : : ; z l g (i.e. are corresponding
columns of X minus their projections onto the span of fz 1 ; z 2 ; : : : ; z l g) and
C (X) C X 1 C X2 .

4 Linear SEL Prediction Using Basis Functions


A way of moving beyond prediction rules that are functions of a linear form in
x, i.e. depend upon x through x0 ^, is to consider some set of (basis)18 functions
1 8 The word "basis" is employed here to point to the notion of a "basis" in a linear space

of functions, whereby any function of interest can be represented (or practically speaking, at

83
fhm g and predictors of the form or depending upon the form
p
X
f^ (x) = ^m hm (x) = h (x)0 ^ (68)
m=1

0
for h (x) = (h1 (x) ; : : : ; hp (x)). (The general notation used in Section 1.4.5
was T (x) rather than h (x) being used here. The slight specialization here is to
the case where the components of the vector-valued h (x) are "basis" functions.)
We next consider some ‡exible methods employing this idea. Notice that
…tting of form (68) can be done using any of the methods just discussed based
on the N p matrix of inputs
0 0 1
h (x1 )
B h (x2 )0 C
B C
X = (hj (xi )) = B .. C
@ . A
0
h (xN )

(i indexing rows and j indexing columns).

4.1 p = 1 Wavelet Bases


Consider …rst the case of a one-dimensional input variable x, and in fact here
suppose that x takes values in [0; 1]. One might consider a set of basis function
for use in the form (68) that is big enough and rich enough to approximate
essentially any function on [0; 1]. In particular, various orthonormal bases for
the square integrable functions on this interval (the space of functions L2 [0; 1])
come to mind. One might, for example, consider using some number of functions
from the Fourier basis for L2 [0; 1]
np o1 np o1
2 sin (j2 x) [ 2 cos (j2 x) [ f1g
j=1 j=1

For example, using M N=2 sin-cos pairs and the constant, one could consider
the …tting the forms
M
X M
X
f (x) = 0 + 1m sin (m2 x) + 2m cos (m2 x) (69)
m=1 m=1

If one has training xi on an appropriate regular grid, the use of form (69) leads
to orthogonality in the N (2M + 1) matrix of values of the basis functions X
and simple/fast calculations.
least approximated) as a linear combination of the the "basis" elements. Periodic functions
of a single variable can be approximated by linear combinations of sine (basis) functions of
various frequencies. General di¤erentiable functions can be approximated by polynomials
(linear combinations of monomial basis functions). Etc.

Unless, however, one believes that $\mathrm{E}[y|x=u]$ is periodic in $u$, form (69) has its serious limitations. In particular, unless $M$ is very, very large, a trigonometric series like (69) will typically provide a poor approximation for a function that varies at different scales on different parts of $[0,1]$, and in any case, the coefficients necessary to provide such localized variation at different scales have no obvious simple interpretations/connections to the irregular pattern of variation being described. So-called "wavelet bases" are much more useful in providing parsimonious and interpretable approximations to such functions. The simplest wavelet basis for $L^2[0,1]$ is the Haar basis that we proceed to describe.
Define the so-called Haar "father" wavelet
$$\varphi(x) = I[0 < x \le 1]$$
and the so-called Haar "mother" wavelet
$$\psi(x) = \varphi(2x) - \varphi(2x-1) = I\left[0 < x \le \tfrac{1}{2}\right] - I\left[\tfrac{1}{2} < x \le 1\right]$$
Linear combinations of these functions provide all elements of $L^2[0,1]$ that are constant on $\left(0,\tfrac{1}{2}\right]$ and on $\left(\tfrac{1}{2},1\right]$. Write
$$\Psi_0 = \{\varphi, \psi\}$$
Next, define
$$\psi_{1,0}(x) = \sqrt{2}\left(I\left[0 < x \le \tfrac{1}{4}\right] - I\left[\tfrac{1}{4} < x \le \tfrac{1}{2}\right]\right) \quad\text{and}\quad \psi_{1,1}(x) = \sqrt{2}\left(I\left[\tfrac{1}{2} < x \le \tfrac{3}{4}\right] - I\left[\tfrac{3}{4} < x \le 1\right]\right)$$
and let
$$\Psi_1 = \{\psi_{1,0}, \psi_{1,1}\}$$
Using the set of functions $\Psi_0 \cup \Psi_1$ one can build (as linear combinations) all elements of $L^2[0,1]$ that are constant on $\left(0,\tfrac{1}{4}\right]$ and on $\left(\tfrac{1}{4},\tfrac{1}{2}\right]$ and on $\left(\tfrac{1}{2},\tfrac{3}{4}\right]$ and on $\left(\tfrac{3}{4},1\right]$.
The story then goes on as one should expect. One defines
$$\psi_{2,0}(x) = 2\left(I\left[0 < x \le \tfrac{1}{8}\right] - I\left[\tfrac{1}{8} < x \le \tfrac{1}{4}\right]\right), \quad \psi_{2,1}(x) = 2\left(I\left[\tfrac{1}{4} < x \le \tfrac{3}{8}\right] - I\left[\tfrac{3}{8} < x \le \tfrac{1}{2}\right]\right),$$
$$\psi_{2,2}(x) = 2\left(I\left[\tfrac{1}{2} < x \le \tfrac{5}{8}\right] - I\left[\tfrac{5}{8} < x \le \tfrac{3}{4}\right]\right), \quad\text{and}\quad \psi_{2,3}(x) = 2\left(I\left[\tfrac{3}{4} < x \le \tfrac{7}{8}\right] - I\left[\tfrac{7}{8} < x \le 1\right]\right)$$
and lets
$$\Psi_2 = \{\psi_{2,0}, \psi_{2,1}, \psi_{2,2}, \psi_{2,3}\}$$
Figure 21 shows the sets of basis functions $\Psi_0$, $\Psi_1$, and $\Psi_2$.

Figure 21: Sets of Haar basis functions $\Psi_0$ (blue), $\Psi_1$ (red), and $\Psi_2$ (green).

In general,
$$\psi_{m,j}(x) = 2^{m/2}\,\psi\!\left(2^m\left(x - \frac{j}{2^m}\right)\right) \quad\text{for } j = 0, 1, 2, \ldots, 2^m - 1$$
and
$$\Psi_m = \{\psi_{m,0}, \psi_{m,1}, \ldots, \psi_{m,2^m-1}\}$$
The Haar basis of $L^2[0,1]$ is then
$$\bigcup_{m=0}^{\infty}\Psi_m$$
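As a small illustration of these definitions, the following sketch (Python with NumPy, my own code rather than anything from these notes) evaluates the Haar functions just described and numerically checks their orthonormality on a fine grid.

```python
import numpy as np

def father(x):
    # Haar father wavelet: indicator of (0, 1]
    return ((x > 0) & (x <= 1)).astype(float)

def mother(x):
    # Haar mother wavelet: psi(x) = phi(2x) - phi(2x - 1)
    return father(2 * x) - father(2 * x - 1)

def haar(m, j, x):
    # psi_{m,j}(x) = 2^(m/2) * psi(2^m x - j) for j = 0, ..., 2^m - 1
    return 2 ** (m / 2) * mother(2 ** m * x - j)

# crude numerical check of orthonormality using a fine grid on (0, 1]
x = np.arange(1, 2 ** 12 + 1) / 2 ** 12
funcs = [father(x)] + [haar(m, j, x) for m in range(3) for j in range(2 ** m)]
G = np.array([[np.mean(f * g) for g in funcs] for f in funcs])  # approximates the integrals
print(np.round(G, 3))  # approximately the identity matrix
```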
Then, one might entertain use of the Haar basis functions through order $M$ in constructing a form
$$f(x) = \beta_0 + \sum_{m=0}^{M}\sum_{j=0}^{2^m-1}\beta_{mj}\,\psi_{m,j}(x) \qquad (70)$$
(with the understanding that $\psi_{0,0} = \psi$), a form that in general allows building of functions that are constant on consecutive intervals of length $1/2^{M+1}$. This form can be fit by any of the various regression methods (especially ones involving thresholding/selection, as a typically very large number, $2^{M+1}$, of basis functions is employed in form (70)). (See HTF Section 5.9.2 for some discussion of using the lasso with wavelets.) Large absolute values of coefficients $\beta_{mj}$ encode (in the value of the index $m$) the scales at which important variation occurs and (in the value $j/2^m$) the locations in $[0,1]$ where that variation occurs. Where (perhaps after model selection/thresholding) only a relatively few fitted coefficients are important, the corresponding scales and locations provide an informative and compact summary of the fit. A nice visual summary of the results of the fit can be made by plotting for each $m$ (plots arranged vertically, from $M$ through $0$, aligned and to the same scale) spikes of length $|\beta_{mj}|$ pointed in the direction of $\mathrm{sign}(\beta_{mj})$ along an "$x$" axis at positions (say) $(j/2^m) + 1/2^{m+1}$.
In special situations where $N = 2^K$ and
$$x_i = \frac{i}{2^K} \quad\text{for } i = 1, 2, \ldots, 2^K$$
and one uses the Haar basis functions through order $K-1$, the fitting of form (70) is computationally clean, since the vectors
$$\begin{pmatrix}\psi_{m,j}(x_1) \\ \vdots \\ \psi_{m,j}(x_N)\end{pmatrix}$$
(together with the column vector of 1s) are orthogonal. (So, upon proper normalization, i.e. division by $\sqrt{N} = 2^{K/2}$, they form an orthonormal basis for $\Re^N$.)
The Haar wavelet basis functions are easy to describe and understand. But
they are discontinuous, and from some points of view that is unappealing. Other
sets of wavelet basis functions have been developed that are smooth. The
construction begins with a smooth "mother wavelet" in place of the step function
used above. HTF make some discussion of the smooth "symmlet" wavelet basis
at the end of their Chapter 5.

4.2 p = 1 Piecewise Polynomials and Regression Splines


Continue consideration of the case of a one-dimensional input variable $x$, and now $K$ "knots"
$$\xi_1 < \xi_2 < \cdots < \xi_K$$
and forms for $f(x)$ that are

1. polynomials of order $M$ (or less) on all intervals $(\xi_{j-1}, \xi_j)$, and (potentially, at least)
2. have derivatives of some specified order at the knots, and (potentially, at least)
3. are linear outside $(\xi_1, \xi_K)$.

If we let $I_1(x) = I[x < \xi_1]$, for $j = 2, \ldots, K$ let $I_j(x) = I[\xi_{j-1} \le x < \xi_j]$, and define $I_{K+1}(x) = I[\xi_K \le x]$, one can have 1. in the list above using basis functions
$$\begin{array}{c}
I_1(x),\ I_2(x),\ \ldots,\ I_{K+1}(x) \\
xI_1(x),\ xI_2(x),\ \ldots,\ xI_{K+1}(x) \\
x^2I_1(x),\ x^2I_2(x),\ \ldots,\ x^2I_{K+1}(x) \\
\vdots \\
x^MI_1(x),\ x^MI_2(x),\ \ldots,\ x^MI_{K+1}(x)
\end{array}$$
Further, one can enforce continuity and differentiability (at the knots) conditions on a form $f(x) = \sum_{m=1}^{(M+1)(K+1)}\beta_m h_m(x)$ by enforcing some linear relations between appropriate ones of the $\beta_m$. While this is conceptually simple, it is messy. It is much cleaner to simply begin with a set of basis functions that are tailored to have the desired continuity/differentiability properties.
A set of $M + 1 + K$ basis functions for piecewise polynomials of degree $M$ with derivatives of order $M-1$ at all knots is easily seen to be
$$1,\ x,\ x^2,\ \ldots,\ x^M,\ (x-\xi_1)_+^M,\ (x-\xi_2)_+^M,\ \ldots,\ (x-\xi_K)_+^M$$
(since the value and first $M-1$ derivatives of $(x-\xi_j)_+^M$ at $\xi_j$ are all 0). The choice of $M = 3$ is fairly standard.
Since extrapolation with polynomials typically gets worse with order, it is common to impose a restriction that outside $(\xi_1, \xi_K)$ a form $f(x)$ be linear. For the case of $M = 3$ this can be accomplished by beginning with basis functions $1, x, (x-\xi_1)_+^3, (x-\xi_2)_+^3, \ldots, (x-\xi_K)_+^3$ and imposing restrictions necessary to force 2nd and 3rd derivatives to the right of $\xi_K$ to be 0. Notice that (considering $x > \xi_K$)
$$\frac{d^2}{dx^2}\left(\beta_0 + \beta_1 x + \sum_{j=1}^{K}\beta_j(x-\xi_j)_+^3\right) = 6\sum_{j=1}^{K}\beta_j(x-\xi_j) \qquad (71)$$
and
$$\frac{d^3}{dx^3}\left(\beta_0 + \beta_1 x + \sum_{j=1}^{K}\beta_j(x-\xi_j)_+^3\right) = 6\sum_{j=1}^{K}\beta_j \qquad (72)$$
88
PK
So, linearity for large x requires (from equation (72)) that j=1 j = 0. Fur-
ther, substituting this into relationship (71) means that linearity also requires
PK PK 1
that j=1 j j = 0. Using the …rst of these to conclude that K = j=1 j
and substituting into the second yields
K
X2 K j
K 1 = j
j=1 K K 1

and then
K
X2 K
X2
K j
K = j j
j=1 K K 1 j=1

These then suggest the set of basis functions consisting of 1; x and for j =
1; 2; : : : ; K 2

3 K j 3 K j 3 3
(x j )+ (x K 1 )+ + (x K )+ (x K )+
K K 1 K K 1
(73)
3 K j 3 K 1 j 3
= (x j )+ (x K 1 )+ + (x K )+
K K 1 K K 1

(These are essentially the basis functions that HTF call their Nj .) Their use
produces so-called "natural" (linear outside ( 1 ; K )) cubic regression splines.
There are other (harder to motivate, but in the end more pleasing and
computationally more attractive) sets of basis functions for natural polynomial
splines. See the B-spline material at the end of HTF Chapter 5.
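As a concrete illustration of display (73), here is a minimal sketch (Python/NumPy; the function names are mine and purely illustrative) that builds the natural cubic regression spline design matrix consisting of $1$, $x$, and the $K-2$ functions of (73), and then fits by ordinary least squares.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Columns: 1, x, and the K-2 basis functions of display (73) for the given knots."""
    x = np.asarray(x, dtype=float)
    xi = np.sort(np.asarray(knots, dtype=float))
    K = len(xi)
    pos3 = lambda u: np.maximum(u, 0.0) ** 3  # (u)_+^3
    cols = [np.ones_like(x), x]
    for j in range(K - 2):
        r = (xi[K - 1] - xi[j]) / (xi[K - 1] - xi[K - 2])
        s = (xi[K - 2] - xi[j]) / (xi[K - 1] - xi[K - 2])
        cols.append(pos3(x - xi[j]) - r * pos3(x - xi[K - 2]) + s * pos3(x - xi[K - 1]))
    return np.column_stack(cols)

# example: fit a natural cubic regression spline by OLS
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
H = natural_cubic_basis(x, knots=np.quantile(x, np.linspace(0.05, 0.95, 6)))
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
yhat = H @ beta_hat
```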

4.3 Basis Functions and p-Dimensional Inputs


4.3.1 Multi-Dimensional Regression Splines (Tensor Product Bases)
If $p = 2$ and the vector of inputs, $x$, takes values in $\Re^2$, one might proceed as follows. If $\{h_{11}, h_{12}, \ldots, h_{1M_1}\}$ is a set of spline basis functions based on $x_1$ and $\{h_{21}, h_{22}, \ldots, h_{2M_2}\}$ is a set of spline basis functions based on $x_2$, one might consider the set of $M_1M_2$ basis functions based on $x$ defined by
$$g_{jk}(x) = h_{1j}(x_1)\,h_{2k}(x_2)$$
and corresponding forms for regression splines
$$f(x) = \sum_{j,k}\beta_{jk}\,g_{jk}(x) \qquad (74)$$
The biggest problem with this potential method is the explosion in the size of a tensor product basis as $p$ increases. For example, using $K$ knots for cubic regression splines in each of $p$ dimensions produces $(4+K)^p$ basis functions for the $p$-dimensional problem. Some kind of forward selection algorithm or shrinking of coefficients will be needed to produce any kind of workable fit with such large numbers of basis functions. For example, the multivariate smoothing routines provided in the mgcv R package of Wood allow for quadratically penalized (ridge regression type) fitting of forms like (74). The following discussion of "MARS" concerns one kind of forward selection algorithm using (data-dependent) linear regression spline basis functions and products of them for building predictors.

4.3.2 MARS (Multivariate Adaptive Regression Splines)


This is a high-dimensional regression methodology based on use of data-dependent "hockey-stick" or "hinge" functions (the kind of functions leading to piecewise linear regression splines when $p = 1$) and their products as (data-dependent) "basis functions." That is, with input space $\Re^p$, consider defining data-dependent features$^{19}$ built on the $N\cdot p$ pairs of functions
$$h_{ij1}(x) = (x_j - x_{ij})_+ \quad\text{and}\quad h_{ij2}(x) = (x_{ij} - x_j)_+ \qquad (75)$$
($x_{ij}$ is the $j$th coordinate of the $i$th input training vector, and both $h_{ij1}(x)$ and $h_{ij2}(x)$ depend on $x$ only through the $j$th coordinate of $x$) portrayed in Figure 22.

$^{19}$ Notice that in the framework of Section 1.4.5 these functions of the input $x$ are of the form $T(\mathbf{T}; x)$, NOT simply of the form $T(x)$.

Figure 22: Pair of hinge functions.

MARS builds predictors sequentially, making use of these "re‡ected pairs"


of hinge functions and their products. One version (described in HTF) proceeds
roughly as follows.

1. Identify a pair (75) so that
$$\beta_0 + \beta_{11}h_{ij1}(x) + \beta_{12}h_{ij2}(x)$$
has the best SSE possible. Call the selected functions
$$g_{11} = h_{ij1} \quad\text{and}\quad g_{12} = h_{ij2}$$
and set
$$\hat{f}_1(x) = \hat\beta_0 + \hat\beta_{11}g_{11}(x) + \hat\beta_{12}g_{12}(x)$$

2. At stage $l$ of the predictor-building process, with predictor
$$\hat{f}_{l-1}(x) = \hat\beta_0 + \sum_{m=1}^{l-1}\left(\hat\beta_{m1}g_{m1}(x) + \hat\beta_{m2}g_{m2}(x)\right)$$
in hand, consider for addition to the model pairs of functions that are either of the basic form (75) or of the form
$$h_{ij1}(x)\,g_{m1}(x) \quad\text{and}\quad h_{ij2}(x)\,g_{m1}(x)$$
or of the form
$$h_{ij1}(x)\,g_{m2}(x) \quad\text{and}\quad h_{ij2}(x)\,g_{m2}(x)$$
for some $m < l$, subject to the constraint that no $x_j$ appears in any candidate product more than once (maintaining the piecewise linearity of sections of the predictor). Additionally, one may decide to put an upper limit on the order of the products considered for inclusion in the predictor. The best candidate pair in terms of reducing SSE gets called, say, $g_{l1}$ and $g_{l2}$ and one sets
$$\hat{f}_l(x) = \hat\beta_0 + \sum_{m=1}^{l}\left(\hat\beta_{m1}g_{m1}(x) + \hat\beta_{m2}g_{m2}(x)\right)$$

One might pick the complexity parameter $l$ by cross-validation, but the standard implementation of MARS apparently uses instead a kind of generalized cross-validation error
$$GCV(l) = \frac{\sum_{i=1}^{N}\left(y_i - \hat{f}_l(x_i)\right)^2}{\left(1 - \dfrac{M(l)}{N}\right)^2}$$
where $M(l)$ is some kind of degrees-of-freedom figure. One must take account of both the fitting of the coefficients $\beta$ in this and the fact that knots (values $x_{ij}$) have been chosen. The HTF recommendation is to use
$$M(l) = 2l + (2\text{ or }3)\cdot(\text{the number of different knots chosen})$$
(where presumably the knot count refers to different $x_{ij}$ appearing in at least one $g_{m1}(x)$ or $g_{m2}(x)$).
Other versions of "MARS" algorithms potentially remove the constraint that no $x_j$ appear in any candidate product more than once (eliminating the piecewise linearity of sections of the predictor), consider not pairs but single hinge functions at each stage of feature addition, and/or follow a forward-selection search for features with a backwards-elimination phase (these guided by significant "change in SSE" or "F/t test" criteria). All of these variants amount to the "special sauce" of a particular MARS implementation set by its designer/programmer. Particular implementations have user-selectable parameters like the maximum number of terms in a forward selection phase, the maximum order of (pure and mixed) terms considered, the "significance level" used for guiding forward and backward phases of selection of "features," etc. In practical application, one should select these parameters via cross-validation, more or less thinking of whatever choices the developer has made in his or her implementation as simply defining some fitting/predictor-building "black box." A routine like the train() function in caret is invaluable in making these choices.
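To make step 1 of the forward pass concrete, here is a minimal sketch (Python/NumPy; my own illustrative code, not any particular MARS implementation) of the search over reflected pairs (75) for the single pair minimizing SSE.

```python
import numpy as np

def best_first_pair(X, y):
    """Scan all (i, j) reflected hinge pairs (75); return the pair minimizing the SSE
    of an OLS fit of y on (1, (x_j - x_ij)_+, (x_ij - x_j)_+)."""
    N, p = X.shape
    best = (np.inf, None)
    for j in range(p):
        for i in range(N):
            knot = X[i, j]
            h1 = np.maximum(X[:, j] - knot, 0.0)
            h2 = np.maximum(knot - X[:, j], 0.0)
            B = np.column_stack([np.ones(N), h1, h2])
            coef, *_ = np.linalg.lstsq(B, y, rcond=None)
            sse = np.sum((y - B @ coef) ** 2)
            if sse < best[0]:
                best = (sse, (i, j, knot, coef))
    return best

# toy use
rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 3))
y = 2 * np.maximum(X[:, 0] - 0.5, 0) + rng.normal(0, 0.1, 100)
sse, (i, j, knot, coef) = best_first_pair(X, y)
```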
Figure 23 portrays a simple predictor (of home sales price) of the kind that
a MARS algorithm can produce.

Figure 23: An example of the kind of prediction surface that can be generated
by a MARS algorithm. "Price" varies with two predictors.

5 Smoothing Splines and SEL Prediction


5.1 p = 1 Smoothing Splines
A way of avoiding the direct selection of knots for a regression spline is to instead, for a smoothing parameter $\lambda > 0$, consider the problem of finding (for $a \le \min\{x_i\}$ and $\max\{x_i\} \le b$)
$$\hat{f}_\lambda = \operatorname*{arg\,min}_{\text{functions } h \text{ with 2 derivatives}}\left(\sum_{i=1}^{N}\left(y_i - h(x_i)\right)^2 + \lambda\int_a^b\left(h''(x)\right)^2dx\right)$$
Amazingly enough, this optimization problem has a solution that can be fairly simply described. $\hat{f}_\lambda$ is a natural cubic spline with knots at the distinct values $x_i$ in the training set. That is, for a set of (now data-dependent, as the knots come from the training data) basis functions for such splines
$$h_1, h_2, \ldots, h_N$$

(here we're tacitly assuming that the $N$ values of the input variable in the training set are all different)
$$\hat{f}_\lambda(x) = \sum_{j=1}^{N}\hat\beta_j h_j(x) \qquad (76)$$
where the $\hat\beta_j$ are yet to be identified.
So consider the function
$$g(x) = \sum_{j=1}^{N}\beta_j h_j(x) \qquad (77)$$
This has second derivative
$$g''(x) = \sum_{j=1}^{N}\beta_j h_j''(x)$$
and so
$$\left(g''(x)\right)^2 = \sum_{j=1}^{N}\sum_{l=1}^{N}\beta_j\beta_l\,h_j''(x)\,h_l''(x)$$
Then, for $\beta = (\beta_1, \beta_2, \ldots, \beta_N)'$ and$^{20}$
$$\Omega = \left(\int_a^b h_j''(t)\,h_l''(t)\,dt\right)_{N\times N}$$
it is the case that
$$\int_a^b\left(g''(x)\right)^2dx = \beta'\Omega\beta$$
In fact, with the notation
$$H = (h_j(x_i))_{N\times N}$$
($i$ indexing rows and $j$ indexing columns) the criterion to be optimized in order to find $\hat{f}_\lambda$ can be written for functions of the form (77) as
$$(Y - H\beta)'(Y - H\beta) + \lambda\beta'\Omega\beta$$
and some vector calculus shows that the optimizing $\beta$ is
$$\hat\beta_\lambda = \left(H'H + \lambda\Omega\right)^{-1}H'Y \qquad (78)$$
which can be thought of as some kind of vector of generalized ridge regression coefficients. This form (78) together with representation (76) of course provides a smoothed prediction of $y$ for any input $x$.

$^{20}$ For the set of cubic spline basis functions (73) it is unpleasant but straightforward to find relatively simple formulas for the entries of $\Omega$. See the exercises for this section for details.
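Returning to display (78), here is a small numerical sketch of the penalized fit (Python/NumPy; it uses a modest fixed cubic spline basis rather than knots at every $x_i$, purely to keep the illustration small, and it approximates $\Omega$ numerically).

```python
import numpy as np

def basis(x, knots):
    # a cubic spline basis: 1, x, x^2, x^3, (x - xi)_+^3 (illustrative only)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(3 * np.pi * x) + rng.normal(0, 0.2, 100)
knots = np.linspace(0.1, 0.9, 8)
H = basis(x, knots)

# Omega_{jl} = integral of h_j'' h_l'' over [0,1], via finite-difference second
# derivatives on a fine grid and a Riemann sum
grid = np.linspace(0, 1, 2001)
dg = grid[1] - grid[0]
Hg = basis(grid, knots)
H2 = (Hg[2:] - 2 * Hg[1:-1] + Hg[:-2]) / dg**2   # second derivatives at interior grid points
Omega = H2.T @ H2 * dg

lam = 1e-4
beta_hat = np.linalg.solve(H.T @ H + lam * Omega, H.T @ y)   # display (78)
y_smooth = H @ beta_hat
S = H @ np.linalg.solve(H.T @ H + lam * Omega, H.T)          # smoother matrix
print("effective df:", np.trace(S))                          # df(lambda) = tr(S)
```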
Corresponding to coefficient vector (78) is a vector of smoothed output values
$$\hat{Y} = H\left(H'H + \lambda\Omega\right)^{-1}H'Y$$
and the matrix
$$S_\lambda \equiv H\left(H'H + \lambda\Omega\right)^{-1}H'$$
is called a smoother matrix.
Contrast this to a situation where some fairly small number, $p$, of fixed basis functions are employed in a regression context. That is, for basis functions $b_1, b_2, \ldots, b_p$ suppose
$$B = (b_j(x_i))_{N\times p}$$
Then OLS produces the vector of fitted values
$$\hat{Y} = B\left(B'B\right)^{-1}B'Y$$
and the projection matrix onto the column space of $B$, $C(B)$, is $P_B = B(B'B)^{-1}B'$. $S_\lambda$ and $P_B$ are both $N\times N$ symmetric non-negative definite matrices. While
$$P_BP_B = P_B$$
i.e. $P_B$ is idempotent,
$$S_\lambda S_\lambda \preceq S_\lambda$$
meaning that $S_\lambda - S_\lambda S_\lambda$ is non-negative definite. $P_B$ is of rank $p = \mathrm{tr}(P_B)$, while $S_\lambda$ is of rank $N$.
In a manner similar to what is done in ridge regression we might define an "effective degrees of freedom" for $S_\lambda$ (or for smoothing) as
$$df(\lambda) = \mathrm{tr}(S_\lambda) \qquad (79)$$
We proceed to develop motivation and a formula for this quantity and for $\hat{Y}$.
Notice that for
$$K = \left(H'\right)^{-1}\Omega H^{-1}$$
one has
$$\begin{aligned}
S_\lambda &= H\left(H'H + \lambda\Omega\right)^{-1}H' \\
&= H\left(H'\left(I + \lambda\left(H'\right)^{-1}\Omega H^{-1}\right)H\right)^{-1}H' \\
&= HH^{-1}\left(I + \lambda\left(H'\right)^{-1}\Omega H^{-1}\right)^{-1}\left(H'\right)^{-1}H' \\
&= \left(I + \lambda K\right)^{-1}
\end{aligned} \qquad (80)$$
This is the so-called Reinsch form for $S_\lambda$, from whence $S_\lambda^{-1} = I + \lambda K$.
Some vector calculus shows that $\hat{Y} = S_\lambda Y$ is a solution to the minimization problem
$$\operatorname*{minimize}_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'Kv \qquad (81)$$
so that this matrix $K$ can be thought of as defining a "penalty" in fitting a smoothed version of $Y$.
Then, since $S_\lambda$ is symmetric non-negative definite, it has an eigen decomposition as
$$S_\lambda = UDU' = \sum_{j=1}^{N}d_ju_ju_j' \qquad (82)$$
where columns of $U$ (the eigenvectors $u_j$) comprise an orthonormal basis for $\Re^N$ and
$$D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$$
for eigenvalues of $S_\lambda$
$$d_1 \ge d_2 \ge \cdots \ge d_N > 0$$
It turns out to be guaranteed that $d_1 = d_2 = 1$.
Consider how the eigenvalues and eigenvectors of $S_\lambda$ are related to those for $K$. An eigenvalue for $K$, say $\rho$, solves
$$\det(K - \rho I) = 0$$
Now
$$\det(K - \rho I) = \det\left[\frac{1}{\lambda}\left((I + \lambda K) - (1 + \lambda\rho)I\right)\right]$$
So $1 + \lambda\rho$ must be an eigenvalue of $I + \lambda K$ and $1/(1+\lambda\rho)$ must be an eigenvalue of $S_\lambda = (I + \lambda K)^{-1}$. So for some $j$ we must have
$$d_j = \frac{1}{1 + \lambda\rho}$$
and observing that $1/(1+\lambda\rho)$ is decreasing in $\rho$, we may conclude that
$$d_j = \frac{1}{1 + \lambda\rho_{N-j+1}} \qquad (83)$$
for
$$\rho_1 \ge \rho_2 \ge \cdots \ge \rho_{N-2} \ge \rho_{N-1} = \rho_N = 0$$
the eigenvalues of $K$ (which themselves do not depend upon $\lambda$). So, for example, in light of facts (79), (82), and (83), the smoothing effective degrees of freedom are
$$df(\lambda) = \mathrm{tr}(S_\lambda) = \sum_{j=1}^{N}d_j = 2 + \sum_{j=1}^{N-2}\frac{1}{1 + \lambda\rho_j}$$
which is clearly decreasing in $\lambda$ (with minimum value 2 in light of the fact that $S_\lambda$ has two eigenvalues that are 1).
Further, consider $u_j$, the eigenvector of $S_\lambda$ corresponding to eigenvalue $d_j$. $S_\lambda u_j = d_ju_j$ so that
$$u_j = S_\lambda^{-1}d_ju_j = (I + \lambda K)\,d_ju_j$$
so that
$$u_j = d_ju_j + \lambda d_jKu_j$$
and thus
$$Ku_j = \left(\frac{1 - d_j}{\lambda d_j}\right)u_j = \rho_{N-j+1}u_j$$
That is, $u_j$ is an eigenvector of $K$ corresponding to the $(N-j+1)$st largest eigenvalue. That is, for all $\lambda$ the eigenvectors of $S_\lambda$ are eigenvectors of $K$ and thus do not depend upon $\lambda$.
Then, for any $\lambda$
$$\begin{aligned}
\hat{Y} = S_\lambda Y &= \left(\sum_{j=1}^{N}d_ju_ju_j'\right)Y \\
&= \sum_{j=1}^{N}d_j\langle u_j, Y\rangle u_j \\
&= \langle u_1, Y\rangle u_1 + \langle u_2, Y\rangle u_2 + \sum_{j=3}^{N}\frac{\langle u_j, Y\rangle}{1 + \lambda\rho_{N-j+1}}u_j
\end{aligned} \qquad (84)$$
and we see that $\hat{Y}$ is a shrunken version of $Y$ (that progresses from $Y$ to the projection of $Y$ onto the span of $\{u_1, u_2\}$ as $\lambda$ runs from 0 to $\infty$)$^{21}$. The larger is $\lambda$, the more severe the shrinking overall. Further, the larger is $j$, the smaller is $d_j$ and the more severe is the shrinking in the $u_j$ direction. (The unpenalized directions $u_1$ and $u_2$ have no associated shrinking.) In the context of cubic smoothing splines, large $j$ correspond to "wiggly" (as functions of coordinate $i$ or value of the input $x_i$) $u_j$, and the prescription (84) calls for suppression of "wiggly" components of $Y$.

$^{21}$ It is possible to argue that the span of $\{u_1, u_2\}$ is the set of vectors of the form $c\mathbf{1} + dx$, as is consistent with the integral penalty in the original function optimization problem.

Further, since $\hat{Y} = H\hat\beta_\lambda$ and $H$ is nonsingular, as $\lambda$ runs from 0 to $\infty$, $\hat\beta_\lambda$ runs from $H^{-1}Y$ to $H^{-1}\left(\langle u_1, Y\rangle u_1 + \langle u_2, Y\rangle u_2\right)$. And there is "shrinking" enforced on $\hat\beta_\lambda$ in the sense that the quadratic form $\hat\beta_\lambda'\Omega\hat\beta_\lambda$ must be non-increasing in $\lambda$. (If not, the fact that $\|Y - \hat{Y}\|^2$ increases in $\lambda$ would produce a contradiction.)
Notice that large $j$ correspond to early/large eigenvalues of the penalty matrix $K$ in (81). Letting $u_j^* = u_{N-j+1}$ so that
$$U^* = (u_N, u_{N-1}, \ldots, u_1) = (u_1^*, u_2^*, \ldots, u_N^*)$$
the eigen decomposition of $K$ is
$$K = U^*\,\mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_N)\,U^{*\prime}$$
and criterion (81) can be written as
$$\operatorname*{minimize}_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'U^*\,\mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_N)\,U^{*\prime}v$$
or equivalently as
$$\operatorname*{minimize}_{v\in\Re^N}\ \left((Y - v)'(Y - v) + \lambda\sum_{j=1}^{N-2}\rho_j\left\langle u_j^*, v\right\rangle^2\right) \qquad (85)$$
(since $\rho_{N-1} = \rho_N = 0$) and we see that eigenvalues of $K$ function as penalty coefficients applied to the $N$ orthogonal components of $v = \sum_{j=1}^{N}\langle u_j^*, v\rangle u_j^*$ in the choice of optimizing $v$. From this point of view, the $u_j$ (or $u_j^*$) provide the natural alternative (to the columns of $H$) basis (for $\Re^N$) for representing or approximating $Y$, and the last equality in display (84) provides an explicit form for the optimizing smoothed vector $\hat{Y}$.
In this development, $K$ has had a specific meaning derived from the $H$ and $\Omega$ matrices connected specifically with smoothing splines and the particular values of $x$ in the training dataset. But in the end, an interesting possibility brought up by the whole development is that of forgetting the origins (from $K$) of the $\rho_j$ and $u_j$ and beginning with any interesting/intuitively appealing orthonormal basis $\{u_j\}$ and set of non-negative penalties $\{\rho_j\}$ for use in minimization (85). Working backwards through relationships (84) and (83) one is then led to the corresponding smoothed vector $\hat{Y}$ and smoothing matrix $S_\lambda$. (More detail on this matter is in Section 5.3.)
It is also worth remarking that since $\hat{Y} = S_\lambda Y$, the rows of $S_\lambda$ provide weights to be applied to the elements of $Y$ in order to produce predictions/smoothed values corresponding to $Y$. These can for each $i$ be thought of as defining a corresponding "equivalent kernel" (for an appropriate "kernel-weighted average" of the training output values as discussed in Section 6.1). (See Figure 5.8 of HTF2 in this regard.)

5.2 Multi-Dimensional Smoothing Splines


If $p = 2$ and the vector of inputs, $x$, takes values in $\Re^2$, one might propose to seek
$$\hat{f}_\lambda = \operatorname*{arg\,min}_{\text{functions } h \text{ with 2 derivatives}}\left(\sum_{i=1}^{N}\left(y_i - h(x_i)\right)^2 + \lambda J[h]\right)$$
for
$$J[h] \equiv \iint_{\Re^2}\left[\left(\frac{\partial^2h}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2h}{\partial x_1\partial x_2}\right)^2 + \left(\frac{\partial^2h}{\partial x_2^2}\right)^2\right]dx_1\,dx_2$$
An optimizing $\hat{f}_\lambda : \Re^2 \to \Re$ can be identified and is called a "thin plate spline." As $\lambda \to 0$, $\hat{f}_\lambda$ becomes an interpolator; as $\lambda \to \infty$ it defines the OLS plane through the data in 3-space. In general, it can be shown to be of the form
$$f(x) = \beta_0 + \beta'x + \sum_{i=1}^{N}\alpha_i g_i(x) \qquad (86)$$
where $g_i(x) = \eta(\|x - x_i\|)$ for $\eta(z) = z^2\ln z^2$. The $g_i(x)$ are "radial basis functions" (radially symmetric basis functions) and fitting is accomplished much as for the $p = 1$ case. The form (86) is plugged into the optimization criterion and a discrete penalized least squares problem emerges (after taking account of some linear constraints that are required to keep $J[f] < \infty$). HTF seem to indicate that in order to keep computations from exploding with $N$, it usually suffices to replace the $N$ functions $g_i(x)$ in form (86) with $K \ll N$ functions $\eta(\|x - x_i\|)$ for $K$ potential input vectors $x_i$ placed on a rectangular grid covering the convex hull of the $N$ training data input vectors.
For large $p$, one might simply declare that attention is going to be limited to predictors of some restricted form, and for $h$ in that restricted class, seek to optimize
$$\sum_{i=1}^{N}\left(y_i - h(x_i)\right)^2 + \lambda J[h]$$
for $J[h]$ some appropriate penalty on $h$ intended to regularize/restrict its wiggling. For example, one might assume that a form
$$g(x) = \sum_{j=1}^{p}g_j(x_j)$$
will be used and set
$$J[g] = \sum_{j=1}^{p}\int\left(g_j''(x)\right)^2dx$$
and be led to additive splines.
Or, one might assume that
$$g(x) = \sum_{j=1}^{p}g_j(x_j) + \sum_{j,k}g_{jk}(x_j, x_k) \qquad (87)$$
and invent an appropriate penalty function. It seems like a sum of 1-d smoothing spline penalties on the $g_j$ and 2-d thin plate spline penalties on the $g_{jk}$ is the most obvious starting point. Details of fitting are a bit murky (though I am sure that they can be found in books on generalized additive models). Presumably one cycles through the summands in display (87), iteratively fitting functions to sets of residuals defined by the original $y_i$ minus the sums of all other current versions of the components, until some convergence criterion is satisfied. Function (87) has a kind of "main effects plus 2-factor interactions" form, but it is (at least in theory) possible to also consider higher order terms in this kind of expansion.
5.3 An Abstraction of the Smoothing Spline Material and Penalized Fitting in $\Re^N$

In abstraction of the smoothing spline development, suppose that $\{u_j\}$ is a set of $M \le N$ orthonormal $N$-vectors, $\lambda \ge 0$, $\rho_j \ge 0$ for $j = 1, 2, \ldots, M$, and consider the optimization problem
$$\operatorname*{minimize}_{v\in\,\mathrm{span}\{u_j\}}\ \left((Y - v)'(Y - v) + \sum_{j=1}^{M}\lambda\rho_j\left\langle u_j, v\right\rangle^2\right)$$
For $v = \sum_{j=1}^{M}c_ju_j \in \mathrm{span}\{u_j\}$, the penalty is $\sum_{j=1}^{M}\lambda\rho_j\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\lambda\rho_jc_j^2$, and in this penalty, $\lambda\rho_j$ is a multiplier of the squared length of the component of $v$ in the direction of $u_j$. The optimization criterion is then (up to an additive constant not involving the $c_j$)
$$(Y - v)'(Y - v) + \sum_{j=1}^{M}\lambda\rho_j\langle u_j, v\rangle^2 = \sum_{j=1}^{M}\left(\langle u_j, Y\rangle - c_j\right)^2 + \sum_{j=1}^{M}\lambda\rho_jc_j^2$$
and it is then easy to see (via simple calculus) that
$$c_j^{\mathrm{opt}} = \frac{\langle u_j, Y\rangle}{1 + \lambda\rho_j}$$
i.e.
$$\hat{Y} = v^{\mathrm{opt}} = \sum_{j=1}^{M}\frac{\langle u_j, Y\rangle}{1 + \lambda\rho_j}u_j$$
From this it's clear how the penalty structure dictates optimally shrinking the components of the projection of $Y$ onto $\mathrm{span}\{u_j\}$.
It is further worth noting that for a given set of penalty coefficients, $\hat{Y}$ can be represented as $SY$ for
$$S = \sum_{j=1}^{M}d_ju_ju_j' = U\,\mathrm{diag}\left(\frac{1}{1+\lambda\rho_1}, \ldots, \frac{1}{1+\lambda\rho_M}\right)U'$$
for $U = (u_1, u_2, \ldots, u_M)$. Then it's easy to see that the smoother matrix $S$ is a rank $M$ matrix for which $\hat{Y} = SY$.
One context in which this material might find immediate application is where some set of basis functions $\{h_j\}$ are increasingly "wiggly" with increasing $j$ and the vectors $u_j$ come from applying the Gram-Schmidt process to the vectors
$$h_j = (h_j(x_1), \ldots, h_j(x_N))'$$
In this context, it would be very natural to penalize the later $u_j$ more severely than the early ones.
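A tiny numerical sketch of this abstraction (Python/NumPy; illustrative only) takes an arbitrary orthonormal set $\{u_j\}$ with penalties and forms $\hat{Y} = SY$ by shrinking the coefficients $\langle u_j, Y\rangle$ exactly as above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, lam = 50, 10, 2.0

# an arbitrary orthonormal set of M N-vectors (columns of U), via QR
U, _ = np.linalg.qr(rng.normal(size=(N, M)))
rho = np.arange(M, dtype=float) ** 2        # heavier penalties on "later" directions
Y = rng.normal(size=N)

c = U.T @ Y                                  # components <u_j, Y>
c_opt = c / (1.0 + lam * rho)                # shrunken coefficients
Y_hat = U @ c_opt                            # fitted/smoothed vector

# equivalently, the rank-M smoother matrix
S = U @ np.diag(1.0 / (1.0 + lam * rho)) @ U.T
assert np.allclose(Y_hat, S @ Y)
```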

5.4 Graph-Based Penalized Fitting/Smoothing (and Semi-Supervised Learning)

Another interesting smoothing methodology related to the material of the three previous sections concerns use of fitting penalties based on the graph Laplacians introduced in Section 2.4.3.$^{22}$ Consider then $N$ complete data cases $(x_1, y_1), \ldots, (x_N, y_N)$ and $M \ge 0$ additional data cases where only inputs $x_{N+1}, \ldots, x_{N+M}$ are available. There is no necessity here that $M > 0$, but it can be so in the event that predictions are desired at $x_{N+1}, \ldots, x_{N+M}$ whose values might not be in the training set. Where there are $M > 0$ genuine "unlabeled cases" whose inputs are assumed to come from the same mechanism as the inputs $x_1, \ldots, x_N$ and might be used to more or less "fill in" the relevant part of the input space not covered by the complete/labeled data cases, the terminology semi-supervised learning is sometimes used to describe the building of a predictor for $y$ at all $N+M$ input vectors. The case $M = 1$ might be used to simply make a single prediction at a single input not exactly seen in a "usual" training set of $N$ complete data pairs.

$^{22}$ The material here is adapted from "Graph-Based Semi-Supervised Learning with BIG Data" by Banergee, Culp, Ryan, and Michailidis, that appeared in Research on Applied Cybernetics and System Science in 2017.

Suppose that following the development of Section 2.4.3 one can make an adjacency matrix based on the $N+M$ input vectors,
$$S = (s_{ij})_{\substack{i=1,\ldots,N+M\\ j=1,\ldots,N+M}} = \begin{pmatrix} S_L & S_{LU} \\ S_{UL} & S_U \end{pmatrix}$$
(with $S_L$ of dimension $N\times N$, $S_{LU}$ of dimension $N\times M$, $S_{UL}$ of dimension $M\times N$, and $S_U$ of dimension $M\times M$) and corresponding Laplacian and symmetric normalized Laplacian, respectively
$$L = \begin{pmatrix} L_L & L_{LU} \\ L_{UL} & L_U \end{pmatrix} \quad\text{and}\quad L^* = \begin{pmatrix} L_L^* & L_{LU}^* \\ L_{UL}^* & L_U^* \end{pmatrix}$$
Then with
$$Y_{(N+M)\times 1} = \begin{pmatrix} Y_L \\ Y_U \end{pmatrix}$$
(for $Y_L$ of dimension $N\times 1$ and $Y_U$ of dimension $M\times 1$) what one might wish to do is produce a vector of smoothed/fitted values $\hat{Y}_{(N+M)\times 1}$ such that entries corresponding to input vectors with large adjacencies tend to be alike. This is possible in a way highly reminiscent of the material in Sections 5.1 and 5.3.
For $v \in \Re^{N+M}$ written as
$$v_{(N+M)\times 1} = \begin{pmatrix} v_L \\ v_U \end{pmatrix}$$
(for $v_L$ of dimension $N\times 1$ and $v_U$ of dimension $M\times 1$)

consider the optimization problem in $\Re^{N+M}$
$$\operatorname*{minimize}_{v\in\Re^{N+M}}\ (Y_L - v_L)'(Y_L - v_L) + \lambda v'Lv \qquad (88)$$
for some $\lambda > 0$ (or the same with $L^*$ replacing $L$ in the quadratic penalty term). The developments (52) and (53) of Section 2.4.3 show that upon expanding $v$ in terms of the $N+M$ (orthonormal) eigenvectors of $L$ (or $L^*$) it follows that components of $v$ that are multiples of late eigenvectors (ones with small eigenvalues)

1. have similar entries for cases with large adjacencies, and
2. are relatively lightly penalized in the minimization.

This strongly suggests that solutions to the optimization problem (88) will provide smoothed prediction vectors $\hat{Y}$ where entries with corresponding inputs with large adjacencies are similar.
Recent work of Culp and Ryan provides theory, methods, and software for solving the problem (88) and many nice generalizations of it (including consideration of losses other than SEL that produce methods for classification problems). For purposes of exposition here, we will provide the explicit solution that is available for the SEL problem. It turns out that the problem (88) and generalizations of it separate nicely into two parts. That is,
$$\hat{Y}_U^{\mathrm{opt}} = -L_U^{-1}L_{UL}\hat{Y}_L^{\mathrm{opt}} \qquad (89)$$
(or the same with $L^*$s replacing $L$s) where $\hat{Y}_L^{\mathrm{opt}} = v_L^{\mathrm{opt}}$ solving
$$\operatorname*{minimize}_{v_L\in\Re^N}\ (Y_L - v_L)'(Y_L - v_L) + \lambda v_L'\tilde{L}_Lv_L \qquad (90)$$
for $\tilde{L}_L = L_L - L_{LU}L_U^{-1}L_{UL}$ (or, again, the same with $L^*$s replacing $L$s). (Generalizations of the development here replace SSE in displays (88) and (90) with other losses, but the form (89) is unchanged.) But the problem (90) is familiar and its solution a simple consequence of vector calculus
$$\hat{Y}_L^{\mathrm{opt}} = \left(I + \lambda\tilde{L}_L\right)^{-1}Y_L$$
This is exactly parallel to the displays (80) and (81) and the discussion around them. $\left(I + \lambda\tilde{L}_L\right)^{-1}$ (and its starred version) is a smoother/shrinker matrix. Further, the matrix $-L_U^{-1}L_{UL}$ in display (89) and its starred version are stochastic matrices and entries of $\hat{Y}_U^{\mathrm{opt}}$ are averages of the elements of $\hat{Y}_L^{\mathrm{opt}}$.

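A compact numerical sketch of the SEL solution just displayed (Python/NumPy; my own illustration, not the Culp-Ryan software, and assuming the usual unnormalized Laplacian $L = D - S$) builds the Laplacian from a given adjacency matrix, solves (90) for the labeled cases, and then fills in the unlabeled cases via (89).

```python
import numpy as np

def graph_sel_fit(S, y_L, lam):
    """S: (N+M)x(N+M) symmetric adjacency matrix with labeled cases first.
       y_L: length-N vector of observed outputs.  Returns (yhat_L, yhat_U)."""
    N = len(y_L)
    L = np.diag(S.sum(axis=1)) - S                     # graph Laplacian
    L_L, L_LU = L[:N, :N], L[:N, N:]
    L_UL, L_U = L[N:, :N], L[N:, N:]
    L_tilde = L_L - L_LU @ np.linalg.solve(L_U, L_UL)  # Schur complement
    yhat_L = np.linalg.solve(np.eye(N) + lam * L_tilde, y_L)   # display (90)
    yhat_U = -np.linalg.solve(L_U, L_UL @ yhat_L)              # display (89)
    return yhat_L, yhat_U

# toy example: 5 labeled + 3 unlabeled cases with a random symmetric adjacency
rng = np.random.default_rng(4)
A = rng.uniform(size=(8, 8)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0)
yhat_L, yhat_U = graph_sel_fit(A, rng.normal(size=5), lam=1.0)
```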
6 Kernel and Local Regression Smoothing Methods and SEL Prediction

The central idea of this material is that when finding $\hat{f}(x_0)$ one might weight points in the training set according to how close they are to $x_0$, do some kind of fitting around $x_0$, and ultimately read off the value of the fit at $x_0$.

6.1 One-dimensional Kernel and Local Regression Smoothers

For the time being, suppose that $x$ takes values in $[0,1]$. Invent weighting schemes for points in the training set by defining a (usually, symmetric about 0) non-negative, real-valued function $D(t)$ that is non-increasing for $t \ge 0$ and non-decreasing for $t \le 0$. Often $D(t)$ is taken to have value 0 unless $|t| \le 1$. Then, a kernel function$^{23}$ is
$$K_\lambda(x, x_0) = D\left(\frac{x - x_0}{\lambda}\right) \qquad (91)$$
where $\lambda$ is a "bandwidth" parameter that controls the rate at which weights drop off as one moves away from $x_0$ (and indeed, in the case that $D(t) = 0$ for $|t| > 1$, how far one moves away from $x_0$ before no weight is assigned). Common choices for $D$ are

1. the Epanechnikov quadratic kernel, $D(t) = \frac{3}{4}\left(1 - t^2\right)I[|t| \le 1]$,
2. the "tri-cube" kernel, $D(t) = \left(1 - |t|^3\right)^3I[|t| \le 1]$, and
3. the standard normal density, $D(t) = \phi(t)$.

These three are pictured in Figure 24.
Using weights (91) to make a weighted average of training responses, one arrives at the Nadaraya-Watson kernel-weighted prediction at $x_0$
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N}K_\lambda(x_0, x_i)\,y_i}{\sum_{i=1}^{N}K_\lambda(x_0, x_i)} \qquad (92)$$
This typically smooths training outputs $y_i$ in a more pleasing way than does a $k$-nearest neighbor average, but it has obvious problems at the ends of the interval $[0,1]$ and at places in the interior of the interval where training data are dense to one side of $x_0$ and sparse to the other, if the target $\mathrm{E}[y|x=z]$ has non-zero derivative at $z = x_0$. For example, at $x_0 = 1$ only $x_i \le 1$ get weight, and if $\mathrm{E}[y|x=z]$ is decreasing at $z = x_0 = 1$, $\hat{f}(1)$ will be positively biased. That is, with usual symmetric kernels, predictor (92) will fail to adequately follow an obvious trend at 0 or 1 (or at any point between where there is a sharp change in the density of input values in the training set).

$^{23}$ This is again a potentially different usage of the word "kernel" than that in Section 1.4.3, and no non-negative definiteness of the function is needed or assumed.
Figure 24: Three standard choices of $D(t)$: Epanechnikov quadratic kernel (blue), tri-cube (black), and standard normal density (red).

A way to address this problem with the Nadaraya-Watson predictor is to replace the locally-fitted constant with a locally-fitted line. That is, at $x_0$ one might choose $\alpha(x_0)$ and $\beta(x_0)$ to solve the optimization problem
$$\operatorname*{minimize}_{\alpha\text{ and }\beta}\ \sum_{i=1}^{N}K_\lambda(x_0, x_i)\left(y_i - (\alpha + \beta x_i)\right)^2 \qquad (93)$$
and then employ the prediction
$$\hat{f}(x_0) = \alpha(x_0) + \beta(x_0)\,x_0 \qquad (94)$$
Now the weighted least squares problem (93) has an explicit solution. Let
$$B_{N\times 2} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}$$
and take
$$W(x_0)_{N\times N} = \mathrm{diag}\left(K_\lambda(x_0, x_1), \ldots, K_\lambda(x_0, x_N)\right)$$
then predictor (94) is
$$\hat{f}(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1}B'W(x_0)Y = l(x_0)'Y \qquad (95)$$
for the $1\times N$ vector $l'(x_0) = (1, x_0)\left(B'W(x_0)B\right)^{-1}B'W(x_0)$. It is thus obvious that locally weighted linear regression is (an albeit $x_0$-dependent) linear operation on the vector of outputs. The weights in $l'(x_0)$ combine the original kernel values and the least squares fitting operation to produce a kind of "equivalent kernel" (for a Nadaraya-Watson type weighted average).
Recall that for smoothing splines, smoothed values are
$$\hat{Y} = S_\lambda Y$$
where the parameter $\lambda$ is the penalty weight, and
$$df(\lambda) = \mathrm{tr}(S_\lambda)$$
We may do something parallel in the present context. We may take
$$L_\lambda{}_{N\times N} = \begin{pmatrix} l'(x_1) \\ l'(x_2) \\ \vdots \\ l'(x_N) \end{pmatrix}$$
where now the parameter $\lambda$ is the bandwidth, write
$$\hat{Y} = L_\lambda Y$$
and define
$$df(\lambda) = \mathrm{tr}(L_\lambda)$$
HTF suggest that matching degrees of freedom for a smoothing spline and a kernel smoother produces very similar equivalent kernels, smoothers, and predictions.
There is a famous theorem of Silverman that adds technical credence to this notion. Roughly, the theorem says that for large $N$, if in the case $p = 1$ the inputs $x_1, x_2, \ldots, x_N$ are iid with density $p(x)$ on $[a,b]$, $\lambda$ is neither too big nor too small,
$$D_S(u) = \frac{1}{2}\exp\left(-\frac{|u|}{\sqrt{2}}\right)\sin\left(\frac{|u|}{\sqrt{2}} + \frac{\pi}{4}\right)$$
$$\lambda(x) = \left(\frac{\lambda}{Np(x)}\right)^{1/4}$$
and
$$G_\lambda(z, x) = \frac{1}{\lambda(x)\,p(x)}D_S\left(\frac{z - x}{\lambda(x)}\right)$$
then for $x_i$ not too close to either $a$ or $b$,
$$\left(S_\lambda\right)_{ij} \approx \frac{1}{N}G_\lambda(x_i, x_j)$$
(in some appropriate probabilistic sense) and the smoother matrix for cubic spline smoothing has entries like those that would come from an appropriate kernel smoothing.
6.2 Local Regression Smoothing in p Dimensions

A direct generalization of 1-dimensional local regression smoothing to $p$ dimensions might go roughly as follows. For $D$ as before, and $x \in \Re^p$, one might set
$$K_\lambda(x_0, x) = D\left(\frac{\|x - x_0\|}{\lambda}\right) \qquad (96)$$
and fit linear forms locally by choosing $\alpha(x_0) \in \Re$ and $\beta(x_0) \in \Re^p$ to solve the optimization problem
$$\operatorname*{minimize}_{\alpha\text{ and }\beta}\ \sum_{i=1}^{N}K_\lambda(x_0, x_i)\left(y_i - \left(\alpha + \beta'x_i\right)\right)^2$$
and predicting as
$$\hat{f}(x_0) = \alpha(x_0) + \beta(x_0)'x_0$$
This seems typically to be done only after standardizing the coordinates of $x$ and can be effective as long as $N$ is not too small and $p$ is not more than 2 or 3. However, for $p > 3$ the curse of dimensionality comes into play and $N$ points usually just aren't dense enough in $p$-space to make direct use of kernel smoothing effective. If the method is going to be successful in $\Re^p$ it will need to be applied under appropriate structure assumptions.
One way to apply additional structure to the $p$-dimensional kernel smoothing problem is to essentially reduce input variable dimension by replacing the kernel (96) with the "structured kernel"
$$K_{\lambda, A}(x_0, x) = D\left(\frac{\sqrt{(x - x_0)'A(x - x_0)}}{\lambda}\right)$$
for an appropriate non-negative definite matrix $A$. For the eigen decomposition of $A$,
$$A = VDV'$$
write
$$(x - x_0)'A(x - x_0) = \left(D^{\frac{1}{2}}V'(x - x_0)\right)'\left(D^{\frac{1}{2}}V'(x - x_0)\right)$$
This amounts to using not $x$ and $\Re^p$ distance from $x$ to $x_0$ to define weights, but rather $D^{\frac{1}{2}}V'x$ and $\Re^p$ distance from $D^{\frac{1}{2}}V'x$ to $D^{\frac{1}{2}}V'x_0$. In the event that some entries of $D$ are 0 (or are nearly so), this basically reduces dimension from $p$ to the number of large eigenvalues of $A$ and defines weights in a space of that dimension (spanned by eigenvectors corresponding to non-zero eigenvalues) where the curse of dimensionality may not preclude effective use of kernel smoothing. The "trick" is, of course, identifying the right directions onto which to project. (Searching for such directions is part of the Friedman "projection pursuit" ideas discussed below.)
7 High-Dimensional Use of Low-Dimensional
Smoothers and SEL Prediction
There are several ways that have been suggested for making use of fairly low-dimensional (and thus, potentially effective) smoothing in large $p$ problems. One of them is the "structured kernels" idea just discussed. Two more follow.

7.1 Structured Regression Functions

7.1.1 Additive Models

A way to apply structure to the $p$-dimensional smoothing problem is through assumptions on the form of the predictor fit. For example, one might assume additivity in a form
$$f(x) = \alpha + \sum_{j=1}^{p}g_j(x_j) \qquad (97)$$
and try to do fitting of the $p$ functions $g_j$ and constant $\alpha$.
One more or less ad hoc method of fitting forms like form (97) is the so-called "back-fitting algorithm." That is, to (generalize form (97) slightly and) fit (under SEL)
$$f(x) = \alpha + \sum_{l=1}^{L}g_l\left(x^l\right) \qquad (98)$$
for $x^l$ some part of $x$, one might set $\hat\alpha = \frac{1}{N}\sum_{i=1}^{N}y_i$, and then cycle through $l = 1, 2, \ldots, L, 1, 2, \ldots$ (a small sketch of the cycle appears after the list below)

1. fitting via some appropriate (often linear) operation (e.g., spline or kernel smoothing)
$$g_l\left(x^l\right)\quad\text{to "data"}\quad\left(x_i^l, y_i^l\right)_{i=1,2,\ldots,N}$$
for
$$y_i^l = y_i - \left(\hat\alpha + \sum_{m\ne l}g_m\left(x_i^m\right)\right)$$
where the $g_m$ are the current versions of the fitted summands,

2. setting
$$g_l = \text{the newly fitted version} - \text{the sample mean of this newly fitted version across all } x_i^l$$
(in theory this is not necessary, but it is here to prevent numerical/round-off errors from causing the $g_m$ to drift up and down by additive constants summing to 0 across $m$),

3. iterating until convergence to, say, $\hat{f}(x)$.

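Here is a minimal sketch of that back-fitting cycle (Python/NumPy, my own illustration) for a purely additive model in which each $g_j$ is fit by a simple kernel smoother; any other linear smoother could be substituted in step 1.

```python
import numpy as np

def nw_smooth(x, r, lam=0.15):
    # Nadaraya-Watson smoothing of partial residuals r against a single input x
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / lam) ** 2)  # Gaussian kernel weights
    return (W @ r) / W.sum(axis=1)

def backfit(X, y, n_cycles=20, lam=0.15):
    N, p = X.shape
    alpha = y.mean()
    g = np.zeros((N, p))                       # current fitted g_j evaluated at the x_ij
    for _ in range(n_cycles):
        for j in range(p):
            resid = y - alpha - g.sum(axis=1) + g[:, j]   # the y_i^l of step 1
            gj = nw_smooth(X[:, j], resid, lam)
            g[:, j] = gj - gj.mean()                      # center, as in step 2
    return alpha, g

rng = np.random.default_rng(6)
X = rng.uniform(size=(300, 3))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(0, 0.2, 300)
alpha, g = backfit(X, y)
```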
A more principled SEL fitting methodology for additive forms like that in display (98) (e.g. implemented by Wood in his mgcv R package) is the simultaneous fitting of $\alpha$ and all the functions $g_l$ via penalized least squares. That is, using an appropriate set of basis functions for smooth functions of $x^l$ (often a tensor product basis in the event that the dimension of $x^l$ is more than 1), each $g_l$ might be represented as a linear combination of those basis functions. Then form (98) is in fact a constant plus a linear combination of basis functions. So upon adopting a quadratic penalty for the coefficients, one has a kind of ridge regression problem and explicit forms for all fitted coefficients and $\hat\alpha$. The practical details of making the various bases and picking ridge parameters, etc. are not trivial, but the basic idea is clear.
The simplest version of this line of development, based on form (97), might be termed fitting of a "main effects model." But the approach might as well be applied to fit a "main effects and two factor interactions model," using some $g_l$s that are functions of only one coordinate of $x$ and others that depend upon only two coordinates of the input vector. One may mix types of predictors (continuous, categorical) and types of functions of them in the additive form to produce all sorts of interesting models (including semi-parametric ones and ones with low order interactions).

7.1.2 Other Structured Regression Forms

Another possibility for introducing structure assumptions and making use of low-dimensional smoothing in a large $p$ situation is making strong global assumptions on the forms of the influence of some input variables on the output, but allowing parameters of those forms to vary in a flexible fashion with the values of some small number of coordinates of $x$. For the sake of example, suppose that $p = 4$. One might consider predictor forms
$$f(x) = \alpha(x_3, x_4) + \beta_1(x_3, x_4)\,x_1 + \beta_2(x_3, x_4)\,x_2$$
That is, one might assume that for fixed $(x_3, x_4)$, the form of the predictor is linear in $(x_1, x_2)$, but that the coefficients of that form may change in a flexible way with $(x_3, x_4)$. Fitting might then be approached by locally weighted least squares, with only $(x_3, x_4)$ involved in the setting of the weights. That is, one might for each $(x_{30}, x_{40})$ minimize over choices of $\alpha(x_{30}, x_{40})$, $\beta_1(x_{30}, x_{40})$, and $\beta_2(x_{30}, x_{40})$ the weighted sum of squares
$$\sum_{i=1}^{N}K_\lambda\left((x_{30}, x_{40}), (x_{3i}, x_{4i})\right)\left(y_i - \left(\alpha(x_{30}, x_{40}) + \beta_1(x_{30}, x_{40})\,x_{1i} + \beta_2(x_{30}, x_{40})\,x_{2i}\right)\right)^2$$
and then employ the predictor
$$\hat{f}(x_0) = \hat\alpha(x_{30}, x_{40}) + \hat\beta_1(x_{30}, x_{40})\,x_{10} + \hat\beta_2(x_{30}, x_{40})\,x_{20}$$
This kind of device keeps the dimension of the space where one is doing smoothing down to something manageable. But note that nothing here does any thresholding or automatic variable selection.
7.2 Projection Pursuit Regression

For $w_1, w_2, \ldots, w_M$ unit $p$-vectors of parameters, we might consider as predictors fitted versions of the form
$$f(x) = \sum_{m=1}^{M}g_m\left(w_m'x\right) \qquad (99)$$
This is an additive form in the derived variables $v_m = w_m'x$. The functions $g_m$ and the directions $w_m$ are to be fit from the training data. The $M = 1$ case of this form is the "single index model" of econometrics.
How does one fit a predictor of this form (99)? Consider first the $M = 1$ case. Given $w$, there are pairs $(v_i, y_i)$ for $v_i = w'x_i$ and a 1-dimensional smoothing method can be used to estimate $g$. On the other hand, given $g$, one might seek to optimize $w$ via an iterative search. A Gauss-Newton algorithm can be based on the first order Taylor approximation
$$g\left(w'x_i\right) \approx g\left(w_{\mathrm{old}}'x_i\right) + g'\left(w_{\mathrm{old}}'x_i\right)\left(w - w_{\mathrm{old}}\right)'x_i$$
so that
$$\sum_{i=1}^{N}\left(y_i - g\left(w'x_i\right)\right)^2 \approx \sum_{i=1}^{N}\left(g'\left(w_{\mathrm{old}}'x_i\right)\right)^2\left(\left(w_{\mathrm{old}}'x_i + \frac{y_i - g\left(w_{\mathrm{old}}'x_i\right)}{g'\left(w_{\mathrm{old}}'x_i\right)}\right) - w'x_i\right)^2$$
Then $w_{\mathrm{old}}$ may be updated to $w$ using the closed form for weighted (by $\left(g'\left(w_{\mathrm{old}}'x_i\right)\right)^2$) no-intercept regression of
$$w_{\mathrm{old}}'x_i + \frac{y_i - g\left(w_{\mathrm{old}}'x_i\right)}{g'\left(w_{\mathrm{old}}'x_i\right)}$$
on $x_i$. (Presumably one must normalize the updated $w$ in order to preserve the unit length property of the $w$ and maintain a stable scaling in the fitting.) The $g$ and $w$ steps are iterated until convergence. Note that in the case where cubic smoothing spline smoothing is used in projection pursuit, $g'$ will be evaluated as some explicit quadratic, and in the case of locally weighted linear smoothing, form (95) will need to be differentiated in order to evaluate the derivative $g'$.
When $M > 1$, terms $g_m\left(w_m'x\right)$ are added to a sum of such in a forward stage-wise fashion. HTF provide some discussion of details like readjusting previous $g$s (and perhaps $w$s) upon adding $g_m\left(w_m'x\right)$ to a fit, and the choice of $M$.
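A rough sketch of the $M = 1$ alternation just described (Python/NumPy; my own crude illustration, using a Gaussian-kernel smoother for $g$ and a numerical derivative for $g'$, not a careful implementation):

```python
import numpy as np

def fit_single_index(X, y, n_iter=10, lam=0.3):
    """Alternate smoothing of g and Gauss-Newton updates of the unit direction w."""
    N, p = X.shape
    w = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        v = X @ w
        W = np.exp(-0.5 * ((v[:, None] - v[None, :]) / lam) ** 2)  # kernel smoother for g
        g = (W @ y) / W.sum(axis=1)
        order = np.argsort(v)
        gprime = np.gradient(g[order], v[order])[np.argsort(order)]  # crude estimate of g'
        gprime = np.where(np.abs(gprime) < 1e-3, 1e-3, gprime)       # guard the division
        target = v + (y - g) / gprime
        weights = gprime ** 2
        # weighted no-intercept regression of the adjusted target on x_i, then renormalize
        A = (X * weights[:, None]).T @ X
        b = (X * weights[:, None]).T @ target
        w = np.linalg.solve(A, b)
        w = w / np.linalg.norm(w)
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
w_true = np.array([0.8, 0.6, 0.0, 0.0])
y = np.sin(X @ w_true) + rng.normal(0, 0.1, 300)
print(fit_single_index(X, y))   # with luck, roughly +/- w_true
```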

8 Highly Flexible Non-Linear Parametric Prediction Methods

8.1 Neural Network Regression

A multi-layer feed-forward neural network is a nonlinear map of $x \in \Re^p$ to one or more outputs through the use of non-linear functions of linear combinations of non-linear functions of linear combinations of ... non-linear functions of linear combinations of coordinates of $x$. Figure 25 is a network diagram representation of a toy single hidden layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs.$^{24}$ The constants $x_0 = 1$ and $z_0 = 1$ allow for "biases" (i.e. constant terms) in the linear combinations (technically making them "affine" transformations rather than linear ones). The $\alpha$ and $\beta$ parameters are sometimes called "weights."

Figure 25: A Network Diagram Representation of a Single Hidden Layer Feedforward Neural Net With 3 Inputs, 2 Hidden Nodes and 2 Outputs.

This diagram stands for a function of $x$ defined by setting
$$z_1 = \sigma\left(\alpha_{01}\cdot 1 + \alpha_{11}x_1 + \alpha_{21}x_2 + \alpha_{31}x_3\right)$$
$$z_2 = \sigma\left(\alpha_{02}\cdot 1 + \alpha_{12}x_1 + \alpha_{22}x_2 + \alpha_{32}x_3\right)$$
and then
$$y_1 = g_1\left(\beta_{01}\cdot 1 + \beta_{11}z_1 + \beta_{21}z_2,\ \beta_{02}\cdot 1 + \beta_{12}z_1 + \beta_{22}z_2\right)$$
$$y_2 = g_2\left(\beta_{01}\cdot 1 + \beta_{11}z_1 + \beta_{21}z_2,\ \beta_{02}\cdot 1 + \beta_{12}z_1 + \beta_{22}z_2\right)$$
In SEL/regression contexts, identity functions of a single one of the arguments are common and natural for the functions $g$.
Originally, the most common choice of functional form (the so-called "activation function") at hidden nodes was the (sigmoidal-shaped) logistic function$^{25}$
$$\sigma(u) = \frac{1}{1 + \exp(-u)}$$
or the (completely equivalent in this context$^{26}$) hyperbolic tangent function
$$\sigma(u) = \tanh(u) = \frac{\exp(u) - \exp(-u)}{\exp(u) + \exp(-u)}$$
These functions are differentiable at $u = 0$, so that for small $\alpha$s the functions of $x$ entering the $g$s in a single hidden layer network are nearly linear. For large $\alpha$s the functions are nearly step functions. In light of the latter, it is not surprising that there are universal approximation theorems that guarantee that any continuous function on a compact subset of $\Re^p$ can be approximated to any degree of fidelity with a single layer feed-forward neural net with enough nodes in the hidden layer. This is both a blessing and a curse. It promises that these forms are quite flexible. It also promises that there must be both over-fitting and identifiability issues inherent in their use (the latter in addition to the identifiability issues already inherent in the symmetric nature of the functional forms assumed for the predictors).
More recently, sigmoidal forms for the activation function have declined in popularity. Instead, the hinge or positive part function
$$\sigma(u) = \max(u, 0) = u_+$$
is often used. In common parlance, this choice makes the hidden nodes "rectified linear units" (ReLUs). Note that this choice makes functions of $x$ entering an output layer piecewise linear and continuous (not at all an unreasonable form).

8.2 Neural Network Classification

In $K$-class classification problems, it is typical to use $K$ output nodes and, for $w = (w_1, w_2, \ldots, w_K)$ the vector of linear combinations of outputs from the final hidden layer, compute the outputs not simply using a single entry of $w$ for each, but rather using all entries. That is, it is typical to set the $K$ outputs to be
$$g_k(w) = \frac{\exp(w_k)}{\sum_{l=1}^{K}\exp(w_l)} \qquad (100)$$
This vector function of (vector) $w$ is usually referred to as the "softmax" function, and produces a probability vector as output. Its entries serve as estimates of class probabilities for the given vector of inputs. The 0-1 loss classifier corresponding to this set of estimated class probabilities is then
$$\hat{f}(x) = \operatorname*{arg\,max}_k\ g_k$$
(where it is understood that the $k$th probability, $g_k$, depends upon the input $x$ through the neural net compositions of functions and the final use of the softmax function).

$^{24}$ Of course, much more complicated networks are possible, particularly ones with multiple hidden layers and many nodes on all layers.
$^{25}$ Other functions with similar shapes, like the inverse standard normal cdf, were also used.
$^{26}$ This is because $\tanh(u) = \dfrac{2}{1 + \exp(-2u)} - 1$.
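A minimal sketch (Python/NumPy; an illustration of the forms above, not a production network) of the forward computation for a single-hidden-layer classifier with ReLU hidden nodes and softmax outputs:

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def softmax(w):
    e = np.exp(w - w.max())                         # subtract the max for numerical stability
    return e / e.sum()

def forward(x, A, B):
    """x: (p,), A: (p+1, m) hidden-layer weights (first row the biases),
       B: (m+1, K) output-layer weights.  Returns K estimated class probabilities."""
    z = relu(np.concatenate(([1.0], x)) @ A)        # hidden-node values
    w = np.concatenate(([1.0], z)) @ B              # K linear combinations
    return softmax(w)

rng = np.random.default_rng(8)
p, m, K = 3, 2, 2                                    # matches the toy network of Figure 25
A = rng.normal(size=(p + 1, m))
B = rng.normal(size=(m + 1, K))
print(forward(rng.normal(size=p), A, B))            # a probability vector of length K
```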
8.3 Fitting Neural Networks

8.3.1 The Back-Propagation Algorithm

The most common fitting algorithm for neural networks is something called the "back-propagation algorithm" or the "delta rule." It is simply a gradient descent algorithm for the entire set of weights involved in making the outputs (in the simple case illustrated in Figure 25, the $\alpha$s and $\beta$s). Rather than labor through the nasty notational issues required to completely detail such an algorithm, we will here only lay out the heart of what is needed.
For a training set of size $N$, loss $L\left(\hat{f}(x_i), y_i\right)$ incurred for input case $i$ when the $K$ predictions $\hat{f}_k(x_i)$ are made (corresponding to the $K$ output nodes), and a sum of such losses to be minimized, if one can find the partial derivatives of the coordinates of $\hat{f}(x)$ with respect to the weights, the chain rule will give the partials of the total loss and allow iterative search in the direction of a negative gradient of the total loss. So we begin with a description of how to find partials for $\hat{f}_k(x)$, a coordinate of the fitted output vector.
Consider a neural network with $H$ layers of hidden nodes indexed by $h = 1, 2, \ldots, H$, beginning with the layer immediately before the output layer and proceeding (right to left in a diagram like Figure 25) to the one that is built from linear combinations of the coordinates of $x$. We'll use the notation $m_h$ for the number of nodes in layer $h$, including a node representing the "bias" input 1 (represented by $x_0 = 1$ and $z_0 = 1$ in Figure 25). For a real-valued activation function of a single real variable $\sigma$, define a vector-valued function $\sigma_m : \Re^m \to \Re^m$ by
$$\sigma_m(u_1, u_2, \ldots, u_m) = (\sigma(u_1), \sigma(u_2), \ldots, \sigma(u_m))$$
In what follows (for purposes of reducing notational clutter) we will abuse notation somewhat and not subscript $\sigma_m$, but rather write only $\sigma$, leaving it to the reader to recall that $\sigma$ outputs vectors of the same dimension as its argument. And it will be convenient to presume that both the input and output of such a $\sigma$ are row vectors.
Then for $A_H$ a $(p+1)\times(m_H - 1)$ matrix of (weight) parameters we can represent the relationship between the input $x$ and the vector of values (say $z_H$) in the last hidden layer by
$$z_H' = \left(1, \sigma\left((1, x')A_H\right)\right)$$
Next, for $A_{H-1}$ an $m_H\times(m_{H-1} - 1)$ matrix of parameters we may represent the relationship between the vectors of values in the last and next-to-last hidden layers by
$$z_{H-1}' = \left(1, \sigma\left(z_H'A_{H-1}\right)\right)$$
and so on, to the $h = 1$ case of ($A_h$ an $m_{h+1}\times(m_h - 1)$ matrix of parameters)
$$z_h' = \left(1, \sigma\left(z_{h+1}'A_h\right)\right) \qquad (101)$$
Then for $A_0$ an $m_1\times K$ matrix of parameters and $g_k$ a function of $K$ real variables, the $k$th coordinate of the output is
$$g_k\left(z_1'A_0\right) \qquad (102)$$
This series of relationships allows (via what is known as a "forward pass" through them) the computation of $z$s and predictions for a fixed set of coefficients collected in the $A$s and an input vector $x$. Then partial derivatives of the $k$th coordinate of the response (at that input and set of coefficients) can be found via the "backward pass" based on the $K$ partials of $g_k$, the derivative of $\sigma$, the recursions above, and the results of the forward pass.
For example, for $g_k^{(l)}$ the partial of the function $g_k$ with respect to its $l$th entry, the partial derivative of the $k$th coordinate of the prediction with respect to the $(i,j)$ entry of $A_0$ is, from relationship (102) and the chain rule,
$$g_k^{(j)}\left(z_1'A_0\right)z_{1i}$$
Further, since using form (102) and the $h = 1$ version of form (101) the $k$th coordinate of the prediction is
$$g_k\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)$$
writing $a_{ij}^1$ for the $(i,j)$ entry of $A_1$, the chain rule implies that (with $A_l^0$ the $l$th column of $A_0$) the partial derivative of the $k$th coordinate of the prediction with respect to $a_{ij}^1$ is
$$\begin{aligned}
\sum_{l=1}^{K}&g_k^{(l)}\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)\frac{\partial}{\partial a_{ij}^1}\left[\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right]_l \\
&= \sum_{l=1}^{K}g_k^{(l)}\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)\frac{\partial}{\partial a_{ij}^1}\left(1, \sigma\left(z_2'A_1\right)\right)A_l^0 \\
&= \sum_{l=1}^{K}g_k^{(l)}\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)\sum_{r=1}^{m_1}a_{rl}^0\frac{\partial}{\partial a_{ij}^1}\left[\left(1, \sigma\left(z_2'A_1\right)\right)\right]_r \\
&= \sum_{l=1}^{K}g_k^{(l)}\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)a_{jl}^0\frac{\partial}{\partial a_{ij}^1}\sigma\left(z_2'A_j^1\right) \\
&= \sum_{l=1}^{K}g_k^{(l)}\left(\left(1, \sigma\left(z_2'A_1\right)\right)A_0\right)\sigma'\left(z_2'A_j^1\right)a_{jl}^0\,z_{2i}
\end{aligned}$$
"and so on" for other $\hat{y}_k$s and $a_{ij}^h$s.
In general one is faced with the functional form for the $k$th coordinate of the output
$$g_k\left(\left(1, \sigma\left(\left(1, \sigma\left(\cdots\left(1, \sigma\left((1, x')A_H\right)\right)A_{H-1}\cdots A_2\right)\right)A_1\right)\right)A_0\right)$$
made by successive compositions using the activation function $\sigma$ and linear combinations with coefficients in the matrices $A_h$, from which partials $\dfrac{\partial\hat{y}_k}{\partial a_{ij}^h}$ are obtainable in the style above, by repeatedly using the chain rule. No doubt some appropriate use of vector calculus and corresponding notation could improve the looks of these expressions, and recursions can be developed, but what is needed should be clear. Further, in many contexts numerical approximation of these partials may be the most direct and efficient means of obtaining them.
Then for loss $L\left(\hat{f}, y\right)$ let
$$L_k\left(\hat{f}, y\right) = \frac{\partial}{\partial\hat{f}_k}L\left(\hat{f}, y\right)$$
For $a$ an element of one of the $A_h$ matrices, the partial derivative of the contribution of case $i$ to a total loss with respect to it is
$$\sum_{k=1}^{K}L_k\left(\hat{f}(x_i), y_i\right)\frac{\partial}{\partial a}g_k\left(z_1'(x_i)A_0\right)$$
(for $z_1(x_i)$ the set of values from the final hidden nodes and partials found as above) and the partial derivative of the total loss with respect to it is
$$D(a) = \sum_{i=1}^{N}\sum_{k=1}^{K}L_k\left(\hat{f}(x_i), y_i\right)\frac{\partial}{\partial a}g_k\left(z_1'(x_i)A_0\right)$$
The gradient of the total loss as a function of the matrices of weights then has entries $D(a)$, and an iterative search to optimize total loss with a current set of iterates $a^{\mathrm{current}}$ can produce new iterates
$$a^{\mathrm{new}} = a^{\mathrm{current}} - \eta\,D\left(a^{\mathrm{current}}\right) \qquad (103)$$
for some "learning rate" $\eta > 0$.
Of course, in SEL/univariate regression contexts, it is common to have $K = 1$ and take $L\left(\hat{f}, y\right) = \left(\hat{f} - y\right)^2$. In $K$-class classification models, it seems most common to use a $K$-dimensional output $\hat{g} = \left(g_1\left(z_1'A_0\right), g_2\left(z_1'A_0\right), \ldots, g_K\left(z_1'A_0\right)\right)$ with the "softmax" $g_k$ as defined in display (100) and to employ the cross-entropy loss
$$L(\hat{g}, y) = -\sum_{k=1}^{K}I[y = k]\ln g_k(x)$$
There are various possibilities for regularization of the ill-posed fitting problem for neural nets, ranging from the fairly formal and rational to the very informal and ad hoc. One possibility is to employ "stochastic gradient descent" and newly choose a random subset of the training set for use at each iteration of fitting. (It is popular to even go so far in this regard as to employ only a single case at each iteration.) Another common approach is to simply use an iterative fitting algorithm and "stop it before it converges." We proceed to briefly discuss more formal regularization.
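As the text notes, numerical approximation of the partials is often the most direct route. A small sketch (Python/NumPy, illustrative only) of the gradient descent recursion (103) for a single-hidden-layer SEL network, using centered finite-difference gradients in place of analytic back-propagation:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 2))
y = np.maximum(X[:, 0], 0.0) + rng.normal(0, 0.1, 60)

p, m = 2, 4                       # 2 inputs, 4 ReLU hidden nodes, single SEL output
n_par = (p + 1) * m + (m + 1)     # hidden-layer weights (with biases) plus output weights

def unpack(theta):
    A = theta[:(p + 1) * m].reshape(p + 1, m)
    b = theta[(p + 1) * m:]
    return A, b

def total_loss(theta):
    A, b = unpack(theta)
    Z = np.maximum(np.column_stack([np.ones(len(X)), X]) @ A, 0.0)   # ReLU hidden values
    f = np.column_stack([np.ones(len(X)), Z]) @ b                    # K = 1 linear output
    return np.sum((y - f) ** 2)

def num_grad(theta, eps=1e-5):
    g = np.zeros_like(theta)
    for j in range(len(theta)):                  # centered finite differences, one weight at a time
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (total_loss(theta + e) - total_loss(theta - e)) / (2 * eps)
    return g

theta = 0.1 * rng.normal(size=n_par)
lr = 1e-3
for _ in range(300):                             # the gradient descent recursion (103)
    theta = theta - lr * num_grad(theta)
print(total_loss(theta))
```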
8.3.2 Formal Regularization of Fitting

Suppose that the various coordinates of the input vectors in the training set have been standardized and one wants to regularize the fitting of a neural net. One possible way of proceeding is to define a penalty function like
$$J(A) = \sum_{h=0}^{H}\sum_{i,j}\left(a_{ij}^h\right)^2 \qquad (104)$$
for $A$ standing for the entire set of weights in $A_0, A_1, \ldots, A_H$ (it is not absolutely clear whether one really wants to include the weights on the "bias" terms in the neural net sums in (104)) and seek not to partially optimize the total training set loss $\sum_{i=1}^{N}L\left(\hat{f}_A(x_i), y_i\right)$ but rather to fully optimize
$$\sum_{i=1}^{N}L\left(\hat{f}_A(x_i), y_i\right) + \lambda J(A) \qquad (105)$$
for a $\lambda > 0$. By modifying the recursion (103) to
$$a^{\mathrm{new}} = a^{\mathrm{current}} - \eta\left(D\left(a^{\mathrm{current}}\right) + 2\lambda a^{\mathrm{current}}\right)$$
one arrives at an appropriate gradient descent algorithm for optimizing the penalized training loss (105). (Potentially, an appropriate value for $\lambda$ might be chosen based on cross-validation.)
Something that may at first seem quite different would be to take a Bayesian point of view. For example, with a univariate regression model for outputs
$$y_i = f(x_i|A) + \epsilon_i$$
for the $\epsilon_i$ iid $\mathrm{N}\left(0, \sigma^2\right)$, a likelihood is simply
$$l\left(A, \sigma^2\right) = \prod_{i=1}^{N}h\left(y_i|f(x_i|A), \sigma^2\right)$$
for $h\left(\cdot|\mu, \sigma^2\right)$ the normal pdf. If then $g\left(A, \sigma^2\right)$ specifies a prior distribution for $A$ and $\sigma^2$, a posterior for $\left(A, \sigma^2\right)$ has density proportional to
$$l\left(A, \sigma^2\right)g\left(A, \sigma^2\right)$$
For example, one might well assume that a priori the $a$s are iid $\mathrm{N}\left(0, \tau^2\right)$ (where small $\tau^2$ will provide regularization, and it is again unclear whether one wants to include the $a$s corresponding to bias terms in such an assumption or to instead provide more diffuse priors for them, like improper "Uniform$(-\infty, \infty)$" or at least large variance normal ones). A standard improper prior for $\sigma^2$ is $\ln\sigma^2 \sim$ Uniform$(-\infty, \infty)$. In any case, whether improper or proper, abuse notation and write $g\left(\sigma^2\right)$ for a prior density for $\sigma^2$.
Then with independent mean 0 variance $\tau^2$ priors for all the weights (except possibly the ones for bias terms, which might be given Uniform$(-\infty, \infty)$ priors) one has
$$\begin{aligned}
\ln\left(l\left(A, \sigma^2\right)g\left(A, \sigma^2\right)\right) &\propto -NK\ln(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - f(x_i|A)\right)^2 - \frac{1}{2\tau^2}J(A) + \ln g\left(\sigma^2\right) \\
&= -NK\ln(\sigma) + \ln g\left(\sigma^2\right) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{N}\left(y_i - f(x_i|A)\right)^2 + \frac{\sigma^2}{\tau^2}J(A)\right)
\end{aligned} \qquad (106)$$
(flat improper priors for the bias weights correspond to the absence of terms for them in the sums for $J(A)$ in form (104)). This recalls display (105) and suggests that an appropriate $\lambda$ for regularization can be thought of as a variance ratio of "observation variance" and prior variance for the weights.
It's fairly clear how to define Metropolis-Hastings-within-Gibbs algorithms for sampling from $l\left(A, \sigma^2\right)g\left(A, \sigma^2\right)$. But it seems that typically the high dimensionality of the parameter space combined with the symmetry-derived multi-modality of the posterior will prevent one from running an MCMC algorithm long enough to fully detail the posterior. It also seems unlikely, however, that detailing the posterior is really necessary or even desirable. Rather, one might simply run the MCMC algorithm, monitoring the values of $l\left(A, \sigma^2\right)g\left(A, \sigma^2\right)$ corresponding to the successively randomly generated MCMC iterates. An MCMC algorithm will spend much of its time where the corresponding posterior density is large, and we can expect that a long MCMC run will identify a nearly modal value for the posterior. Rather than averaging neural nets according to the posterior, one might instead use as a predictor a neural net corresponding to a parameter vector (at least locally) maximizing the posterior.
Notice that one might even take the parameter vector in an MCMC run with the largest $l\left(A, \sigma^2\right)g\left(A, \sigma^2\right)$ value and, for a grid of $\sigma^2/\tau^2$ values around the empirical maximizer, use the back-propagation algorithm modified to fully optimize
$$\sum_{i=1}^{N}\left(y_i - f(x_i|A)\right)^2 + \frac{\sigma^2}{\tau^2}J(A)$$
over choices of $A$. This, in turn, could be used with relationship (106) to perhaps improve somewhat the result of the MCMC "search."

8.4 Convolutional Neural Networks


An application of neural network type ideas that has received much recent at-
tention is that of image classi…cation. We will here provide a short introduction
to the area. Not surprisingly, success in this realm seems to rely as much upon
ideas from image processing as upon ideas from prediction.

115
Mathematically, a grey-scale image is typically represented by an L M
matrix X = [xlm ] where each xlm 2 f0; 1; 2; : : : ; 254; 255g represents a bright-
ness at location (l; m). A color image is often represented by 3 matrices
X r = [xlm ] ; X g = [xlm ] ; and X b = [xlm ] (again all with integer en-
L M L M L M
tries in f0; 1; 2; : : : ; 254; 255g) representing intensities in red, green, and blue
"channels." The standard machine h learning
i problem is to (based on a train-
r g b
ing set of N images X i or X ; X ; X with corresponding class identities
i
yi 2 f1; 2; : : : ; Kg) produce a classi…er. (For example, a standard test problem
is "automatic" recognition of hand-written digits 0 through 9.)
Simple convolutional neural networks with $H$ hidden layers and a softmax output layer producing class probabilities are successive compositions of more or less natural linear and non-linear operations that might be represented as follows. For $\Phi^{H}$ operating on $X$ using some set of real number parameters $A_{H}$ to produce some multivariate output (we will describe below some kinds of things that are popularly used) a "deepest layer" of the convolutional neural net produces
$$Z^{H}=\Phi^{H}\left(X;A_{H}\right)\qquad(107)$$
Then applying another set of operations $\Phi^{H-1}$ to the result (107) using some set of parameters $A_{H-1}$, the next layer of values in the convolutional neural net is produced as
$$Z^{H-1}=\Phi^{H-1}\left(Z^{H};A_{H-1}\right)$$
and so on, with
$$Z^{h}=\Phi^{h}\left(Z^{h+1};A_{h}\right)\qquad(108)$$
for $h=H,H-1,\ldots,1$ where $\Phi^{1}$ is $\Re^{K}$-valued (the top layer of hidden values is a $K$-vector). Then, with $g$ the softmax function, the output $K$-vector of class probabilities is
$$g\left(Z^{1}\right)$$
In multi-channel cases, it seems common to develop separate series of compositions based on $X^{r},X^{g},$ and $X^{b}$ and to bring them together only in the top or top few levels of this kind of hierarchy.
Variants of this basic structure are possible and have been used. For example, it is sometimes done to make a "direct connection" between layer $h$ and one deeper than layer $h+1$. That is, the option to employ a form
$$Z^{h}=\Phi^{h}\left(Z^{h+1},Z^{h+j};A_{h}\right)$$
for some $j>1$, or even a form
$$Z^{h}=\Phi^{h}\left(Z^{h+1},X;A_{h}\right)$$
making a direct connection to the input layer, is sometimes employed. (Obviously, even more complicated schemes are possible.)

Most of what we have said thus far in this section is not really special to the problem of image classification (and could serve as a high-level introduction to general neural net predictors). What sets the "convolutional" neural network field apart from "generic" neural network practice is the image-processing-inspired forms employed in the functions $\Phi^{h}$. The most fundamental form is one that applies "linear filters" to images followed by some nonlinear operation. This creates what is commonly called a "convolution" layer.
To make the idea of a convolutional layer precise, consider the following. Let $F$ be an $R\times C$ matrix. Typically this matrix is much smaller than the image and square (at least when "horizontal" and "vertical" resolutions in the images are the same), and $R$ and $C$ are often odd. One can then make from $F$ and $X$ a new matrix $F\ast X$ of dimension $\left(L-R+1\right)\times\left(M-C+1\right)$ with entries
$$\left(F\ast X\right)_{ij}=\sum_{a=1}^{R}\sum_{b=1}^{C}f_{ab}\,x_{\left(i+a-1\right),\left(j+b-1\right)}\qquad(109)$$
A natural way to think about this operation is to align an $R\times C$ integer grid with values in $F$ on the grid points with a (larger) corresponding grid for the image $X$, setting the upper left $\left(1,1\right)$ corner of the $F$ grid at the $\left(i,j\right)$ location on the $X$ grid, and to then sum products of aligned matrix entries. The entries of $F$ serve as weights on the values in the $R\times C$ part of the image aligned with the filter matrix. Figure 26 illustrates this process for a simple case where $F$ is $3\times 3$ (and so ultimately $F\ast X$ has 2 fewer columns and 2 fewer rows than $X$ and is thus $\left(L-2\right)\times\left(M-2\right)$).

Figure 26: Illustration of the use of the $3\times 3$ filter matrix $F$ with $L\times M$ image matrix $X$ to produce the $\left(L-2\right)\times\left(M-2\right)$ matrix $F\ast X$.

This convolution operation is linear, and it is typical practice to introduce non-linearity by following convolution operations in a layer with the hinge function $\max\left(u,0\right)$ applied to each element $u$ of the resulting matrix. Sometimes people (apparently wishing to not lose rows and columns in the convolution process) "0-pad" an image with extra rows and columns of 0s before doing the convolution, a practice that strikes this author as lacking sound rationale.
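To make the arithmetic of display (109) and the following hinge non-linearity concrete, here is a minimal Python/NumPy sketch. The function names and the toy image are invented for illustration and are not part of any particular library's API; the filter used is an edge-detecting matrix of the kind discussed just below.

import numpy as np

def convolve_valid(X, F):
    # Produce the (L-R+1) x (M-C+1) matrix of display (109): each entry is the
    # sum of products of the R x C filter F with the aligned patch of the image X.
    L, M = X.shape
    R, C = F.shape
    out = np.empty((L - R + 1, M - C + 1))
    for i in range(L - R + 1):
        for j in range(M - C + 1):
            out[i, j] = np.sum(F * X[i:i + R, j:j + C])
    return out

def hinge(U):
    # element-wise max(u, 0) applied after the linear filtering
    return np.maximum(U, 0.0)

X = np.arange(36, dtype=float).reshape(6, 6)      # toy 6 x 6 grey-scale "image"
F = np.array([[-1.0, 0.0, 1.0],
              [-2.0, 0.0, 2.0],
              [-1.0, 0.0, 1.0]])                  # a 3 x 3 edge-detecting filter
Z = hinge(convolve_valid(X, F))                   # one "convolution layer" output
print(Z.shape)                                    # (4, 4), i.e. (6-2) x (6-2)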
Multiple convolutions are typically created in a single convolution layer. Sometimes the filter matrices are filled with parameters to be determined in fitting (i.e. are part of $A_{h}$ in the representation (108)). But they can also be fixed matrices created for specific purposes. For example the $3\times 3$ matrices
$$S^{\mathrm{vert}}=\begin{bmatrix}-1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1\end{bmatrix}\quad\text{and}\quad S^{\mathrm{horiz}}=\begin{bmatrix}-1 & -2 & -1\\ 0 & 0 & 0\\ 1 & 2 & 1\end{bmatrix}$$
are respectively the vertical and horizontal Sobel filter matrices, commonly used in image processing when searching for edges of objects or regions. And various "blurring" filters (ordinary arithmetic averaging across a square of pixels and weighted averaging done according to values of a Gaussian density set at the center of an integer grid) are common devices meant to suppress noise in an image.
As multiple layers each with multiple new convolutions are created, there is potential explosion of the total dimensionality of the sets of $Z^{h}$ and $A_{h}$. Two devices for controlling that explosion are the notions of sampling and pooling to reduce the size of a $Z$. First, instead of creating and subsequently using an entire filtered image $F\ast X$, one can use only every $s$th row and column. In such a "sampling" operation $s$ is colloquially known as the "stride." Roughly speaking, this reduces the size of a $Z$ by a factor of $s^{2}$. Another possibility is to choose some block size, of size say $s\times t$, and divide an $L\times M$ image into roughly
$$\frac{L}{s}\times\frac{M}{t}$$
non-overlapping blocks, within a block applying a "pooling" rule like "simple averaging" or "maximum value." One then uses the rectangular array of these pooled values as a layer output. This, of course, reduces the size of a $Z$ by a factor of roughly $st$. It seems common to apply one of these ideas after each one or few convolution layers in a network, and especially before reaching the top and final one or few layers. The final hidden layers of a convolutional neural net are of the "ordinary" type described earlier, and if the dimensionality of their inputs is too large, numerical and fitting problems will typically ensue.

8.5 Recurrent Neural Networks


Another context where neural network ideas have found application is that of (non-linear) time series prediction. That is, vectors of inputs and outputs are sometimes indexed with time order and one expects information from previous periods to be of help in predicting response at the current one. To give some sense of what can be done, consider a generalization of the toy single hidden layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs used in Section 8.1. Where input/output pairs $\left(x,y\right)$ with $x\in\Re^{3}$ and $y\in\Re^{2}$ are indexed by (time) integer $t$, the notion of recurrent neural network practice is to allow values $z_{1t}$ and $z_{2t}$ at the hidden nodes to depend not only upon $x_{t}$ but also upon $z_{1,t-1}$ and $z_{2,t-1}$ and/or $y_{t-1}$.
A so-called Elman Network replaces the basic expressions for moving from input to hidden layer in Section 8.1 with
$$z_{1t}=\sigma\left(\alpha_{01}\cdot 1+\alpha_{11}x_{1t}+\alpha_{21}x_{2t}+\alpha_{31}x_{3t}+\gamma_{11}z_{1,t-1}+\gamma_{21}z_{2,t-1}\right)$$
$$z_{2t}=\sigma\left(\alpha_{02}\cdot 1+\alpha_{12}x_{1t}+\alpha_{22}x_{2t}+\alpha_{32}x_{3t}+\gamma_{12}z_{1,t-1}+\gamma_{22}z_{2,t-1}\right)$$
and a Jordan Network replaces them with
$$z_{1t}=\sigma\left(\alpha_{01}\cdot 1+\alpha_{11}x_{1t}+\alpha_{21}x_{2t}+\alpha_{31}x_{3t}+\gamma_{11}y_{1,t-1}+\gamma_{21}y_{2,t-1}\right)$$
$$z_{2t}=\sigma\left(\alpha_{02}\cdot 1+\alpha_{12}x_{1t}+\alpha_{22}x_{2t}+\alpha_{32}x_{3t}+\gamma_{12}y_{1,t-1}+\gamma_{22}y_{2,t-1}\right)$$

These are obviously some kind of non-linear auto-regressive relationships and introduce additional weights that must be fit in order to apply the prediction methodology.
It's obvious that once one opens this line of thinking many more complicated forms are possible. Forms for values at current hidden nodes could be postulated to depend explicitly on values at hidden nodes or outputs further in the past than period $t-1$. Both values at hidden nodes and outputs could be involved. Etc. Fitting algorithms based on gradient descent are tailored to the particular recurrence relationships employed.
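As a purely illustrative sketch of an Elman-type recurrence (in Python, with a logistic activation and arbitrarily chosen weight arrays standing in for whatever is actually fit), hidden-node values can be propagated forward through time as follows; a Jordan version would simply feed the previous period's outputs, rather than the previous hidden values, back in.

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def elman_hidden_states(X, alpha0, alpha, gamma, z_init=(0.0, 0.0)):
    # X: (T, 3) inputs; alpha0: (2,) biases; alpha: (3, 2) input weights;
    # gamma: (2, 2) weights on the previous period's hidden values
    z_prev = np.asarray(z_init, dtype=float)
    Z = []
    for x_t in X:
        z_t = logistic(alpha0 + x_t @ alpha + z_prev @ gamma)   # depends on x_t and z_{t-1}
        Z.append(z_t)
        z_prev = z_t
    return np.array(Z)

rng = np.random.default_rng(0)
Z = elman_hidden_states(rng.normal(size=(5, 3)), rng.normal(size=2),
                        rng.normal(size=(3, 2)), rng.normal(size=(2, 2)))
print(Z.shape)   # (5, 2): hidden values z_1t, z_2t for 5 time periods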

8.6 Radial Basis Function Networks


Section 6.7 of HTF considers the use of the kind of kernels applied in kernel smoothing as basis functions. That is, for
$$K_{\lambda}\left(x,\xi\right)=D\left(\frac{\left\|x-\xi\right\|}{\lambda}\right)$$
one might consider fitting nonlinear predictors of the form
$$f\left(x\right)=\beta_{0}+\sum_{j=1}^{M}\beta_{j}K_{\lambda_{j}}\left(x,\xi_{j}\right)\qquad(110)$$
where each basis element has prototype parameter $\xi_{j}$ and scale parameter $\lambda_{j}$. A common choice of $D$ for this purpose is the standard normal pdf.
A version of this with fewer parameters is obtained by restricting to cases where $\lambda_{1}=\lambda_{2}=\cdots=\lambda_{M}=\lambda$. This restriction, however, has the potentially unattractive effect of forcing "holes" or regions of $\Re^{p}$ where (in each) $f\left(x\right)\approx 0$, including all "large" $x$. A way to replace this behavior with potentially differing values in the former "holes" and directions of "large" $x$ is to replace the basis functions
$$K_{\lambda}\left(x,\xi_{j}\right)=D\left(\frac{\left\|x-\xi_{j}\right\|}{\lambda}\right)$$
with normalized versions
$$h_{j}\left(x\right)=\frac{D\left(\left\|x-\xi_{j}\right\|/\lambda\right)}{\sum_{k=1}^{M}D\left(\left\|x-\xi_{k}\right\|/\lambda\right)}$$
to produce a form
$$f\left(x\right)=\beta_{0}+\sum_{j=1}^{M}\beta_{j}h_{j}\left(x\right)\qquad(111)$$
The fitting of form (110) by choice of $\beta_{0},\beta_{1},\ldots,\beta_{M},\xi_{1},\xi_{2},\ldots,\xi_{M},\lambda_{1},\lambda_{2},\ldots,\lambda_{M}$ or form (111) by choice of $\beta_{0},\beta_{1},\ldots,\beta_{M},\xi_{1},\xi_{2},\ldots,\xi_{M},\lambda$ is fraught with all the problems of over-parameterization and lack of identifiability associated with neural networks.
Another way to use radial basis functions to produce flexible functional forms is to replace the forms $\sigma\left(\alpha_{0m}+\alpha_{m}'x\right)$ in a neural network with forms $K_{\lambda_{m}}\left(\alpha_{m}'x,\xi_{m}\right)$ or $h_{m}\left(\alpha_{m}'x\right)$.
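A minimal numerical sketch of the normalized form (111), with $D$ the standard normal pdf and with prototypes, scale, and coefficients chosen arbitrarily for illustration (none of the names below come from any particular package):

import numpy as np

def normalized_rbf_predict(x, xis, lam, beta0, beta):
    # f(x) = beta0 + sum_j beta_j h_j(x), with h_j the normalized radial basis functions
    d = np.array([np.linalg.norm(x - xi) for xi in xis])
    D = np.exp(-0.5 * (d / lam) ** 2) / np.sqrt(2 * np.pi)   # standard normal pdf values
    h = D / D.sum()                                          # normalization across j
    return beta0 + h @ beta

xis = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]   # M = 3 prototypes
print(normalized_rbf_predict(np.array([0.5, 0.5]), xis, lam=1.0,
                             beta0=0.2, beta=np.array([1.0, -0.5, 2.0])))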

9 Prediction Methods Based on Rectangles: Trees and PRIM
This section begins something genuinely new to our discussion. That is the search for good predictors that are constant on $p$-dimensional "rectangles" in the input space, that is on subsets of $\Re^{p}$ of the form
$$R=\left\{x\in\Re^{p}\mid a_{1}<x_{1}<b_{1}\ \text{and}\ a_{2}<x_{2}<b_{2}\ \ldots\ \text{and}\ a_{p}<x_{p}<b_{p}\right\}$$
for (possibly infinite) values $a_{j}<b_{j}$ for $j=1,2,\ldots,p$. The basic idea is that if the values $a_{j}$ and $b_{j}$ can be chosen so that $y$s corresponding to vectors of inputs $x$ in a training set in a particular rectangle are "homogeneous," then a corresponding SEL predictor using training set "rectangle mean responses" or a 0-1 loss classifier using training set "rectangle majority classes" might be approximately optimal.27
The search for good predictors constant on rectangles is fundamentally an algorithmic matter, rather than something that will have a nice closed form representation (it is not like ridge regression, for example). But (provided "fast" and "effective" algorithms can be identified) it has things that make it very attractive. For one thing, there is complete invariance to monotone transformation of numerical features. It is irrelevant to searches for good boundaries for rectangles whether a coordinate of the input $x$ is expressed on an "original" scale or a log scale or on another (monotone transform of the original scale). The same predictor/predictions will result. This is a very attractive and powerful feature and is no doubt partly responsible for the popularity of rectangle-based predictors as building blocks for more complicated methods (like "boosting trees").
27 This is essentially the same motivation provided for nearest neighbor rules in Section 1.3.3.
The structure of predictors constant on rectangles is also an intuitively ap-
pealing one, easily explained and understood. This helps make them very
popular with non-technical consumers of predictive analytics.
In this section we consider two rectangle-based prediction methods, the first (CART) using binary tree structures and the second (PRIM) employing a kind of "bump-hunting" logic.

9.1 Regression and Classification Trees (CART)


The common acronym for this methodology is CART (classification and regression trees) and classification trees are sometimes referred to as "decision trees." Here we'll first consider the SEL/regression version and then the classification version.

9.1.1 Regression Trees


We consider a forward-selection/"greedy" algorithm for inventing predictions constant on $p$-dimensional rectangles, by successively looking for an optimal binary split of a single one of an existing set of rectangles. Define
$$a_{j}=\min_{i=1,2,\ldots,N}x_{ij}\quad\text{and}\quad b_{j}=\max_{i=1,2,\ldots,N}x_{ij}$$
Begin with the rectangle in $\Re^{p}$
$$R=\prod_{j=1}^{p}\left[a_{j},b_{j}\right]=\left\{x\in\Re^{p}\mid\text{each }a_{j}\leq x_{j}\leq b_{j}\right\}$$
and look for an index $j_{1}$ and a value $a_{j_{1}}<s_{1}<b_{j_{1}}$ (with $s_{1}\neq x_{ij_{1}}$ for any $i$) so that splitting the initial rectangle at $x_{j_{1}}=s_{1}$ (to produce the two sub-rectangles $R\cap\left\{x\in\Re^{p}\mid x_{j_{1}}\leq s_{1}\right\}$ and $R\cap\left\{x\in\Re^{p}\mid x_{j_{1}}>s_{1}\right\}$) makes the resulting two rectangles minimize
$$SSE=\sum_{\text{rectangles}}\;\sum_{i\text{ with }x_{i}\text{ in the rectangle}}\left(y_{i}-\bar{y}_{\text{rectangle}}\right)^{2}$$

One then splits (optimally) one of the (now) two rectangles on some variable
xj2 at some s2 (with s2 6= xij2 for any i) etc.
Where l rectangles in <p (say R1 ; R2 ; : : : ; Rl ) have been created, and

m (x) = the index of the rectangle to which x belongs

the corresponding SEL tree predictor is
$$\hat{f}_{l}\left(x\right)=\frac{1}{\#\text{ training input vectors }x_{i}\text{ in }R_{m\left(x\right)}}\sum_{i\text{ with }x_{i}\text{ in }R_{m\left(x\right)}}y_{i}$$
and in this notation the training error is
$$SSE=\sum_{i=1}^{N}\left(y_{i}-\hat{f}_{l}\left(x_{i}\right)\right)^{2}$$
or the corresponding mean squared prediction error is
$$\overline{err}=\frac{1}{N}SSE$$
If one is to continue beyond l rectangles, one then looks for a value sl to split
one of the existing rectangles R1 ; R2 ; : : : ; Rl on some xjl and thereby produce
the greatest reduction in SSE. (We note that there is no guarantee that after
l splits one will have the best (in terms of SSE) possible set of l + 1 rectangles.)
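The core computation of the greedy algorithm is the search over coordinates and candidate split points for the single split minimizing SSE. A minimal Python sketch (ignoring efficiency and the bookkeeping needed to manage a growing collection of rectangles; all names invented for illustration) is:

import numpy as np

def best_single_split(X, y):
    # Search every coordinate j and every midpoint s between adjacent observed values
    # of x_j for the binary split of the current rectangle minimizing the two-piece SSE.
    N, p = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(p):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 3))
y = 2.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.1, size=50)
print(best_single_split(X, y))   # should split on coordinate 0 near 0.5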
Any series of binary splits of rectangles can be represented graphically as a
binary tree, each split represented by a node where there is a fork and each …nal
rectangle by an end node. It is convenient to discuss rectangle-splitting (and
"unsplitting") in terms of operations on corresponding binary trees, and hence-
forth we adopt this language. Figures 27 and 28 provide three representations
of the same hypothetical regression tree with p = 2 ... a predictor constant on
rectangles in <2 .

Figure 27: Two representations of a hypothetical tree predictor for p = 2.

The basic formulation of the tree-growing method here employs "one-step-


at-a-time"/"greedy" (unable to defer immediate reward for the possibility of
later success) methods. Like all such methods, they are not guaranteed to
follow paths through the set of trees that ever get to "best" ones since they are
"myopic," never considering what might be later in a search, if a current step
were taken that provides little immediate payo¤.

Figure 28: A third representation of the hypothetical p = 2 tree predictor
portrayed in Figure 27.

A regression tree example (essentially suggested by Mark Culp) dramatically illustrates this limitation. Consider a $p=3$ case where $x_{1}\in\{0,1\}$, $x_{2}\in\{0,1\}$, and $x_{3}\in\left[0,1\right]$. In fact, suppose that $x_{1}$ and $x_{2}$ are iid Bernoulli$(.5)$ independent of $x_{3}$ that is Uniform$\left(0,1\right)$. Then suppose that conditional on $\left(x_{1},x_{2},x_{3}\right)$ the output $y$ is N$\left(\mu\left(x_{1},x_{2},x_{3}\right),1\right)$ for
$$\mu\left(x_{1},x_{2},x_{3}\right)=1000\cdot I\left[x_{1}=x_{2}\right]+x_{3}$$
For a big training sample iid from this joint distribution, all branching will typically be done on the continuous variable $x_{3}$, completely missing the fundamental fact that it is the (joint) behavior of $\left(x_{1},x_{2}\right)$ that drives the size of $y$. (This example also supports the conventional wisdom that as presented the splitting algorithm "favors" splitting on continuous variables over splitting on values of discrete ones.)

9.1.2 Classification Trees


The "classi…cation trees" version of this material is very similar to the continuous
y (SEL) regression tree version. One needs only to de…ne an empirical loss to
associate with a given tree parallel to SSE used above. To that end, note that
in a K-class problem (where y takes values in G = f1; 2; : : : ; Kg) corresponding
to a particular rectangle Rm is the fraction of training vectors with classi…cation
k,
1 X
pd
mk = I [yi = k]
# training input vectors in Rm
i with xi
in Rm

and a plausible G-valued predictor based on l rectangles is

f^l (x) = arg max p\


m(x)k
k2G

the class that is most heavily represented in the rectangle to which $x$ belongs.28
The empirical misclassification rate for this predictor (that can be used as a rectangle-splitting criterion) is
$$\overline{err}=\frac{1}{N}\sum_{i=1}^{N}I\left[y_{i}\neq\hat{f}_{l}\left(x_{i}\right)\right]=\frac{1}{N}\sum_{m=1}^{l}N_{m}\left(1-\hat{p}_{mk\left(m\right)}\right)$$
where $N_{m}=\#$ training input vectors in $R_{m}$, and $k\left(m\right)=\arg\max_{k\in G}\hat{p}_{mk}$. Two other popular splitting criteria are "the Gini index"
$$\overline{err}=\frac{1}{N}\sum_{m=1}^{l}N_{m}\left(\sum_{k=1}^{K}\hat{p}_{mk}\left(1-\hat{p}_{mk}\right)\right)$$
and the so-called "cross entropy"
$$\overline{err}=-\frac{1}{N}\sum_{m=1}^{l}N_{m}\left(\sum_{k=1}^{K}\hat{p}_{mk}\ln\left(\hat{p}_{mk}\right)\right)$$
These latter two criteria are average (across rectangles) measures of "purity" (near degeneracy) of training set response distributions in the rectangles. Upon adopting one of these forms to replace $SSE$ in the regression tree discussion, one has a classification tree methodology. HTF suggest using the Gini index or cross entropy for tree growing and any of the indices (but most typically the empirical misclassification rate) for tree pruning according to cost-complexity (to be discussed next).
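For a single rectangle, the three criteria just displayed are simple functions of the class fractions $\hat{p}_{mk}$ (the overall criteria weight the per-rectangle values by $N_{m}/N$ and sum across rectangles). A small illustrative computation in Python:

import numpy as np

def node_impurities(y_in_rectangle, K):
    # misclassification rate, Gini index, and cross entropy for one rectangle
    y = np.asarray(y_in_rectangle)
    p_hat = np.array([np.mean(y == k) for k in range(1, K + 1)])
    misclass = 1.0 - p_hat.max()
    gini = np.sum(p_hat * (1.0 - p_hat))
    nz = p_hat[p_hat > 0]                      # treat 0 * ln(0) as 0
    cross_entropy = -np.sum(nz * np.log(nz))
    return misclass, gini, cross_entropy

print(node_impurities([1, 1, 2, 3, 1, 1], K=3))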

9.1.3 Optimal Subtrees


It is, of course, possible to continue splitting rectangles/adding branches to a tree until every distinct $x$ in the training set has its own rectangle. But that is not helpful in a practical sense, in that it corresponds to a very "low bias/high variance"/complex predictor. So how does one find a tree of appropriate size? How does one choose a size at which to stop growing a tree, or more generally, prune a large tree back to a good size? (This latter is more general in that pruning a tree can produce subtrees not met in a sequence of trees as built up to a final one.) It turns out that it is possible to efficiently find a nested sequence of "optimal" subtrees of a large tree, and that methodology can in turn be used in cross-validation.
For $T$ a subtree of some fixed large tree $T_{0}$ (e.g. grown until the cell with the fewest training $x_{i}$ contains 5 or less such points or in classification contexts until the training error is 0) write
$$E\left(T\right)=N\,\overline{err}=\sum_{i=1}^{N}L\left(\hat{f}\left(x_{i}\right),y_{i}\right)$$
28 Much as we noted regarding nearest neighbor methods in Section 1.3.3, it can in some contexts be more useful to have the $\hat{p}_{mk}$ values themselves than to have only the 0-1 loss classifier derived from them.

(the total training error for the tree predictor based on $T$). For $\alpha>0$ define the quantity
$$C_{\alpha}\left(T\right)=\left|T\right|+\alpha E\left(T\right)$$
(for, in the obvious way, $\left|T\right|$ the number of final nodes in the candidate tree).29 Write
$$T\left(\alpha\right)=\arg\min_{\text{subtrees }T}C_{\alpha}\left(T\right)$$
and let $\hat{f}_{\alpha}$ be the corresponding predictor.


The question of how to find a subtree $T\left(\alpha\right)$ optimizing $C_{\alpha}\left(T\right)$ without making an exhaustive search over subtrees for every different value of $\alpha$ has a workable answer. There is a relatively small number of nested candidate subtrees that are the only ones that are possible minimizers of $C_{\alpha}\left(T\right)$, and as $\alpha$ decreases one moves through that nested sequence of subtrees from the largest/original tree to the smallest.
One may quickly search over all "pruned" versions of $T_{0}$ (subtrees $T$ created by removing a node where there is a fork and all branches that follow below it) and find the one with minimum
$$\frac{E\left(T\right)-E\left(T_{0}\right)}{\left|T_{0}\right|-\left|T\right|}$$
(This IS the increase in $E$ per node of the lopped-off branch of the first tree.) Call that subtree $T_{1}$. $T_{0}$ is the optimizer of $C_{\alpha}\left(T\right)$ over subtrees of $T_{0}$ for every $\alpha\geq\left(\left|T_{0}\right|-\left|T_{1}\right|\right)/\left(E\left(T_{1}\right)-E\left(T_{0}\right)\right)$, but at $\alpha=\left(\left|T_{0}\right|-\left|T_{1}\right|\right)/\left(E\left(T_{1}\right)-E\left(T_{0}\right)\right)$, the optimizing subtree switches to $T_{1}$.
One then may search over all "pruned" versions of $T_{1}$ for the one with minimum
$$\frac{E\left(T\right)-E\left(T_{1}\right)}{\left|T_{1}\right|-\left|T\right|}$$
and call it $T_{2}$. $T_{1}$ is the optimizer of $C_{\alpha}\left(T\right)$ over subtrees of $T_{0}$ for every $\left(\left|T_{1}\right|-\left|T_{2}\right|\right)/\left(E\left(T_{2}\right)-E\left(T_{1}\right)\right)\leq\alpha\leq\left(\left|T_{0}\right|-\left|T_{1}\right|\right)/\left(E\left(T_{1}\right)-E\left(T_{0}\right)\right)$, but at $\alpha=\left(\left|T_{1}\right|-\left|T_{2}\right|\right)/\left(E\left(T_{2}\right)-E\left(T_{1}\right)\right)$ the optimizing subtree switches to $T_{2}$; and so on. (In this process, if there ever happens to be a tie among subtrees in terms of minimizing a ratio of increase in total training error per decrease in number of nodes, one chooses the subtree with the smaller $\left|T\right|$.) For $T\left(\alpha\right)$ optimizing $C_{\alpha}\left(T\right)$, the function of $\alpha$,
$$C_{\alpha}\left(T\left(\alpha\right)\right)=\min_{T}C_{\alpha}\left(T\right)$$
is piecewise linear in $\alpha$, and both it and the optimizing nested sequence of subtrees can be computed very efficiently in this fashion. Figure 29 illustrates the geometry of the situation. Notice that $\alpha$ is a complexity parameter and $\left|T\left(\alpha\right)\right|$ is non-decreasing in $\alpha$.
29 It is, of course, equivalent to consider a quantity $E\left(T\right)+\lambda\left|T\right|$ for a $\lambda>0$ in notation more like that used in other contexts like the ridge regression problem. Using $\alpha$ as a weight on $E\left(T\right)$ to define $C_{\alpha}$ is equivalent to using $\lambda=1/\alpha$ as a weight on $\left|T\right|$. The penalized form $E\left(T\right)+\lambda\left|T\right|$ is probably more often used (at least for user interface) in software implementations. The present $C_{\alpha}$ is more natural in the context of the development of optimal subtrees.
Figure 29: Cartoon of functions of $\alpha$, $C_{\alpha}\left(T\right)$ for fixed $T$, and the optimized version $C_{\alpha}\left(T\left(\alpha\right)\right)$.

One can then employ K-fold cross-validation to choose $\alpha$ as follows. For each of the $K$ remainders $T-T_{k}$ (in the notation of Section 1.3.6)

1. grow an appropriate large tree (on a given dataset), then

2. "prune" the tree in 1. back by, for each $\alpha>0$ (a complexity parameter, weighting the remainder-in-training-sample error total $E_{k}$ (for $T-T_{k}$) against complexity defined in terms of tree size), minimizing over choices of subtrees the quantity
$$C_{\alpha}^{k}\left(T\right)=\left|T\right|+\alpha E_{k}\left(T\right)$$
(for $E_{k}\left(T\right)$ the error total for the corresponding tree predictor). Write
$$T^{k}\left(\alpha\right)=\arg\min_{\text{subtrees }T}C_{\alpha}^{k}\left(T\right)$$
and let $\hat{f}_{\alpha}^{k}$ be the corresponding predictor.

Then (as in Section 1.3.6), letting $k\left(i\right)$ be the index of the fold $T_{k}$ containing training case $i$, one computes the cross-validation error
$$CV\left(\alpha\right)=\frac{1}{N}\sum_{i=1}^{N}L\left(\hat{f}_{\alpha}^{k\left(i\right)}\left(x_{i}\right),y_{i}\right)$$
For $\hat{\alpha}$ a minimizer of $CV\left(\alpha\right)$, one then operates on the entire training set, growing a large tree $T$ and then finding the subtree, say $T\left(\hat{\alpha}\right)$, optimizing $C_{\hat{\alpha}}\left(T\right)=\left|T\right|+\hat{\alpha}E\left(T\right)$, and using the corresponding predictor $\hat{f}_{\hat{\alpha}}$.
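For what it is worth, standard software implements essentially this nested-subtree/cross-validation program. A hedged sketch using scikit-learn (one possible implementation; note that, per footnote 29, its ccp_alpha penalizes tree size in the $E\left(T\right)+\lambda\left|T\right|$ form rather than weighting $E\left(T\right)$):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + rng.normal(scale=0.3, size=200)

# breakpoints of the nested sequence of candidate subtrees of a large tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

cv_mse = []
for a in path.ccp_alphas:
    scores = cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                             X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best = path.ccp_alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best).fit(X, y)
print(best, final_tree.get_n_leaves())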

9.1.4 Measuring the Importance of Inputs for Tree Predictors
Consider the matter of assigning measures of "importance" of input variables for a tree predictor. In the spirit of ordinary linear models assessment of the importance of a predictor in terms of some reduction it provides in some error sum of squares, Breiman suggested the following. Suppose that in a regression or classification tree, input variable $x_{j}$ provides the rectangle splitting criterion for nodes $\mathrm{node}_{1j},\ldots,\mathrm{node}_{m\left(j\right)j}$ and that before splitting at $\mathrm{node}_{lj}$, the relevant rectangle $R_{lj}$ has (for $\hat{y}_{lj}$ the prediction fit for that rectangle) associated sum of training losses
$$E_{lj}=\sum_{i\text{ with }x_{i}\in R_{lj}}L\left(\hat{y}_{lj},y_{i}\right)$$
and that after splitting $R_{lj}$ on variable $x_{j}$ to create rectangles $R_{lj}^{1}$ and $R_{lj}^{2}$ (with respective fitted predictions $\hat{y}_{lj}^{1}$ and $\hat{y}_{lj}^{2}$) one has sums of training losses associated with those two rectangles
$$E_{lj}^{1}=\sum_{i\text{ with }x_{i}\in R_{lj}^{1}}L\left(\hat{y}_{lj}^{1},y_{i}\right)\quad\text{and}\quad E_{lj}^{2}=\sum_{i\text{ with }x_{i}\in R_{lj}^{2}}L\left(\hat{y}_{lj}^{2},y_{i}\right)$$
The reduction in total error provided by the split on $x_{j}$ at $\mathrm{node}_{lj}$ is thus
$$D_{lj}=E_{lj}-\left(E_{lj}^{1}+E_{lj}^{2}\right)$$
(In regression/SEL contexts, this is a reduction in error sum of squares provided by the split of $R_{lj}$. In 0-1 loss classification contexts it is a reduction in training set misclassification errors.) One might then take
$$I_{j}=\sum_{l=1}^{m\left(j\right)}D_{lj}$$
as a measure of the importance of $x_{j}$ in fitting the tree and compare the various $I_{j}$s (or perhaps the square roots, $\sqrt{I_{j}}$s).
Further, if a predictor is a (weighted) sum of regression trees (e.g. produced by "boosting" or in a "random forest") and $I_{jm}$ measures the importance of $x_{j}$ in the $m$th tree, then
$$I_{j\cdot}=\frac{1}{M}\sum_{m=1}^{M}I_{jm}$$
is perhaps one measure of the importance of $x_{j}$ in the overall predictor. One can then compare the various $I_{j\cdot}$ (or square roots) as a means of comparing the importance of the input variables.
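The bookkeeping behind $I_{j}=\sum_{l}D_{lj}$ is trivial once split-wise reductions are recorded; the Python sketch below (with an invented split-record format) simply accumulates them, and tree software commonly reports a normalized version of this kind of quantity.

from collections import defaultdict

def importances_from_splits(splits, p):
    # splits: list of dicts with 'j' (splitting variable index) and 'D' (loss reduction)
    I = defaultdict(float)
    for s in splits:
        I[s["j"]] += s["D"]
    return [I[j] for j in range(p)]

splits = [{"j": 0, "D": 40.2}, {"j": 2, "D": 11.5}, {"j": 0, "D": 3.1}]
print(importances_from_splits(splits, p=3))   # approximately [43.3, 0.0, 11.5]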

9.2 PRIM (Patient Rule Induction Method)


This is another rectangle-based method of making a predictor on $\Re^{p}$. The language seems to be "patient" as opposed to "rash" and "rule induction" as in "predictor development" or perhaps "conjunctive rule development" from the context of "market basket analysis." See Section 17.1 in regard to this latter usage.
PRIM can be thought of as a type of "bump-hunting." For a series of rectangles (or boxes) in $p$-space
$$R_{1},R_{2},\ldots,R_{l}$$
one defines a predictor
$$\hat{f}_{l}\left(x\right)=\begin{cases}\bar{y}_{R_{1}} & \text{if }x\in R_{1}\\ \bar{y}_{R_{2}-R_{1}} & \text{if }x\in R_{2}-R_{1}\\ \;\;\vdots & \;\;\vdots\\ \bar{y}_{R_{m}-\cup_{k=1}^{m-1}R_{k}} & \text{if }x\in R_{m}-\cup_{k=1}^{m-1}R_{k}\\ \;\;\vdots & \;\;\vdots\\ \bar{y}_{\left(\cup_{k=1}^{l}R_{k}\right)^{c}} & \text{if }x\notin\cup_{k=1}^{l}R_{k}\end{cases}$$

The boxes or rectangles are defined recursively in a way intended to catch "the remaining part of the input space with the largest output values." That is, to find $R_{1}$

1. identify a rectangle
$$l_{1}\leq x_{1}\leq u_{1}$$
$$l_{2}\leq x_{2}\leq u_{2}$$
$$\vdots$$
$$l_{p}\leq x_{p}\leq u_{p}$$
that includes all input vectors in the training set,

2. identify a dimension, $j$, and either $l_{j}$ or $u_{j}$ so that by reducing $u_{j}$ or increasing $l_{j}$ just enough to remove a fraction $\alpha$ (say $\alpha=.1$) of the training vectors currently in the rectangle, the largest value of
$$\bar{y}_{\text{rectangle}}$$
possible is produced, and update that boundary of the rectangle,

3. repeat 2. until some minimum number of training inputs $x_{i}$ remain in the rectangle (say, at least 10),

4. expand the rectangle in any direction (increase a $u_{j}$ or decrease an $l_{j}$) adding a training input vector that provides a maximal increase in $\bar{y}_{\text{rectangle}}$, and

5. repeat 4. until no increase is possible by adding a single training input vector.

This produces $R_{1}$. For what it is worth, step 2. is called "peeling" and step 4. is called "pasting."
Upon producing $R_{1}$, one removes from consideration all training vectors with $x_{i}\in R_{1}$ and repeats 1. through 5. to produce $R_{2}$. This continues until a desired number of rectangles has been created. One may pick an appropriate number of rectangles ($l$ is a complexity parameter) by cross-validation and then apply the procedure to the whole training set to produce a set of rectangles and predictor on $p$-space that is piece-wise constant on regions built from boolean operations on rectangles.
PRIM is not anywhere near as common as classification and regression trees, but shares with them some of their attractive features, especially invariance to monotone transformation of coordinates of an input vector.
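A sketch of a single "peeling" pass (step 2. above), written in Python under the simplification that candidate peels are taken at the $\alpha$ and $1-\alpha$ quantiles of each coordinate within the current box; all names are illustrative.

import numpy as np

def peel_once(X, y, lower, upper, alpha=0.1):
    # Shrink one face of the box [lower, upper] just enough to drop about a fraction
    # alpha of the cases it contains, choosing the peel leaving the largest mean response.
    inside = np.all((X >= lower) & (X <= upper), axis=1)
    Xb, yb = X[inside], y[inside]
    best_mean, best_box = -np.inf, (lower, upper)
    for j in range(X.shape[1]):
        for side, cut in (("lower", np.quantile(Xb[:, j], alpha)),
                          ("upper", np.quantile(Xb[:, j], 1.0 - alpha))):
            new_lower, new_upper = lower.copy(), upper.copy()
            if side == "lower":
                new_lower[j] = cut
            else:
                new_upper[j] = cut
            keep = np.all((Xb >= new_lower) & (Xb <= new_upper), axis=1)
            if keep.any() and yb[keep].mean() > best_mean:
                best_mean, best_box = yb[keep].mean(), (new_lower, new_upper)
    return best_box, best_mean

# one would start from the box containing all training inputs, peel repeatedly until
# few cases remain, and then "paste" as in step 4.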

10 Predictors Built on Bootstrap Samples


10.1 Bagging in General
One might make $B$ bootstrap samples of $N$ (random samples with replacement of size $N$) from the training set $T$, say $T^{1},T^{2},\ldots,T^{B}$, and train on these bootstrap samples using a particular method of prediction to produce, say,
$$\text{predictor }\hat{f}^{b}\text{ based on }T^{b}$$
Rather than using these to estimate the prediction error as in Section 16.4, consider using them to build a predictor.
The possibility considered in Section 8.7 of HTF is the use of bootstrap aggregation, or "bagging" under SEL. This is use of the predictor
$$\hat{f}_{\text{bag}}\left(x\right)\equiv\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}\left(x\right)$$
Notice that even for fixed training set $T$ and input $x$, this is random (varying with the selection of the bootstrap samples). One might let $\mathrm{E}$ denote averaging over the creation of a single bootstrap sample and $\hat{f}$ be the predictor derived from such a bootstrap sample and think of
$$\mathrm{E}\hat{f}\left(x\right)$$
as the "true" bagging predictor under SEL (that has the simulation-based approximation $\hat{f}_{\text{bag}}\left(x\right)$). One is counting on a law of large numbers to conclude that $\hat{f}_{\text{bag}}\left(x\right)\rightarrow\mathrm{E}\hat{f}\left(x\right)$ as $B\rightarrow\infty$. Note too, that unless the operations applied to a training set to produce $\hat{f}$ are linear, $\mathrm{E}\hat{f}\left(x\right)$ will differ from the predictor computed from the training data, $\hat{f}\left(x\right)$. The primary motivation for SEL bagging is the hope of averaging (not-perfectly-correlated as they are built on not-completely-overlapping bootstrap samples) low-bias/high-variance predictors to reduce variance (while maintaining low bias).
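A minimal sketch of SEL bagging, using scikit-learn regression trees as an (arbitrary) base method:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predictor(X, y, B=200, random_state=0):
    # fit the base method to B bootstrap samples and average the resulting predictors
    rng = np.random.default_rng(random_state)
    N = len(y)
    fits = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # bootstrap sample of size N
        fits.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return lambda X_new: np.mean([f.predict(X_new) for f in fits], axis=0)

rng = np.random.default_rng(3)
X = rng.uniform(size=(100, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=100)
f_bag = bagged_predictor(X, y)
print(f_bag(X[:3]))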

A bagged predictor in the 0-1 loss classification case is
$$\hat{f}_{\text{bag}}\left(x\right)=\arg\max_{k}\sum_{b=1}^{B}I\left[\hat{f}^{b}\left(x\right)=k\right]$$
(a majority vote combination of the individual classifiers). One here expects that for each $k$ a law of large numbers will imply that
$$\frac{1}{B}\sum_{b=1}^{B}I\left[\hat{f}^{b}\left(x\right)=k\right]\rightarrow P\left[\hat{f}\left(x\right)=k\right]\text{ as }B\rightarrow\infty$$
so that there is a limiting classifier
$$\arg\max_{k}P\left[\hat{f}\left(x\right)=k\right]$$
for which $\hat{f}_{\text{bag}}\left(x\right)$ is a simulation-based approximation.


It is common practice to make a kind of running cross-validation estimate of error based on "out-of-bag" (OOB) samples as one builds a bagged predictor. Note that (in cases where all training cases are different) for large $N$ on average $T^{b}$ fails to contain about 37% of training cases.30 Then, for each $b$ suppose one keeps track of the set of (OOB) indices $I\left(b\right)\subset\{1,2,\ldots,N\}$ for which the corresponding training vector does not get included in the bootstrap training set $T^{b}$. In SEL contexts let
$$\hat{y}_{i}^{B}=\frac{1}{\#\text{ of indices }b\leq B\text{ such that }i\in I\left(b\right)}\sum_{b\leq B\text{ such that }i\in I\left(b\right)}\hat{f}^{b}\left(x_{i}\right)$$
and in 0-1 loss classification contexts let
$$\hat{y}_{i}^{B}=\arg\max_{k}\sum_{b\leq B\text{ such that }i\in I\left(b\right)}I\left[\hat{f}^{b}\left(x_{i}\right)=k\right]$$
Then in SEL regression contexts, a running cross-validation type of estimate of Err is
$$OOB\left(B\right)=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}^{B}\right)^{2}$$
and a corresponding estimate for 0-1 loss classification contexts is
$$OOB\left(B\right)=\frac{1}{N}\sum_{i=1}^{N}I\left[y_{i}\neq\hat{y}_{i}^{B}\right]$$
One then expects the convergence of OOB$\left(B\right)$, and plotting of OOB$\left(B\right)$ versus $B$ is a standard way of trying to assess whether enough bootstrap samples have
30 The probability that a particular training case is missed in a bootstrap sample is $\left(1-N^{-1}\right)^{N}\approx e^{-1}\approx .37$ for $N$ of any reasonable size.

been made to adequately represent the limiting predictor. In spite of the fact that for small $B$ the (random) predictor $\hat{f}_{B}$ is built on a small number of bootstrap samples and is fairly simple, $B$ is not really a complexity parameter, but is rather a convergence parameter.
Where losses other than SEL or 0-1 loss are involved, exactly how to "bag" bootstrapped versions of a predictor is not altogether obvious, and apparently even what might look like sensible possibilities can do poorly.

10.2 Random Forests: Special Bagging of Tree Predictors


This is an elaboration of the "bagging" (bootstrap aggregation) idea of Section
10.1 applied speci…cally to (regression and classi…cation) trees. For each one of
B bootstrap samples of N (from the training set T ), T b , develop a corresponding
regression or classi…cation tree by

1. at each node, randomly selecting m of the p input variables and …nding an


optimal single split of the corresponding rectangle over the selected input
variables, splitting the rectangle, and

2. repeating 1 at each node up to a …xed depth or until no single-split im-


provement in splitting criterion is possible without creating a rectangle
with less than a small number of training cases, nm in .

(Note that no pruning is applied in this development.) Then let f^ b (x) be the
corresponding tree-based predictor (taking values in < in the regression case or
in G = f1; 2; : : : ; Kg in the classi…cation case). A random forest predictor in
the regression case is then
B
1 X ^b
f^B (x) = f (x)
B
b=1

and a 0-1 loss random classi…er is


B
X h i
f^B (x) = arg max I f^ b (x) = k
k
b=1

(This is a "majority vote" of the B constituent classi…cation trees.)


As we have noted before in reference to nearest neighbor classification and classification trees, it can be more important in $K$-class classification models to estimate $P\left[y=k\mid x\right]$ than it is to approximate the optimal classifier at $x$. If, for tree $b$ in a forest of $B$ such trees,
$$I^{b}\left(x\right)=\text{the set of indices of training cases with }x_{i}\text{ in the same rectangle as }x$$
then
$$\hat{p}_{k}^{b}\left(x\right)=\frac{\sum_{x_{i}\in I^{b}\left(x\right)}I\left[y_{i}=k\right]}{\#\left[x_{i}\in I^{b}\left(x\right)\right]}$$
(the fraction of training cases with $x_{i}$ in the same rectangle as $x$ and $y_{i}=k$) estimates this probability using tree $b$. Then one random forest estimate of $P\left[y=k\mid x\right]$ is the simple average
$$\frac{1}{B}\sum_{b=1}^{B}\hat{p}_{k}^{b}\left(x\right)$$
An alternative possibility is the weighted average
$$\frac{\sum_{b=1}^{B}\#\left[x_{i}\in I^{b}\left(x\right)\right]\hat{p}_{k}^{b}\left(x\right)}{\sum_{b=1}^{B}\#\left[x_{i}\in I^{b}\left(x\right)\right]}=\frac{\sum_{b=1}^{B}\sum_{x_{i}\in I^{b}\left(x\right)}I\left[y_{i}=k\right]}{\sum_{b=1}^{B}\#\left[x_{i}\in I^{b}\left(x\right)\right]}$$

The basic tuning parameters in the development of $\hat{f}_{B}\left(x\right)$ are then $m$, and $n_{\min}$, and (if used) a maximum tree depth. Standard default values of parameters are
$$m=\lfloor p/3\rfloor\ \text{and}\ n_{\min}=5\ \text{for regression problems, and}$$
$$m=\sqrt{p}\ \text{and}\ n_{\min}=1\ \text{for classification problems.}$$

The default $n_{\min}=1$ for classification problems means that splitting terminates only because of reaching a maximum depth or the impossibility of reducing the splitting criterion with a single additional split. In the event that the maximum tree depth really doesn't come into play (because it is set to some value that is large in relative terms) this will produce random forest classifiers with 0 training error rate. (Any given training case will be missed by only about 37% of $B$ bootstrap samples, so that about 63% of the $B$ bootstrap samples will produce a tree correctly classifying the case, and so majority voting means that the random forest will correctly classify the case.) But notice that this does not imply that the out-of-bag error OOB$\left(B\right)$ will be 0. And it does not imply that OOB$\left(B\right)$ for large $B$ is unreliable as an indicator of the likely performance of a random forest classifier. It only implies that the training error rate is completely unreliable as an indicator of random forest classifier efficacy.
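A hedged illustration using scikit-learn (one possible implementation, whose parameter names are particular to that library) with settings in the spirit of the defaults above, including its OOB performance summaries:

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 6))
y = 3.0 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.3, size=300)

rf = RandomForestRegressor(n_estimators=500, max_features=1/3,        # m roughly p/3
                           min_samples_leaf=5, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)            # an out-of-bag R^2 summary for the regression forest

y_class = (y > np.median(y)).astype(int)
rf_c = RandomForestClassifier(n_estimators=500, max_features="sqrt",  # m roughly sqrt(p)
                              min_samples_leaf=1, oob_score=True, random_state=0)
rf_c.fit(X, y_class)
print(1.0 - rf_c.oob_score_)    # an out-of-bag misclassification rate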
There is a fair amount of confusing discussion in the literature about the impossibility of a random forest "overfitting" with increasing $B$. This seems to be related to test error not initially-decreasing-but-then-increasing-in-$B$ (which is perhaps loosely related to OOB$\left(B\right)$ converging to a positive value associated with the limiting predictor $\hat{f}_{\text{rf}}$ (and not showing such behavior) and/or 0 training error rate for a random forest classifier not implying overfit31). But as HTF point out on their page 596, it is an entirely different question as to whether $\hat{f}_{\text{rf}}$ itself is "too complex" to be adequately supported by the training data, $T$. (And the whole discussion seems very odd in light of the fact that for any finite $B$, a different choice of bootstrap samples would produce a different $\hat{f}_{B}$ as a new randomized approximation to $\hat{f}_{\text{rf}}$. Even for fixed $x$, the value $\hat{f}_{B}\left(x\right)$ is a random variable. Only $\hat{f}_{\text{rf}}\left(x\right)$ is fixed.) The fact that the out-of-bag error will increase if optimal allowable tree complexity (encoded in $n_{\min}$ and tree depth) and/or optimal $m$ are exceeded means that a random forest $\hat{f}_{\text{rf}}\left(x\right)$ can indeed overfit (be too complex for the real information content of the training set).
31 For many predictors/classifiers a 0 training error does suggest over-fit, but not necessarily for random forest classifiers.
There is also a fair amount of confusing discussion in the literature about the role of the random selection of the $m$ predictors to use at each node-splitting (and the choice of $m$) in reducing "correlation between trees in the forest." The Breiman/Cutler web site http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm says that the "forest error rate" (presumably the error rate for $\hat{f}_{\text{rf}}$) depends upon "the correlation between any two trees in the forest" and the "strength of each tree in the forest." The meaning of "correlation" and "strength" is not clear if anything technical/precise is intended. One possibility for the first is some version of correlation between values of $\hat{f}^{1}\left(x\right)$ and $\hat{f}^{2}\left(x\right)$ as one repeatedly selects the whole training set $T$ in iid fashion from $P$ and then makes two bootstrap samples (Section 15.4 of HTF seems to use this meaning).32 A meaning of the second is presumably some measure of average effectiveness of a single $\hat{f}^{b}$. HTF Section 15.4 goes on to suggest that increasing $m$ increases both "correlation" and "strength" of the trees, the first degrading error rate and the second improving it, and that the OOB estimate of error can be used to guide choice of $m$ (usually in a broad range of values that are about equally attractive) if something besides the default is to be used.
32 A second possibility concerns "bootstrap randomization distribution" correlation (for a fixed training set and a fixed $x$) between values $\hat{f}^{1}\left(x\right)$ and $\hat{f}^{2}\left(x\right)$.

10.3 Measuring the Importance of Inputs for Bagged Predictors
An idea of Breiman (phrased originally for random forests, but relevant to any bagged predictor) is this. For every bootstrap sample $T^{b}$ and predictor $\hat{f}^{b}$ based on it, with corresponding remainder (OOB sample) $T-T^{b}$, one can compute a $b$th average error across the OOB sample, say
$$\overline{err}_{b}=\frac{1}{\#\left\{i\mid\text{case }i\text{ is not in the bootstrap sample }b\right\}}\sum_{i\text{ s.t. case }i\text{ is not in the bootstrap sample }b}L\left(\hat{f}^{b}\left(x_{i}\right),y_{i}\right)$$
Then in the OOB sample randomly permute the values of the $j$th coordinate of the input vectors, producing, say, input vectors $\tilde{x}_{i}^{j}$. One can then define
$$\widetilde{err}_{b}^{j}=\frac{1}{\#\left\{i\mid\text{case }i\text{ is not in the bootstrap sample }b\right\}}\sum_{i\text{ s.t. case }i\text{ is not in the bootstrap sample }b}L\left(\hat{f}^{b}\left(\tilde{x}_{i}^{j}\right),y_{i}\right)$$
and take the difference
$$I_{b}^{j}=\widetilde{err}_{b}^{j}-\overline{err}_{b}\qquad(112)$$

as an indicator (for the $b$th bootstrap sample) of the importance of variable $j$ to prediction. These can then be averaged across the $B$ bootstrap samples to produce
$$I^{j}=\frac{1}{B}\sum_{b=1}^{B}I_{b}^{j}\qquad(113)$$
as a variable importance measure for variable $j$, and compared across $j$. (Typically these will be positive and large values are indicative of high variable importance.)
When applied to its specially constructed trees, these ideas produce a variable importance measure for a random forest. It is worth saying that what is made is then something different than what was suggested at the end of Section 9.1.4 for a predictor that is ultimately an average of tree predictors (that could also be employed for the random forest).
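A Python sketch of (112) and (113) for the SEL case, assuming one has kept the individual bootstrap fits and their OOB index sets (all names illustrative):

import numpy as np

def permutation_importance(fits, oob_sets, X, y, j, rng):
    # average over bootstrap samples of (permuted-OOB error) - (OOB error) for input j
    diffs = []
    for fit, oob in zip(fits, oob_sets):
        X_oob, y_oob = X[oob], y[oob]
        err_b = np.mean((fit.predict(X_oob) - y_oob) ** 2)        # err-bar_b
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])              # scramble coordinate j
        err_bj = np.mean((fit.predict(X_perm) - y_oob) ** 2)      # err-tilde_b^j
        diffs.append(err_bj - err_b)                              # display (112)
    return float(np.mean(diffs))                                  # display (113)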

10.3.1 The Boruta Wrapper/Heuristic for Variable Selection


A methodology of Kursa and Rudnicki for identification of all coordinates of an input that have "statistically detectable" variable importance builds on the importance measure $I^{j}$ in display (113), usually derived from random forests.33 It is aimed at judging which $I^{j}$s are "clearly more than noise." To enable this, when $r$ predictors are currently under consideration some (say $s=\max\left(5,r\right)$) additional "shadow" (plausible noise) predictors are considered along with the actual predictors. These shadow predictors are made by randomly permuting entries in columns of the original input matrix for the predictors under consideration. These "should" prove to be of no importance in the prediction of $y$.
Boruta operates in stages in a "backwards elimination" fashion, beginning with consideration of all $p$ original predictors and at a given stage dropping from the set of remaining potentially important variables those that are "clearly no better" than the best shadow variable at the stage. What is done to make decisions about elimination is to consider the set of values $I_{b}^{j}$ defined in display (112) for a given $j$ (newly indexing both those actual predictors still under consideration as $1,\ldots,r$ and those shadow predictors newly generated at the beginning of the stage as $r+1,\ldots,r+s$), and compute both their mean $I^{j}$ (in expression (113)) and their sample standard deviation, call it $S^{j}$. Some kind of rough test of "statistical significance" is then made based on comparison of the scores (possibly accumulated across stages)
$$Z^{j}=\frac{I^{j}}{S^{j}}$$
for real inputs (i.e. for $j=1,\ldots,r$) against
$$\max_{j=r+1,\ldots,r+s}Z^{j}$$
33 Boruta is the name of the mythological Slavic god of the forest.

The elimination process is intended to ultimately drop from consideration all
those predictors whose scores are not clearly bigger than those of (by construc-
tion useless) shadow predictors.
This is, of course, a heuristic and exact details vary with implementation.
But the central idea is above and makes sense. It can be applied to any bagging
context, and variants of it could be applied where one is not bagging, but other
forms of holding out a test set are employed. Typically, the prediction method
used is the random forest, because of its reputation for broad e¤ectiveness and
its independence of scaling of coordinates of the input. But there is nothing
preventing its use with, say, a linear prediction or smoothing methodology.

10.4 Bumping and "Active Set Selection"


Another/different thing one might do with bootstrap versions of a predictor is to "pick-a-winner" based on performance on the training data. This is the "bumping"/stochastic perturbation idea of HTF's Section 8.9. That is, let $\hat{f}^{0}=\hat{f}$ be the predictor computed from the training data, and define
$$\hat{b}=\arg\min_{b=0,1,\ldots,B}\sum_{i=1}^{N}\left(y_{i}-\hat{f}^{b}\left(x_{i}\right)\right)^{2}$$
and take
$$\hat{f}_{\text{bump}}\left(x\right)=\hat{f}^{\hat{b}}\left(x\right)$$
The idea here is that if a few cases in the training data are responsible for making a basically good method of predictor construction perform poorly, eventually a bootstrap sample will miss those cases and produce an effective predictor.
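A compact sketch of bumping under SEL (with shallow scikit-learn trees as an arbitrary base method, and the training-set fit included as candidate $b=0$):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bump(X, y, B=50, max_depth=3, random_state=0):
    # fit the base method to the training set and to B bootstrap samples;
    # keep the single candidate with the smallest training-set SSE
    rng = np.random.default_rng(random_state)
    candidates = [DecisionTreeRegressor(max_depth=max_depth).fit(X, y)]   # b = 0
    N = len(y)
    for _ in range(B):
        idx = rng.integers(0, N, size=N)
        candidates.append(DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx]))
    sse = [np.sum((f.predict(X) - y) ** 2) for f in candidates]
    return candidates[int(np.argmin(sse))]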
Rick (Wen) Zhou in his ISU PhD dissertation made another use of bootstrap-
ping, motivated by a real 2-class classi…cation problem with "covariate shift."
x values in an important test set were mostly unlike input vectors xi available
in a fairly small training set. With relatively little information available in the
training set, highly ‡exible methods like nearest neighbor classi…cation seemed
unlikely to be e¤ective. But a single simple application of a less ‡exible method-
ology (like one based on logistic regression) also seemed unlikely to be e¤ective,
because most test case input vectors were "near" at most "a few" training case
input vectors and extrapolation of some kind was unavoidable.
What Zhou settled on and ultimately found to be relatively e¤ective was to
use (locally de…ned) bootstrap classi…ers based on weighted bootstrap samples,
with weights chosen to depend upon x at which one is classifying. For a test
input vector x 2 <p de…ne weights for training case inputs xi by
2
wi (x) = exp kx xi k
PN
for some appropriate > 0. For w (x) = i=1 wi (x) a single "weighted
bootstrap" sample tailored to the input x can be made by sampling N training
cases iid according to the distribution over i = 1; 2; : : : ; N with probabilities

135
pi (x) = wi (x) =w (x). Upon …tting a simple form of classi…er to B such tailored
samples and using majority voting of those classi…ers, one has a classi…cation
decision for input x. It is one that respects both the likelihood that training
cases close to the input are most relevant to decisions about its likely response
and the need to enforce simplicity on the prediction.

11 "Ensembles" of Predictors
Bagging combines an "ensemble" of predictors consisting of versions of a single
predictor computed from di¤erent bootstrap samples. An alternative might be
to somehow weight together (or otherwise combine) di¤erent predictors (poten-
tially even based on di¤erent models or methods). Here we consider 3 versions
of this basic idea of somehow combining an ensemble of predictors to produce
one better than any element of the ensemble.

11.1 Bayesian Model Averaging for Prediction


One theoretically straightforward way to justify this kind of enterprise is through the Bayes "multiple model" scenario (also used in Section 16.2.2). Suppose that $M$ models $P_{1},P_{2},\ldots,P_{M}$ for $\left(x,y\right)$ are under consideration, the $m$th of which has parameter vector $\theta_{m}$ and corresponding density $p_{m}\left(x,y\mid\theta_{m}\right)$. Then for the $m$th model (repeatedly abusing notation by using $p$ to name many different functions) the training set $T$ has density
$$p_{m}\left(T\mid\theta_{m}\right)=\prod_{i=1}^{N}p_{m}\left(x_{i},y_{i}\mid\theta_{m}\right)$$
We'll suppose here that $\theta_{m}$ is not known and that it has prior density $g_{m}\left(\theta_{m}\right)$ (for the $m$th model) and that a prior probability for model $m$ is
$$\pi\left(m\right)$$
Then a joint distribution for $m,\theta_{m},T,$ and $\left(x,y\right)$ has density
$$p_{m}\left(x,y\mid\theta_{m}\right)p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)\pi\left(m\right)$$
This has a marginal density for $y,x,T$ that is
$$\sum_{m=1}^{M}\pi\left(m\right)\int p_{m}\left(x,y\mid\theta_{m}\right)p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}$$
from which the conditional mean of $y\mid x,T$ is
$$E\left[y\mid x,T\right]=\frac{\sum_{m=1}^{M}\pi\left(m\right)\int\int y\,p_{m}\left(x,y\mid\theta_{m}\right)p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}dy}{\sum_{m=1}^{M}\pi\left(m\right)\int\int p_{m}\left(x,y\mid\theta_{m}\right)p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}dy}$$

Given m (the identity of the "correct" model) the variables T ; m; and (x; y)
have joint density
pm (x; yj m ) pm (T j m ) gm ( m )
for which the conditional mean of yjx; T ; m is, say,
RR
ypm (x; yj m ) pm (T j m ) gm ( m ) d m dy
E [yjx; T ; m] = R R
pm (x; yj m ) pm (T j m ) gm ( m ) d m dy

so that
Z Z
ypm (x; yj m ) pm (T j m ) gm ( m ) d m dy
Z Z
= E [yjx; T ; m] pm (x; yj m ) pm (T j m ) gm ( m ) d m dy

from whence
PM RR
m=1 E [yjx; T ; m] (m) pm (x; yj m ) pm (T j m ) gm ( m ) d m dy
E [yjx; T ] = PM RR
m=1 (m) pm (x; yj m ) pm (T j m ) gm ( m ) d m dy

This is the average of E[yjx; T ; m] with respect to the conditional distribution


(the "posterior" distribution) of mjx; T speci…ed by
RR
(m) pm (x; yj m ) pm (T j m ) gm ( m ) d m dy
(mjx; T ) = PM RR
m=1 (m) pm (x; yj m ) pm (T j m ) gm ( m ) d m dy

That is, optimal SEL prediction of y proceeds by weighting what would be


optimal predictors of y from the M constituent models by the relevant (up-
dated from (m) by the information in x and T about the relevant density
pm (x; yj m )) conditional probabilities of the M components. This is "Bayes
model averaging."
Essentially the same argument pertains in cases where $y$ takes values in $G=\{1,2,\ldots,K\}$ and 0-1 loss is involved. Under the same model as above, $P\left[y=k\mid x,T\right]$ is a $\pi\left(m\mid x,T\right)$-weighted average of $P\left[y=k\mid x,T,m\right]$s appropriate under the $M$ constituent models. (Of course, integrals "$dy$" are sums.) Ultimately, optimal 0-1 loss classifiers then choose for input $x$ (and training set $T$) the class $k$ maximizing this Bayes model average probability.
These developments of Bayes model averaging predictors explicitly involve $x$ in the posterior distribution of $m$ (given $x$ and $T$). This is because if one thinks of a new $x$ and corresponding $y$ as generated by the same mechanism that produces $T$, the observed $x$ is informative about $m$. Another way of modeling and calculating is the following.
One might suppose that the functions of $x$,
$$\mu_{m}\left(x\right)=\frac{\int\int y\,p_{m}\left(x,y\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}dy}{\int\int p_{m}\left(x,y\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}dy}$$
or
$$p_{m}\left(y\mid x\right)=\frac{\int p_{m}\left(x,y\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}}{\sum_{y=1}^{K}\int p_{m}\left(x,y\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}}$$
are objects of interest, but without a necessary connection to a specific new observation $x$, itself informative about $m$ and $\theta_{m}$. (These functions are the conditional means of and densities for $y$ given $x$ under particular models $m$.) Positing a distribution specified by
$$p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)\pi\left(m\right)$$
for $m,\theta_{m},T$ in the multiple model scenario, the posterior distribution for $m$ given $T$ has pmf
$$\pi\left(m\mid T\right)=\frac{\pi\left(m\right)\int p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}}{\sum_{m=1}^{M}\pi\left(m\right)\int p_{m}\left(T\mid\theta_{m}\right)g_{m}\left(\theta_{m}\right)d\theta_{m}}$$
So the posterior mean of $\mu_{m}\left(x\right)$ given $T$ is
$$\sum_{m=1}^{M}\mu_{m}\left(x\right)\pi\left(m\mid T\right)$$
and the posterior mean of $p_{m}\left(y\mid x\right)$ given $T$ is
$$\sum_{m=1}^{M}p_{m}\left(y\mid x\right)\pi\left(m\mid T\right)$$
These differ from the previous "Bayes model averages," but they also represent sensible ensembles of predictors appropriate in the constituent models.

11.2 Stacking: SEL ... and 0-1 Loss


The Bayes model averaging idea is theoretically unimpeachable, but rarely practical. It does, however, raise the question "What might be suggested like this, but with a less Bayesian flavor?" One line of thinking is as follows.
Suppose that $M$ SEL predictors are available (all based on the same training data), $\hat{f}_{1},\hat{f}_{2},\ldots,\hat{f}_{M}$. One might seek a weight vector $w$ for which the predictor
$$\hat{f}\left(x\right)=\sum_{m=1}^{M}w_{m}\hat{f}_{m}\left(x\right)$$
is effective. Why this can improve on any single one of the $\hat{f}_{m}$s is, in some sense, "obvious." The set of possible $w$ (over which one searches for good weights) includes vectors with one entry 1 and all others 0. But to indicate in a concrete setting why this might work, consider a case where $M=2$ and according to the $P^{N}\times P$ joint distribution of $\left(T,\left(x,y\right)\right)$
$$E\left(y-\hat{f}_{1}\left(x\right)\right)=0$$
and
$$E\left(y-\hat{f}_{2}\left(x\right)\right)=0$$
Define
$$\hat{f}=\alpha\hat{f}_{1}+\left(1-\alpha\right)\hat{f}_{2}$$
Then
$$E\left(y-\hat{f}\left(x\right)\right)^{2}=E\left(\alpha\left(y-\hat{f}_{1}\left(x\right)\right)+\left(1-\alpha\right)\left(y-\hat{f}_{2}\left(x\right)\right)\right)^{2}$$
$$=\operatorname{Var}\left(\alpha\left(y-\hat{f}_{1}\left(x\right)\right)+\left(1-\alpha\right)\left(y-\hat{f}_{2}\left(x\right)\right)\right)$$
$$=\left(\alpha,1-\alpha\right)\operatorname{Cov}\left(\begin{array}{c}y-\hat{f}_{1}\left(x\right)\\ y-\hat{f}_{2}\left(x\right)\end{array}\right)\left(\begin{array}{c}\alpha\\ 1-\alpha\end{array}\right)$$
This is a quadratic function of $\alpha$, that (since covariance matrices are non-negative definite) has a minimum. Thus there is a minimizing $\alpha$ that typically (is not 0 or 1 and thus) produces better expected loss than either $\hat{f}_{1}\left(x\right)$ or $\hat{f}_{2}\left(x\right)$.
More generally, again using the $P^{N}\times P$ joint distribution of $\left(T,\left(x,y\right)\right)$, one may consider the random vector $\left(\hat{f}_{1}\left(x\right),\hat{f}_{2}\left(x\right),\ldots,\hat{f}_{M}\left(x\right),y\right)'=\left(\hat{f}',y\right)'$ and let
$$\underset{M\times M}{E\hat{f}\hat{f}'}\quad\text{and}\quad\underset{M\times 1}{Ey\hat{f}}$$
be respectively the matrix of expected products of the predictions and vector of expected products of $y$ and elements of $\hat{f}$. Upon writing out the expected square to be minimized and doing some matrix calculus, it's possible to see that optimal weights are of the form
$$w_{\text{opt}}=\left(E\hat{f}\hat{f}'\right)^{-1}Ey\hat{f}$$
Of course, this isn't usable in practice, as the mean vector and expected cross product matrix are unknown.
One practical possibility is to "pick-a-winning" w on the basis of LOO cross-
validation. That is, for f^mi
the mth predictor …t to the training set with the
ith case removed,

N M
!!2
1 X X
yi w0 + wm f^m
i
(xi )
N i=1 m=1

is a LOOCVE for the predictor


M
X
f^ (x) = w0 + wm f^m (x)
m=1

139
that could be optimized as a function of w = (w0 ; w1 ; : : : ; wM ) to produce
wstack and the "stacked" predictor
M
X
f^ (x) = w0 + stack ^
wm fm (x)
m=1
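A sketch of choosing stacking weights from held-out predictions. For computational convenience this uses K-fold rather than leave-one-out cross-validation and plain least squares for the weights, both simplifications of the LOOCV criterion above; the base methods and all names below are arbitrary illustrations.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

def stack(X, y, base_constructors, n_splits=10, random_state=0):
    # build out-of-fold predictions from each base method, regress y on them to get
    # (w_0, w_1, ..., w_M), then refit the base methods on all of the training data
    N, M = len(y), len(base_constructors)
    Z = np.zeros((N, M))
    for train, test in KFold(n_splits, shuffle=True, random_state=random_state).split(X):
        for m, make in enumerate(base_constructors):
            Z[test, m] = make().fit(X[train], y[train]).predict(X[test])
    combiner = LinearRegression().fit(Z, y)                  # w_0 is the intercept
    full_fits = [make().fit(X, y) for make in base_constructors]
    def f_stack(X_new):
        Z_new = np.column_stack([f.predict(X_new) for f in full_fits])
        return combiner.predict(Z_new)
    return f_stack

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 3))
y = np.sin(5 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.2, size=200)
f_stack = stack(X, y, [lambda: Ridge(alpha=1.0), lambda: DecisionTreeRegressor(max_depth=4)])
print(f_stack(X[:3]))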

An ad hoc version of stacking-type averaging is choice of weight vector w


based on informal consideration of one’s (CV-supported) evaluation of the e¤ec-
tiveness of the individual predictors f^m (x) and the (training set) correlations
between them. (Averaging multiple highly correlated predictors can’t be ex-
pected to be particularly helpful, and individually e¤ective predictors should
get more weight than those that are relatively speaking ine¤ective.)
The application of the general notion of stacking to 0-1 loss classi…cation
has typically been treated on a very informal and ultimately unprincipled basis.
Probably the most common suggestion extant in the machine learning world is to
make classi…cations on a (potentially weighted) "majority vote" of an ensemble
of classi…ers. This is completely unsupported by any sensible theory. In this
regard, see Vardeman and Morris "Majority Voting by Independent Classi…ers
can Increase Error Rates" that appears in The American Statistician in 2013
and their "Reply" to comments on the paper by Baker and others that appeared
in the same journal in 2014.
A principled line of reasoning for the classi…cation case is this. If a 0-1 loss
classi…er is any good, it is an approximation of the optimal form (28). So if it
has an underlying voting function, that voting function must be equivalent to
(must be a monotone transform of) an approximate likelihood ratio. What one is
trying to do is …nd a better approximate likelihood ratio by combining several of
these. It is then sensible to use underlying voting functions for the classi…er (and
the classi…ers themselves in cases where no such voting function is available) as
features input into a tree-based classi…cation methodology (tree-based because
of invariance to monotone transformation of coordinates of inputs and the fact
that constituent voting functions are potentially on completely di¤erent scales,
e.g. in some cases involving approximations for linear functions of P [y = 1jx]
and in others approximations for L (x) directly). Details of sensible cross-
validation to choose parameters of the constituent classi…cation methods and
the …nal tree-based method in this context remain to be considered. But the
basic approach is clear and principled.

11.3 "Generalized Stacking" and "Deep" Structures for


Prediction
Suppose that M predictors f^1 ; f^2 ; : : : ; f^M (all based on the same training data)
are available. We might call them together an "ensemble" of predictors and
hope to make from them a single predictor that is more e¤ective than any of
the constituents. We have just said that for SEL prediction a linear combi-
nation of these (a "stacked" predictor) is one way of making such a predictor.
We have also said that for 0-1 loss classi…cation, combining multiple classi…ers

through a tree-based function of their voting functions seems likely to be gen-
erally practically e¤ective.34 Here we consider the general problem "predictor
combination." The primary contribution it potentially o¤ers is reduction of
model bias by adding ‡exibility not provided by any individual f^m .
One important way to view the stacked SEL predictor
$$w_{0}+\sum_{m=1}^{M}w_{m}\hat{f}_{m}\left(x\right)\qquad(114)$$
is as a linear predictor based on $M$ new "features" that are the values of the ensemble. That suggests applying some standard predictor methodology to a "training set" consisting of $M$ vectors of predictions ... with or without some or all of the original input variables also reused as inputs. The generalization of ordinary stacking is
$$\tilde{f}\left(x\right)=\hat{f}\left(\hat{f}_{1}\left(x\right),\hat{f}_{2}\left(x\right),\ldots,\hat{f}_{M}\left(x\right),x\right)\qquad(115)$$

for some appropriate prediction algorithm f^. As this is more general than
ordinary stacking, it has the potential to be even more e¤ective than a linear
combination of the M predictors could be in SEL problems and is applicable to
other prediction problems.
(Generalized) Stacking is a big deal. From the earliest of the public
predictive analytics contests (the Net‡ix Prize contest run 2006-2009) it has
been common for winning predictions to be made by "end-of-game" merging of
e¤ort by two or more separate teams that in some way combine their separate
predictions. More and more references are made on contest forums to various
strategies for combining basic predictors. Multiple-level versions of the stacking
structure are even discussed.35
While the success of some (?luckiest among a number of?) ad hoc choices
of generalized stacking forms in particular situations is undeniable, principled
choices of forms and parameters for f^ (and indeed f^1 ; f^2 ; : : : ; f^M ) in display
(115) involve both logical subtleties and huge computational demands. As
always, cross-validation (or perhaps its OOB relative in the event that bagging
is involved) is the only sound basis of these choices (and subsequent assessment
of the implications of the choices).
Consider …rst a version of this problem where associated with each f^m and
with the top-level form f^ are grids of possible values of parameters and a (po-
tentially huge) product grid is searched for a best cross-validation error (and
ultimately the optimizing parameter vector is applied to make the pick-the-
winner meta-predictor (115)). For each (vector) element of the product grid,
a cross-validation error is created by holding out folds and …tting f^m s and f^
with the prescribed parameter values on the remainders and testing on the cor-
responding folds. This is a perfectly defensible strategy for choosing a version
of predictor form (115). But notice that exactly as discussed in Section 1.3.7, the "winning" cross-validation error is not an honest indicator of the likely performance of the grid point/predictor ultimately chosen. In order to honestly estimate Err for the prediction methodology employed, one must cross-validate the whole process. In each of $K$ remainders one would need to make grids and cross-validation errors for each grid point and pick a winner to predict on the corresponding fold in order to produce a cross-validation error for the pick-the-winner strategy. This implies a large computational load (especially if repeated cross-validation is done) in order to choose a final version of super-learner for application and assess the effectiveness of the process that produced it.
34 There is, unfortunately, a large and very confused "theoretical" literature on "classifier fusion" mostly built around the ad hoc notion of combination via majority voting.
35 In truth, they are but structured versions of the general form (115)
A second version of this scenario might pertain where ultimately individually-
"optimized" (perhaps by cross-validation across some grid of parameter values
for each m) versions of the f^m s will be combined into a form (115) and choice of
complexity parameters for f^ then made by applying another subsequent "cross-
validation," treating the chosen forms for the f^m as …xed. The only way to
assess the potential performance of this way of predicting is to do it (K times)
on K folds and remainders. That is, within each of K remainders the whole
sequence of choosing parameters for the f^m s and subsequently for the f^ must
be repeated (by making K folds and remainders within each remainder ...
surely leading to di¤erent "best" vectors of parameters for each fold) and applied
to the corresponding fold to …nally get a cross-validation error.
In both of these scenarios, it is clear that computation grows rapidly with
the complexity of constituent predictor forms, the breath of the optimization
desired, and the extent to which repetition of cross-validation is used.
What kind of top-level f^ should be used in predictor form (115) could be
investigated by comparison of cross-validation errors. The linear form (114) is
most common and (at least in its ad hoc application) famously successful. But
there is a very good case to be made that a random forest form has potential to
be at least as e¤ective in this role. Its invariance to scale of its inputs (inherited
from its tree-based heritage) and wide success and reputation as an all-purpose
tool make it a natural candidate.
Neural networks have the kind of "(potentially repeated) composition of mul-
tiple functions of the input vector" character evident in the form (115). That
realization perhaps motivates consideration of versions of generalized stacking
where the ensemble of predictors f^1 ; f^2 ; : : : ; f^M itself has some speci…c kind of
"neural-network-like" structure behind it. Figure 30 is a graphical representa-
tion of what is possible.
It is not at all obvious whether a neural-network-like structure for an ensem-
ble of predictors in generalized stacking is necessarily helpful in practical pre-
diction problems. The folklore in predictive analytics is that ordinary stacking
is most helpful where elements of an ensemble have small correlations. (Obvi-
ously, if they are perfectly correlated no advantage can be gained by "combin-
ing" them.) How that folklore interacts with the current popularity of "deep
learning" methods is unclear. One thing that is clear is that unthinking prolif-
eration of "layers" in development of a predictor where they really add nothing

Figure 30: An L-layer structure for prediction based on x.

to the empirical approximation of an optimal predictor can only exacerbate the


computational problems of cross-validation and facilitate unwitting over…tting.

11.4 Boosting/Successive Approximation


11.4.1 SEL Boosting
A di¤erent line of thinking that leads to the use of weighted linear combinations
of predictors is called boosting. The original classi…cation version of the idea
produces the famous "AdaBoost.M1" method discussed in Section 11.4.4. This
methodology is really just an instance of the basic numerical analysis notion of
successive approximation to …nd a solution to an equation or an optimizer
of a functional.
There is general gradient boosting. But we begin with the SEL special
case, because this version is both particularly easy to understand and explain
and of high practical value. The basic idea is to repeatedly try to improve
an approximator for E[yjx] by successively adding small corrections (based on
modeling current residuals) to current approximators.
SEL boosting begins with some predictor f^0 (x) (like, e.g., f^0 (x) = y). With
an iterate f^m 1 (x) in hand, one …ts some SEL predictor, say e^m (x), to the N
"data pairs" xi ; yi f^m 1 (xi ) consisting of inputs and current residuals.
(Typically, some very simple/crude/non-complex "base predictor" form is used
for e^m .) Then, for some "learning rate" 2 (0; 1), one sets

f^m (x) = f^m 1 (x) + e^m (x)

One iterates on m through some number of iterations, M (possibly chosen by


cross-validation). Commonly quoted choices for are numbers like :01 and the
smaller is , the larger must be M . (Note that could be allowed to depend
upon m, in which case notation like m would be appropriate above.)
SEL boosting successively corrects a current predictor by adding to it some
small fraction of a predictor for its residuals. The value functions as a com-

143
plexity or regularizing parameter, as does M . Small and large M correspond
to large complexity. The boosting notion is di¤erent in spirit from stacking or
model averaging, but like them ends with a linear combination of …tted forms
as a …nal predictor/approximator for E[yjx].
This kind of sequential modi…cation of a predictor is not discussed in or-
dinary regression/linear models courses because if a base predictor is an OLS
predictor for a …xed linear model, corrections to an initial …t based on this same
model …t to residuals will predict that all residuals are 0. In this circumstance
boosting does nothing to change or improve an initial OLS …t.

11.4.2 General "Gradient Boosting"


Now consider approximate empirical optimization (over choice of real-valued
function g) of
EL (g (x) ; y)
through (successive approximation) search for predictor f^ that optimizes
N
X
L f^ (xi ) ; yi = N err (116)
i=1
PN
One begins with some predictor f^0 (x) (like, e.g., f^0 (x) = arg min i=1 L (b
y ; yi )).
b
y
With an iterate f^m 1 (x) in hand, one then might consider how to improve the
PN
current total training set loss i=1 L f^m 1 (xi ) ; yi . Let

@
yeim = L (b
y ; yi ) (117)
@b
y b=f^m
y 1 (xi )

These values are the elements of the negative gradient of total loss with respect
to the current predictions for the training set. Ideally, one would like to correct
f^m 1 (x) in a way that moves each prediction of a training output f^m 1 (xi )
by more or less a common multiple of yeim . To that end, one …ts some SEL
predictor, say e^m (x), to "data pairs" (xi ; yeim ). (As in the special case of
SEL boosting, typically some very simple/crude/non-complex form of "base
predictor" is used for e^m .) Let m > 0 (controlling the "step-size" in modifying
f^m 1 (x)) stand for a multiplier for e^m (x) such that
N
X
L f^m 1 (xi ) + me
^m (xi ) ; yi
i=1

is small (ideally, minimum). (Unless an analytical formula for an optimal m


is obvious, some kind of numerical line search is implicit in the good choice of
m .)
Then, for some "learning rate" 2 (0; 1), one sets

f^m (x) = f^m 1 (x) + me


^m (x) (118)

144
as an approximate "steepest descent" correction. Of course, other criteria
besides SEL (like AEL) could be used in …tting e^m (x) and could be allowed
to change with m.
The development here allows for arbitrary base predictors. But for good rea-
sons (especially the fact that trees are invariant to monotone transformations of
coordinates of x) the functions e^m are often rectangle-based (and even restricted
to single-split-trees in the case of AdaBoost.M1). If a tree-building algorithm
for approximating the values (117) produces a set of non-overlapping rectan-
gles R1 ; R2 ; : : : ; RL that cover the input space, rather than using for e^m (x) in
rectangle Rl some average of the values yeim for training cases with xi 2 Rl , it
makes more sense to use
X
e^m (x) = arg min L f^m 1 (xi ) + c; yi for x 2 Rl (119)
c
i s.t. xi 2Rl

and m = 1 and this is the form typically used in gradient boosting with trees.
Update form (119) relies upon 1-dimensional optimizations of a sum of losses
for training inputs in L tree-generated rectangles. Another way this idea can
be used is with rectangles formed based on values of sub-vectors of x with …nite
numbers of possible values. That is, consider again the context of Section 1.4.2.
For a given choice of D categorical, ordinal, or …nite-discrete coordinates of x
de…ning the sub-vector x, consider using
X
e^m (x) = arg min L f^m 1 (xi ) + c; yi for x with xi = x
c
i s.t. xi =x

and m = 1. This e^m (x) has only a …nite number of possible values, one
corresponding to each of the sets fijxi = xg. Further, in contexts where there
are a number of potential choices of such sets of discrete coordinates of x, the
total losses after update (118) can be compared to choose a good sub-vector x
to use to produce f^m .

SEL We had a …rst look at SEL boosting in Section 11.4.1. To establish that it
2
is a version of gradient boosting, simply suppose now that L (by ; y) = 12 (b
y y) .
Then
@ 1 2
yeim = (b
y yi ) = yi f^m 1 (xi )
@b
y 2 b=fm 1 (xi )
y

and for SEL the general gradient boosting corrections are indeed based on the
prediction of ordinary residuals.

AEL (and Binary Regression Trees) Suppose now that L (b y ; y) = jb


y yj :
Then, beginning from f^0 (x) (say f^0 (x) = median fyi g),

@
yeim = (jb
y yi j) = sign yi f^m 1 (xi )
@b
y b=f^m
y 1 (xi )

145
So the gradient boosting update step is "…t a SEL predictor for 1s coding the
signs of the residuals from the previous iteration." In the event that the base
predictors are regression trees, the e^m (x) in a rectangle will be a median of 1s
coming from signs of residuals for cases with xi in the rectangle (and thus have
value either 1 or 1, constant on the rectangle).

Standard Voting Functions for 2-Class Classi…cation Referring again


to the development in Section 1.5.3, recall that approximation to optimal voting
functions g (x) can produce approximately optimal 2-class classi…ers sign(g (x)).
Then consider h1 (u) = ln (1 + exp ( u)) = ln (2) and the loss

L (g (x) ; y) = h1 (yg (x)) = ln (1 + exp ( yg (x))) = ln (2)

For this situation


0 1
@ 1 @ f^m 1 (xi ) exp ( yi y^i ) A
yeim = (ln (1 + exp ( yi y^)) = ln (2)) =
@b
y b=f^m
y 1 (xi )
ln 2 1 + exp yi f^m 1 (xi )

and corresponding boosting can be expected to produce a voting function ap-


proximating
P [y = 1jx]
g (x) = ln
P [y = 1jx]
For the exponential function h2 (u) = exp ( u) and loss L (g (x) ; y) =
h2 (yg (x)) one has

@
yeim = exp ( yi y^) = yi exp yi f^m 1 (xi )
@b
y b=f^m
y 1 (xi )

and corresponding boosting produces a voting function approximating 12 g (x)


(for g (x) above). (For the choice of base predictors as single-split trees,
gradient boosting would be an approximate version of the famous AdaBoost.M1
algorithm.)
Finally, for the hinge function h3 (u) = (1 u)+ and loss L (g (x) ; y) =
h3 (yg (x)), one gets

@ h i
yeim = (1 yi y^)+ = yi I yi f^m 1 (xi ) < 1
@b
y b=f^m
y 1 (xi )

and corresponding boosting produces a voting function approximating the op-


timal classi…er directly.

K-Class Classi…cation Models We noted in Section 1.3.2Pthat in a K-class


K
classi…cation model, under the cross-entropy loss L (^ y ; y) = k=1 I [y = k] ln (^
yk ),
for non-negative y1 ; y2 ; : : : ; yK summing to 1, predictors

fk (x) = P [y = kjx]

146
are optimal and can be used to produce optimal 0-1 loss classi…ers. Consider
boosting to produce approximations to f1 (x) ; f2 (x) ; : : : ; fK 1 (x). Begin with
PK 1
K 1 positive predictors f^10 (x) ; f^20 (x) ; : : : ; f^(K 1)0 (x) with k=1 f^k0 (x) <
1. (For example, f^k0 (x) = 1=K will serve.) Then for y^1 ; y^2 ; : : : ; y^K 1 positive
with sum less than 1, with
K
!
X1 KX1
L (^
y ; y) = I [y = k] ln (^yk ) I [y = K] ln 1 y^k
k=1 k=1

let (for k = 1; 2; : : : ; K 1)

@
yeikm = L (^
y ; yi )
@b
yk b k =f^m
y 1 (xi )

1 1
= I [yi = k] I [yi = K] PK 1
f^k(m 1) (xi ) 1 k=1 f^k(m 1) (xi )

For each k …t some SEL predictor, say e^km (x), to pairs (xi ; yeikm ) and for an
appropriate m > 0 set

f^km (x) = f^k(m 1) (x) + me


^km (x)

( m will need to be chosen to be small enough that all f^1m (x) ; f^2m (x) ; : : : ; f^(K 1)m (x)
remain positive with sum less than 1.)

11.4.3 Some Issues Related to Boosting Practice


Here we consider several issues that arise in the use of boosting. These mostly
concern control of complexity of predictors in boosting.
Where trees are used to create the functions e^m , there is the question of how
large they should be allowed to grow. The answer seems to be "Not too large,
maybe to about 6 or so terminal nodes." Another (probably better) approach
to this question would seem to be to grow large trees and then employ cost-
complexity pruning, ultimately using cross-validation to choose a value for the
weight (or = 1= ).
There is always question of the number of boosting steps, M , that should
be employed. This can/should be limited in size (very large values surely
producing over…t). Holding back a part of the training sample and watching
performance of a predictor on that single test set as M increases is a possible
crude method of choosing M . Presumably, cross-validation provides a more
reliable means of directing choice of M:
"Shrinkage" also impacts …nal boosted predictor complexity. In choosing
2 (0; 1) for use in update (118) one chooses a multiplier of e^m (x) strictly less
than one that minimizes the updated total loss. That is,. one doesn’t make
the "full correction" to f^m 1 in producing f^m . The smaller is this parameter
the larger will be M needed for good predictor performance. One might well
choose both and M via cross-validation.

147
"Subsampling" or "stochastic boosting" is the practice of at each iteration
of boosting, instead of choosing an update based on the whole training set,
choosing a fraction of the training set at random and …tting to it (using a
new random selection at each update). This reduces computation time per
iteration and can also improve predictor performance (primarily by reducing
over…t?). Once more, cross-validation can inform the choice of .
A very popular implementation of gradient boosting goes by the name "XGBoost"
(for "eXtreme Gradient Boosting"). This is an R package (with similar imple-
mentations in other systems) that provides a lot of ‡exibility and code that is
very fast to run (even providing parallelization where hardware supports it).
The caret package can be used to do cross-validation based on XGBoost, allow-
ing one to tune on a number of algorithm complexity parameters.

11.4.4 AdaBoost.M1
Consider a 2-class 0-1 loss classi…cation problem with 1=1 coding of output y
(y takes values in G = f 1; 1g). The AdaBoost.M1 algorithm is an exact variant
of the (approximate) gradient boosting algorithm, but is usually described in
other terms. We describe those terms next, and then make the connection to
general boosting.
The standard/original description of the AdaBoost.M1 algorithm is as fol-
lows.

1. Initialize weights on training data (xi ; yi ) at


1
wi1 for i = 1; 2; : : : ; N
N

2. Fit a G-valued "stump" (single-split tree/single cut on a single coordinate


of x) predictor/classi…er g1 to the training data to optimize
N
X
I [yi 6= g (xi )]
i=1

let
N
1 X
err1 = I [yi 6= g1 (xi )]
N i=1
and de…ne
1 err1
1 = ln
err1

3. Set new weights on the training data


1
wi2 = exp ( 1I [yi 6= g1 (xi )]) for i = 1; 2; : : : ; N
N
(This up-weights misclassi…ed observations by a factor of (1 err1 ) =err1 ).)

148
4. For m = 2; 3; : : : ; M

(a) Fit a G-valued stump predictor/classi…er gm to the training data to


optimize
XN
wim I [yi 6= g (xi )]
i=1

(b) Let
PN
i=1 wim I [yi 6= gm (xi )]
errm = PN
i=1 wim

(c) Set
1 errm
m = ln
errm
(d) Update weights as

wi(m+1) = wim exp ( mI [yi 6= gm (xi )])


1 errm
= wim I [yi = gm (xi )] + I [yi 6= gm (xi )]
errm

for i = 1; 2; : : : ; N . (This up-weights misclassi…ed observations by a


factor of (1 errm ) =errm ).)

5. Output a voting function


M
X
m gm (x)
m=1

(based on "weighted voting" by the classi…ers gm ) for an AdaBoost.M1


classi…er !
M
X
^
fM (x) = sign m gm (x)
m=1

(Classi…ers gm with small errm get big positive weights in the …nal "vot-
ing.")

Figure 31 is a graphic of a small (N = 16) fake p = 2 dataset and (single


line) boundaries of M = 7 successive "stumps" used to develop an AdaBoost.M1
classi…er with 0 training error rate. (Arrows point in the direction of y = +1
decisions.) Corresponding classi…ers are portrayed in Figure 32.

149
Figure 31: M = 7 consecutive AdaBoost.M1 cuts for a small fake data set.

AdaBoost.M1 as an Instance of General Boosting The AdaBoost.M1


algorithm is equivalent to an instance of general boosting for a voting function,
based on the exponential loss function h2 (v) exp ( v) of Section 1.5.3. The
argument is as follows. Take g1 = 12 f^1 for f^1 as in the traditional description
of AdaBoost.M1 to serve as an initial voting function to be improved through a
series of boosting steps. Suppose that iterate gm 1 is in hand and one desires
to improve (reduce) the total training loss
N
X
exp ( yi gm 1 (xi ))
i=1

by an update of voting function gm 1 to

gm (x) = gm 1 (x) + me
^m (x) (120)

where e^m is an appropriate stump classi…er ("a single split tree" classi…er) and
m is (without loss of generality) a positive constant.

150
Figure 32: Classi…ers corresponding to the voting functions from the cuts indi-
cated in Figure 31.

The total training loss associated with the iterate (120) is


N
X
exp [ yi (gm 1 (xi ) + me
^m (xi ))]
i=1
N
X
= exp ( yi gm 1 (xi )) exp ( yi me
^m (xi ))
i=1
X X
= exp ( yi gm 1 (xi )) exp ( m) + exp ( yi gm 1 (xi )) exp ( m)
i s.t. i s.t.
yi 6=e^m (xi ) yi =^em (xi )
X
= (exp ( m) exp ( m )) exp ( yi gm 1 (xi ))
i s.t.
yi 6=e^m (xi )
X
+ exp ( m) exp ( yi gm 1 (xi ))
i

So, whatever be the positive value of m , e^m (x) should be chosen to minimize
the 0-1 loss error rate for a single cut classi…er for cases weighted proportional
to values exp ( yi gm 1 (xi )).
Consider then choice of m . The derivative of the total training loss with
respect to m is
X X
exp ( yi gm 1 (xi )) exp ( m ) exp ( yi gm 1 (xi )) exp ( m )
i s.t. i s.t.
yi 6=e^m (xi ) yi =^em (xi )

This is 0 when
P
i s.t. exp ( yi gm 1 (xi ))
yi =^em (xi )
exp (2 m) =P
i s.t. exp ( yi gm 1 (xi ))
yi 6=e^m (xi )

151
That is, an optimal m is
0P 1
i s.t. exp ( yi gm 1 (xi ))
1 @ yi =^em (xi ) A
m = ln P
2 i s.t. exp ( yi gm 1 (xi ))
yi 6=e^m (xi )
1 1 rm
= ln
2 rm
for P
i s.t. exp ( yi gm 1 (xi ))
yi 6=e^m (xi )
rm = PN
i=1 exp ( yi gm 1 (xi ))
which is the 0-1 loss error rate for the classi…er e^m where weights on points in
a training set are proportional to exp ( yi gm 1 (xi )).
Notice then that the ratios of the weights at stages m 1 and m satisfy

exp ( yi gm (xi ))
exp ( yi gm 1 (xi ))
exp ( yi (gm 1 (xi ) + m e^m (xi )))
=
exp ( yi gm 1 (xi ))
= exp ( yi m e^m (xi ))
1 1 rm 1 1 rm
= exp ln I [^
em (xi ) = yi ] + exp ln I [^
em (xi ) 6= yi ]
2 rm 2 rm
1=2 1=2
1 rm 1 rm
= I [^
em (xi ) = yi ] + I [^
em (xi ) 6= yi ]
rm rm
1=2
1 rm 1 rm
= I [^
em (xi ) = yi ] + I [^
em (xi ) 6= yi ]
rm rm
Since rm doesn’t depend upon i, looking across i this is proportional to a ra-
tio of 1 for cases with I [^ em (xi ) = yi ] and ratio (1 rm ) =rm for cases with
I [^
em (xi ) = yi ]. That is (recalling the meaning of rm ) the ratios of weights for
a given case in this development are completely equivalent to those produced by
the updating prescribed in 4(d) of the standard description of AdaBoost.M1.
Ultimately then, all of this taken together establishes that this ("exact"
as opposed to "gradient") boosting development produces an mth iterate of
a voting function exactly half of that produced through m iterations of the
standard development of AdaBoost.M1. Since the factor of 21 is irrelevant
to the sign of the voting function, the corresponding classi…er is exactly the
AdaBoost.M1 classi…er.

11.5 Quinlan’s Cubist and "Divide and Conquer" Strate-


gies
There is a line of algorithms associated with Ross Quinlan (including "Cubist"
and "C5.0," the former being a SEL prediction methodology and the latter a

152
classi…er). His company web site is https://www.rulequest.com/index.html.
His algorithms are very complicated, and complete descriptions do not seem to
be publicly available. (Though there are open source versions of some of his
algorithms, much of his work seems to be proprietary and commercial versions
of his software are no doubt more reliable than the open source versions.) Text-
book descriptions of his methods are generally vague. Probably the best ones
I know of are in the KJ book.
The basic notion of Cubist seems to be to cut up an input space, <p , into
rectangles and …t a (di¤erent) linear predictor for y in each rectangle. Consider
the rectangle

R = fxj a1 < x1 < b1 ; a2 < x2 < b2 ; : : : ; ap < xp < bp g

where aj and bj can be …nite or in…nite. Where at least one of aj or bj is …nite,


a split on the input space has been made on coordinate j. Jargon typically
used in describing these methods is that if one lists only rectangles where one
or both of the aj or bj are …nite, one has speci…ed a "rule."
There are many implementation choices (the consequences of which are not
transparent) that (much as with MARS) amount to a kind of "special sauce"
owned by Quinlan and/or others who have followed him. Vague expositions
of Cubist leave most users to treat SEL prediction based on it as a mysterious
(albeit often e¤ective) "black box."
Here are a few observations based on available information on Cubist for
SEL prediction.

1. Trees of regressions (not trees with constant predictions in each …nal


rectangle) seem to be at the heart of the methodology, both generating
rectangles and making predictions. The "error" used to guide node split-
ting seems to be
X Nl p
err M SEl
N
l

where l indexes rectangles, Nl is the number of cases with xi 2 Rl and


M SEl is presumably from an OLS …t (of some linear model) in Rl .

2. Exactly what inputs xj are used in each rectangle and how they are chosen
is not clear. Output for an R implementation of Cubist lists di¤erent sets
for the various rectangles.
3. Exactly how one goes from tree building to the …nal set of rules/rectangles
is not clear. Software seems to not allow control of this. Perhaps there
is some kind of combining of …nal rectangles from a tree.
4. Some sort of "smoothing" is involved. This seems to be some kind of
averaging of regressions for bigger (containing) rectangles "up the tree
branch" from a …nal rectangle. What this should mean is not absolutely
clear if all one has is a set of "rules," particularly if there are cuts less

153
extreme than a …nal pair de…ning Rl that have been eliminated from
description of Rl . For example

3 < x1 < 5

is
3 < x1 < 10 and 3 < x1 < 5
Further, the form of weights used in the averaging seems completely ad
hoc.

There are two serious modi…cations of the basic "tree of regressions" notion
that are included in the R implementation of Cubist:

1. One may employ "committees." This seems to be boosting or some-


thing much like it applied using the basic algorithm to create the correc-
tions to successive versions of an approximate E[yjx].
2. One may employ "instances." This seems to be (optionally) applied after
the boosting. It is shrinkage of y^s in light of yi s for k-nearest neighbors,
using weights depending upon the distances from x to the neighbors. For
k cases closest to x, say cases i1 ; i2 ; : : : ; ik and corresponding weights
w1 ; w2 ; : : : ; wk (summing to 1?) the prediction used for input x seems
from KJ to be36 X
y^ (x) + wl (yil y^il )

A valuable general perspective that consideration of Quinlan’s speci…c meth-


ods brings up might be called a divide and conquer strategy. In prediction
problems where p is at all large, it is rare that one can …nd a simple form for
a predictor that is e¤ective across the entirety of an input space. One way to
think about Quinlan’s methods is as breaking an input space up into appropriate
rectangles (de…ned by a tree structure) and then using primarily a (relatively
simple) linear prediction form inside each rectangle. Of course, "the devil is
in the details" of …nding appropriate means of partitioning an input space and
then simple forms to use in each piece of the space, but the general notion of
solving several "local" prediction problems rather than a larger "global" one
is clearly one that will on occasion be very e¤ective. For, example, in a case
where a few (say l) coordinates of an input vector x are binary, it may make
more sense to separately …t 2l predictors (one for each possible binary vector)
using the p l non-binary inputs instead of trying to …t a single predictor using
the entire p-dimensional input.

3 6 This is a guess at a "correction" of a formula on KJ page 210 that seems incorrect.

154
Part III
Intermission: Perspective and
Prediction in Practice
There is more to say about theory and speci…c methodology for statistical ma-
chine learning, but this is a sensible point at which to pause and re‡ect on the
practice of prediction in "big data" contexts. Most of the best-known prediction
methods have been discussed (the notable exception being linear classi…cation
methods and especially so-called support vector machines covered in the next
chapter) and the basic concerns to be faced have been raised. The careful
reader has what is needed in terms of statistical background to begin work
on a large prediction problem. So here we provide a bit of summary discus-
sion/perspective on beginning practice. (The material in the balance of these
notes can be studied in parallel with practice on a large real problem. It is my
belief that such practice grappling with the realities of prediction is essential to
genuine understanding of modern statistical machine learning.)
The graphic in Figure 33 is intended to provide some conceptualization of
what must be done to make predictions and honest judgments of how well they
are likely to work. The graphic is meant to indicate that a project proceeds
more or less left to right through it, but that actual practice is far too iterative
and ‡exible to be adequately represented by a ‡owchart.

Figure 33: Elements of e¤ective "big data" prediction

One must …rst assemble a training set from whatever sources are appropriate.
Consistent with the "divide and conquer" discussion at the end of Section 11.5,
this training set could represent only a well-de…ned part of a large input space
and multiple graphics like Figure 33 in parallel would then in order. Note
that if a breakup of the input space depends upon the data cases available
(as in Quinlan’s methodologies, where rectangles used depend upon the set
of input vectors considered) that activity is best conceptualized as happening
inside the big cross-validation box, perhaps before several parallel versions of
what is presently Figure 33. The point is that the initial development of the
training set is the conceptual base upon which all else is built and (at least if
one is hoping to have reliable cross-validation results) a "random draws from a
…xed universe" model must be a plausible description of both the elements of

155
the training set and additional "test" cases that are to be predicted.
Figure 33 puts "feature engineering" and "predictor …tting" activities inside
a large single activity box. These are typically spoken of as if they were distinct,
but they are largely indistinguishable/inseparable. (This is usually emphasized
quite strongly by fans of neural network prediction, where the process of devel-
oping weights for linear combinations deep in the compositional structure of the
predictor is often spoken of in terms of "learning good features" for prediction.)
The cross-validation box in Figure 33 encloses all but the assembling of the
training set. This is a reminder of the basic principle that all that will ultimately
be done to make a predictor must be done K times (on the K remainders) in
order to create a reliable assessment of the likely e¤ectiveness of a prediction
methodology. Various "tuning" or "optimizing" steps based on some "cross-
validation error" or "OOB error" measures may be employed in the …tting of
a single one of multiple predictors in an ensemble, but only the kind of com-
prehensive "complete redoing" suggested by placing everything except training
set assembly inside the largest activity box will be adequate as an indication of
likely performance on new test cases.
Ultimately, producing good predictors in big real-world problems is a highly
creative and interesting pursuit. What is presented in these notes amounts to
a set of principles and building blocks that can be assembled in myriad ways.
The fun is in …nding clever problem-speci…c ways to do the assembly that prove
to be practically e¤ective.

Part IV
Supervised Learning II: More on
Classi…cation and Additional
Theory
12 Basic Linear (and a Bit on Quadratic) Meth-
ods of Classi…cation
Consider now methods of producing prediction/classi…cation b
n orules f (x) taking
values in G = f1; 2; : : : ; Kg that have sets x 2 <p jfb(x) = k with boundaries
that are (mostly) de…ned by linear equalities

x0 =c (121)

The most obvious/naive


h i potential method here is to regress K indicator vari-
ables yk = I fb(x) = k onto x (producing least squares regression vector coef-

156
…cients b k ) and then to employ

fb(x) = arg maxfbk (x) = arg maxx0 b k


k k

But this often fails miserably because of the possibility of "masking" if K > 2.
One must be smarter than this. Three kinds of smarter alternatives are Linear
(and Quadratic) Discriminant Analysis, Logistic Regression, and direct searches
for separating hyperplanes. The …rst two of these are "statistical" in origin with
long histories in the …eld.

12.1 Linear (and a bit on Quadratic) Discriminant Analy-


sis
Suppose that for (x; y) P , k = P [y = k] and the conditional distribution of
x on <p given that y = k is MVNp ( k ; ), i.e. the conditional pdf is

p=2 1=2 1 0 1
p (xjk) = (2 ) (det ) exp (x k) (x k)
2
Then it follows that
P [y = kjx] k 1 0 1 1 0 1 1
ln = ln k k + l l + x0 ( k l) (122)
P [y = ljx] l 2 2
so that a theoretically optimal classi…er/decision rule is
1 0 1 1
f (x) = arg max ln ( k) k k + x0 k
k 2
and boundaries between regions in <p where f (x) = k and f (x) = l are subsets
of the sets
1 k 1 1 1 1
x 2 <p j x0 ( k l) = ln + 0
k k
0
l l
l 2 2
i.e. are de…ned by equalities of the form (121). Figure 34 illustrates this in a
simple K = 3 and p = 2 context where all k s are the same.
This is dependent upon all K conditional normal distributions having the
same covariance matrix, . In the event these are allowed to vary, condi-
tional distribution k with covariance matrix k , a theoretically optimal predic-
tor/decision rule is
1 1 0 1
f (x) = arg max ln ( k) ln (det k) (x k) k (x k)
k 2 2
and boundaries between regions in <p where f (x) = k and f (x) = l are subsets
of the sets
0 1 0 1
fx 2 <p j 12 (x k) k (x k)
1
2 (x l) l (x l) =
1 1
ln k
l 2 ln (det k) + 2 ln (det l )g

157
Figure 34: Contours of K = 3 bivariate normal pdfs and corresponding linear
(equal class probability) classi…cation boundaries.

Unless k = l this kind of set is a quadratic surface in <p , not a hyperplane.


One gets (not linear, but) Quadratic Discriminant Analysis.
Of course, in order to use LDA or QDA, one must estimate the vectors k
and the covariance matrix or matrices k from the training data. Estimating
K potentially di¤erent matrices k requires estimation of a very large number of
parameters. So thinking about QDA versus LDA, one is again in the situation
of needing to …nd the level of predictor complexity that a given dataset will
support. QDA is a more ‡exible/complex method than LDA, but using it in
preference to LDA increases the likelihood of over…t and poor prediction.
One idea that has been o¤ered as a kind of continuous compromise between
LDA and QDA is for 2 (0; 1) to use

bk ( ) = b k + (1 ) b p o oled

in place of b k in QDA. This kind of thinking even suggests as an estimate of


a covariance matrix common across k
b ( ) = b p o oled + (1 ) c2 I

for 2 (0; 1) and c2 an estimate of variance pooled across groups k and then
across coordinates xj of x in LDA. Combining these two ideas, one might even
invent a two-parameter set of …tted covariance matrices

bk ( ; ) = b k + (1 ) b p o oled + (1 ) c2 I

for use in QDA. Employing these in LDA or QDA provides the ‡exibility
of choosing a complexity parameter or parameters and potentially improving
prediction performance.
The form x0 is (of course and by design) linear in the coordinates of x. An
obvious natural generalization of this discussion is to consider discriminants that

158
are linear in some (non-linear) functions of the coordinates of x. This is simply
choosing some M basis functions/transforms/features hm (x) and replacing the
p coordinates of x with the M coordinates of (h1 (x) ; h2 (x) ; : : : ; hM (x)) in the
development of LDA.
Of course, upon choosing basis functions that are all coordinates, squares
of coordinates, and products of coordinates of x, one produces linear (in the
basis functions) discriminants that are general quadratic functions of x. The
possibilities opened here are myriad and (as always) "the devil is in the details."

12.1.1 Dimension Reduction in LDA


Where p is large, a common methodology in LDA is forward selection of coor-
dinates xj of x to use in classi…cation. Cross-validation can be used to choose
a best number of coordinates and potentially achieve some dimension-reduction
and reduce over…tting.
Another idea in the direction of simplifying the interpretation of LDA by
dimension-reduction is use of "canonical coordinates" and intends to replace
"variable selection" with use of a hopefully few relevant linear combinations of
coordinates (producing "reduced rank LDA"). Let
K
1 X
= k
K
k=1

and note that one is free to replace x and all K means k with respectively
1=2 1=2
x = (x ) and k = ( k )
This produces
P [y = kjx ] k 1 2 1 2
ln = ln kx kk + kx lk
P [y = ljx ] l 2 2
and (in "sphered" form) the theoretically optimal classi…er can be described as
1 2
f (x) = arg max ln ( k) kx kk
k 2
That is, in terms of x , optimal decisions are based on ordinary Euclidian
distances to the transformed means k . Further, this form can often be made
even simpler/be seen to depend upon a lower-dimensional (than p) distance.
The k typically span a subspace of <p of dimension min (p; K 1). For
M =( 1; 2; : : : ; K)
p K

let P M be the p p projection matrix projecting onto the column space of M


in <p (C (M )). Then
2 2
kx kk = k[P M + (I P M )] (x k )k
2
= k(P M x k ) + (I PM)x k
2 2
= kP M x k k + k(I PM)x k

159
the last equality coming because (P M x k ) 2 C (M ) and (I PM)x 2
? 2
C (M ) . Since k(I P M ) x k doesn’t depend upon k, the theoretically
optimal predictor/decision rule can be described as

1 2
f (x) = arg max ln ( k) kP M x kk
k 2

and theoretically optimal decision rules can be described in terms of the projec-
tion of x onto C (M ) and its distances to the k .
Now,
1
MM0
K
is the (typically rank min (p; K 1)) sample covariance matrix of the k and
has an eigen decomposition as
1
M M 0 = V DV 0
K
for
D = diag (d1 ; d2 ; : : : ; dp )
where
d1 d2 dp
are the eigenvalues and the columns of V are orthonormal eigenvectors corre-
sponding in order to the successively smaller eigenvalues of K 1
M M 0 . These
v k with dk > 0 specify linear combinations of the coordinates of the l ,
hv k ; l i, with the largest possible sample variances subject to the constraints
that kvk = 1 and hv l ; v k i = 0 for all l < k. These v k are perpendicular vectors
in successive directions of most important unaccounted-for spread of the k .
Then, for l rank M M 0 de…ne

V l = (v 1 ; v 2 ; : : : ; v l )

let
P l = V l V 0l
be the matrix projecting onto C (V l ) in <p : A possible "reduced rank" approx-
imation to the theoretically optimal LDA classi…cation rule is

1 2
fl (x) = arg max ln ( k) kP l x Pl kk
k 2

and l becomes a complexity parameter that one might optimize via cross-
validation to tune or regularize the method.
Note also that for w 2 <p
l
X
P lw = hv k ; wi v k
k=1

160
For purposes of graphical representation of what is going on in these computa-
tions, one might replace the p coordinates of x and the means k with the l
coordinates of
0
(hv 1 ; x i ; hv 2 ; x i ; : : : ; hv l ; x i) (123)
and of the
0
(hv 1 ; k i ; hv 2 ; k i ; : : : ; hv l ; k i) (124)
(that might be called "canonical coordinates"). It seems to be ordered pairs of
entries of these vectors that are plotted by HTF in their Figures 4.8 and 4.11.
In this regard, we need to point out that since any eigenvector v k could be
replaced by v k without any fundamental e¤ect in the above development, the
vector (123) and all of the vectors (124) could be altered by multiplication of
any particular set of coordinates by 1. (Whether a particular algorithm for
…nding eigenvectors produces v k or v k is not fundamental, and there seems
to be no standard convention in this regard.) It appears that the pictures in
HTF might have been made using the R function lda and its choice of signs for
eigenvectors.

12.2 Logistic Regression


A generalization of the MVN conditional distribution result (122) is an assump-
tion that for all k < K
P [y = kjx]
ln = k0 + x0 k (125)
P [y = Kjx]

Here there are K 1 constants k0 and K 1 p-vectors k to be speci…ed,


not necessarily tied to class mean vectors or a common within-class covariance
matrix for x: In fact, the set of relationships (125) do not fully specify a joint
distribution for (x; y). Rather, they only specify the nature of the conditional
distributions of yjx. (In this regard, the situation is exactly analogous to that
in ordinary simple linear regression. A bivariate normal distribution for (x; y)
gets one normal conditional distributions for y with a constant variance and
mean linear in x. But one may make those assumptions conditionally on x,
without assuming anything about the marginal distribution of x, that in the
bivariate normal model is univariate normal.)
Using as shorthand for a vector containing all the constants k0 and the
vectors k , the linear log probability ratio assumption (125) produces the forms

exp ( k0 + x0 k)
P [y = kjx] = pk (x; ) = PK 1 (126)
1+ k=1 exp ( k0 + x0 k)

for k < K, and


1
P [y = Kjx] = pK (x; ) = PK 1
(127)
1+ k=1 exp ( k0 + x0 k)

161
and a theoretically optimal (under 0-1 loss) predictor/classi…cation rule is

f (x) = arg max pk (x; )


k

As a bit of an aside, it is perhaps useful to see in forms (126) and (127) use
of the softmax function with linear combinations of the coordinates of x and
be reminded of the neural network discussion of Section 8.2. In that regard,
consider an extremely simple neural network for classi…cation having no hidden
layers and all coe¢ cients for the last output node set to 0. That is, with
no hidden layers, if in the notation of Section 8.3.1 the last column of A0 by
assumption contains only 0s (A0K = 0), the corresponding "neural network" for
classi…cation is exactly the K-class logistic regression model.
Figure 35 is a plot of three di¤erent p = 1 forms for p1 (x; 0 ; 1 ) in a K = 2
model. The parameter sets are

Red: 0 = 0; 1 = 1;
Blue: 0 = 4; 1 = 2; and
Green: 0 = 2; 1 = 2

In each case p1 (x; 0 ; 1 ) = :5 where x = 0 = 1 , the function increases in x


exactly when 1 > 0, and curve steepness increases with j 1 j.

Figure 35: Plot of three di¤erent p = 1 forms for p1 (x; 0; 1) in a K = 2 model.

In a K = 2 case with p = 2, (for its f1; 2g coding of y) the kind of relationship


pictured in Figure 36 holds. p1 (x; 0 ; 1 ; 2 ) de…nes an "s-shaped surface" that
is "steep" when coe¢ cients 1 ; 2 have large absolute values, is constant on lines
2
0 + 1 x1 + 2 x2 = c in < , taking the value :5 on the line 0 + 1 x1 + 2 x2 = 0.
Assumption (125) generalizes the "mixture of MVNs" assumption of LDA,
and standard methods of …tting the corresponding parameters based on training
data are necessarily fundamentally di¤erent. That is (using maximum likeli-
hood) in LDA, the K probabilities k , the K means k , and the covariance
matrix might be chosen to maximize the likelihood
N
Y
yi p xi j yi ;
i=1

162
Figure 36: A plot of a p = 2 form for p1 (x; 0; 1) in a K = 2 model (with 1-2
coding).

This is a mixture model and the complete likelihood is involved, i.e. a joint
density for the N pairs (xi ; yi ). On the other hand, standard logistic regression
methodology maximizes
YN
pyi (xi ; ) (128)
i=1

over choices of . This is not a full likelihood, but rather one conditional on
the xi observed.
In a K = 2 case with 1-1 coding for y, the logistic regression log-likelihood
has a very simple form. With
exp ( 0 + x0 ) 1
p 1 (x; 0; )= and p1 (x; 0; )=
1 + exp ( 0 + x0 ) 1 + exp ( 0 + x0 )

the likelihood term contributed to the product (128) by (xi ; yi ) is

I [yi = 1] I [yi = 1]
exp ( 0 + x0i ) 1
1 + exp ( 0 + x0i ) 1 + exp ( 0 + x0i )

It then follows that the contribution of (xi ; yi ) to the log-likelihood is

I [yi = 1] ( 0 + x0i ) ln (1 + exp ( 0 + x0i ))


= ln (1 + exp (yi ( 0 + x0i )))

(Note that this term is (ln 2) h1 (yi ( 0 + x0i )) for h1 the …rst of the function
"losses" considered in Section 1.5.3 in the discussion of voting functions in 2-
class classi…cation.) So ultimately, the K = 2 log-likelihood (to be optimized
in ML …tting) is
XN
ln (1 + exp (yi ( 0 + x0i )))
i=1

163
(which is ln 2 times the total loss in the gradient boosting algorithm applied
to voting function g (x) = 0 + x0i ).
A general alternative to maximum likelihood (useful in avoiding over…tting
for large N ) is minimization of a criterion like
N
!
Y
ln pyi (xi ; ) + penalty ( )
i=1

For example, in the K = 2 case (with 1-1 coding) a lasso version is (for >0
and 0 1) minimization of
0 1
N
X p
X p
(1 )X
ln (1 + exp (yi ( 0 + x0i ))) + @ j jj + 2A
j
i=1 j=1
2 j=1

(that can be accomplished in R using glmnet).


It is common to encounter situations where (say in a K = 2 context with
0-1 coding) 0 is quite small. Rather than trying to do analysis on a random
sample of (x; y) pairs where there would be relatively few y = 0 cases, there
are a number of potentially important practical reasons for doing analysis of a
dataset consisting of random sample of N0 instances ("cases") with y = 0 and a
random sample of N1 instances ("controls") with y = 1, where N0 = (N0 + N1 )
is nowhere nearly as small as 0 .37 (In fact, N1 on the order of 5 or 6 times
N0 is often recommended.)
For K = 2
P [y = 0jx] 0 p (xj0) 0 p (xj0)
ln = ln = ln + ln
P [y = 1jx] 1 p (xj1) 1 p (xj1)
So under the logistic regression assumption that
P [y = 0jx]
ln = 0 + x0
P [y = 1jx]
…tting to a case-control dataset should produce

^cc + x0 ^cc N0 p (xj0)


0 ln + ln
N1 p (xj1)
P [y = 0jx] N0 0
= ln + ln ln
P [y = 1jx] N1 1

So (presuming that an estimate ^0 is available) estimated coe¢ cients

^0 ^cc N0 ^0 cc
0 ln + ln and ^ = ^
N1 1 ^0
3 7 Notice that this methodology purposely creates a situation like that described in Sec-

tion 1.5.1, where training set class relative frequencies are much di¤erent from actual class
probabilities.

164
Figure 37: An example of a quadratic form ( :2x21 :3x22 ) used to make logistic
regression probabilities that y = 1 (for 1-2 coding)

are appropriate for the original context. (This result is a specialization of the
general formula (29) for shifting conditional probabilities for yjx based on use
of a training set with class frequencies di¤erent from the k s.)
Good logistic regression models are the basis of good classi…ers when one
classi…es according to the largest predicted probability. And just as the useful-
ness of LDA can be extended by consideration of transforms/features made from
an original p-dimensional x, the same is true for logistic regression. For ex-
ample, beginning with x1 and x2 and creating additional predictors x21 ; x22 ; and
x1 x2 , one can use logistic regression technology based on the 5-dimensional in-
put x1 ; x2 ; x21 ; x22 ; x1 x2 to create classi…cation boundaries that are quadratic
in terms of the original x1 and x2 . An example of the kind of functional form
for the conditional probability that y = k given a bivariate input x that can
result is portrayed in Figure 37 where the quadratic form :2x21 :3x22 is used
to make logistic regression probabilities that y = 1 (for 1-2 coding). Constant-
probability contours of such a surface are ellipses in (x1 ; x2 )-space.

12.3 Separating Hyperplanes


In the K = 2 group case now use the G = f 1; 1g coding. If there is a 2 <p
and real number 0 such that in the training data
y = 1 exactly when x0 + 0 >0
a "separating hyperplane"
fx 2 <p j x0 + 0 = 0g
can be found via logistic regression. The (conditional) likelihood will not have
a maximum, but if one follows a search path far enough toward the limiting

165
value of 0 for the loglikelihood or 1 for the likelihood, satisfactory 2 <p and
0 from an iteration of the search algorithm will produce separation.
A famous older algorithm for …nding a separating hyperplane is the so-called
"perceptron" algorithm. It can be de…ned as follows. From some starting
points 0 and 00 cycle through the training data cases in order (repeatedly as
needed). At any iteration l, take
n o yi = 1 and x0i + > 0, or
l l 1 l l 1 0
= and = if
0 0 yi = 1 and x0i + 0 0
l l 1
= + yi xi
l l 1 otherwise
and 0 = 0 + yi

This will eventually identify a separating hyperplane when a series of N itera-


tions fails to change the values of and 0 .
If there is a separating hyperplane, it will typically not be unique. One
can attempt to de…ne and search for "optimal" such hyperplanes that, e.g.,
maximize distance from the plane to the closest training vector. The material
on "support vector classi…ers" in Section 13.1 is a famous development in this
direction.

13 Support Vector Machines


Consider a 2-class classi…cation problem. For notational convenience, we’ll
suppose that output y takes values in G = f 1; 1g. Our present concern is in a
further development of linear classi…cation methodology beyond that provided
in Section 12.
For 2 <p and 0 2 < we’ll consider the voting function

g (x) = x0 + 0 (129)

and a theoretical predictor/classi…er

f (x) = sign (g (x)) (130)

We will approach the problem of choosing and 0 to in some sense provide a


maximal cushion around a hyperplane separating between xi with corresponding
yi = 1 and xi with corresponding yi = 1:

13.1 The Linearly Separable Case: Maximum Margin Clas-


si…ers
In the case that there is a classi…er of form (130) with 0 training error rate, we
consider the optimization problem

maximize M subject to yi (x0i u + 0) M 8i (131)


u with kuk = 1
and 0 2 <

166
This can be thought of in terms of choosing a unit vector u (or direction) in <p so
that upon projecting the training input vectors xi onto the subspace of multiples
of u there is maximum separation between the convex hull of projections of the
xi with yi = 1 and the convex hull of projections of xi with corresponding
yi = 1. (The sign on u is chosen to give the latter larger x0i u than the former.)
If u and 0 solve this maximization problem the (maximum) margin is then
0 1

1BB
C
C
M= B min x0 u max x0i uC
2 @ xi with i xi with A
yi = 1 yi = 1

and the constant that makes the voting function (129) take the value 0 is
0 1

1BB
C
C
0 = B min x0i u + max x0i uC
2 @ xi with xi with A
yi = 1 yi = 1

The geometry of this formalism in a small p = 2 case is illustrated in Figure 38.

Figure 38: The geometry of maximum margin classi…cation for a small p = 2


example.

For purposes of applying standard optimization theory and software, it is


useful to reformulate the basic problem (131) several ways. First, note that
optimization problem (131) may be rewritten as

u 0
maximize M subject to yi x0i + 1 8i (132)
u with kuk = 1 M M
and 0 2 <

Then if we let
u
=
M

167
it’s the case that
1 1
k k= or M =
M k k
so that problem (132) can be rewritten
1 2
minimize k k subject to yi (x0i + 0) 1 8i (133)
2< p 2
and 0 2 <

This formulation (133) is that of a convex (quadratic criterion, linear inequality


constraints) optimization problem for which there exists standard theory and
algorithms.
The so-called primal functional corresponding to problem (133) is (for 2
<N )
N
X
1 2
FP ( ; 0; ) k k i (yi (x0i + 0) 1) for 0
2 i=1

To solve problem (133), one may for each 0 choose ( ( ) ; 0 ( )) to


minimize FP ( ; ; ) and then choose 0 to maximize FP ( ( ) ; 0 ( ) ; ).
The Karush-Kuhn-Tucker conditions are necessary and su¢ cient for solution
of a constrained optimization problem. In the present context they are the
gradient conditions
XN
@FP ( ; 0 ; )
= i yi = 0 (134)
@ 0 i=1

and
N
X
@FP ( ; 0; )
= i yi xi =0 (135)
@ i=1

the feasibility conditions

yi (x0i + 0) 1 0 8i (136)
the non-negativity conditions

0 (137)
and the orthogonality conditions

i (yi (x0i + 0) 1) = 0 8i (138)

Now relationships (134) and (135) are respectively


N
X N
X
i yi = 0 and = i yi xi ( ) (139)
i=1 i=1

168
and plugging these into FP ( ; 0; ) gives a function of only
N
X
1 2
FD ( ) k ( )k i (yi x0i ( ) 1)
2 i=1
1 XX 0
XX
0
X
= i j yi yj xi xj i j yi yj xi xj + i
2 i j i j i
X 1 XX 0
= i i j yi yj xi xj
i
2 i j
1
= 10 0
H
2
for
H = (yi yj x0i xj ) (140)
N N

Then the "dual" problem for problem (133) is the N -dimensional optimization
problem
1
maximize 10 0
H subject to 0 and 0
y=0 (141)
2<N 2
and apparently this problem is easily solved.
Now condition (138) implies that if iopt > 0

yi x0i opt
+ 0
opt
=1

so that

1. by condition (136) the corresponding xi has minimum x0i ( opt ) for train-
ing vectors with yi = 1 or maximum x0i ( opt ) for training vectors with
yi = 1 (so that xi is a support vector for the "slab" of thickness 2M
around a separating hyperplane),
opt
2. 0 ( ) may be determined using the corresponding xi from
opt
yi 0 =1 yi x0i opt
i.e. 0
opt
= yi x0i opt

(apparently for reasons of numerical stability it is common practice to


average values yi x0i ( opt ) for support vectors in order to evaluate
opt
0( )), and
3.
0 10
N
X
1 = yi 0
opt
+ yi @ opt
j yj xj
A xi
j=1
N
X
opt opt 0
= yi 0 + j yj yi xj xi
j=1

169
PN
The fact (139) that ( ) i=1 i yi xi implies that only the training
cases with i > 0 (typically corresponding to a relatively few support vectors)
determine the nature of the solution to this optimization problem. Further, for
SV the set indices of support vectors in the problem,
2 X X opt opt
opt 0
= i j yi yj xi xj
i2SV j2SV
X opt
X opt 0
= i j yi yj xj xi
i2SV j2SV
X opt opt
= i 1 yi 0
i2SV
X opt
= i
i2SV

the next to last of these equalities following from 3. above, and the last following
from the gradient condition (134). Then the margin for this problem is simply
1 1
M= opt )k
= qP (142)
k ( opt
i2SV i

13.2 The Linearly Non-separable Case: Support Vector


Classi…ers
In a linearly non-separable case, the convex optimization problem (133) does
not have a solution (no pair 2 <p and 0 2 < provides yi (x0i + 0 ) 1 8i).
We might, therefore (in looking for good choices of 2 <p and 0 2 <) try to
relax the constraints of the problem slightly. That is, suppose that i 0 for
i = 1; 2; : : : ; N and consider the set of constraints

yi (x0i + 0) + i 1 8i

(the i are called "slack" variables and provide some "wiggle room" in search
for a hyperplane that "nearly" separates the two classes with a good margin).
We might try to control the total amount of slack allowed by setting a bound
N
X
i C
i=1

for some positive C (a "budget").


Note that if yi (x0i + 0 ) 0, case i is correctly classi…ed in the training
set, and so if for some pair 2 <p and 0 2 < this holds for all i, we have
a separable problem. So any non-separable problem must have at least one
negative yi (x0i + 0 ) for any 2 <p and 0 2 < pair. This in turn requires
that the budget C must be at least 1 for a non-separable problem to have
a solution even with the addition of slack variables. In fact, this reasoning

170
implies that a budget C allows for at most C misclassi…cations in the training
set. And in a non-separable case, C must be allowed to be large enough so that
some choice of 2 <p and 0 2 < produces a classi…er with training error rate
no larger than C=N .
In any event, we consider the optimization problem
1 2 yi (x0i + 0 ) + i 1 8i
minimize k k subject to PN
2 <p 2 for some i 0 with i=1 i C
and 0 2 <
(143)
that can be thought of as generalizing the problem (133). Problem (143) is
equivalent to
yi (x0i u + 0) M (1 ) 8i
maximize M subject to PN i
u with kuk = 1 for some i 0 with i=1 i C
and 0 2 <
generalizing the original problem (131). In this latter formulation, the i rep-
resent fractions (of the margin) that a corresponding xi is allowed to be on the
"wrong side" of its cushion around the classi…cation boundary. i > 1 indicates
that not only does xi violate its cushion around the surface in <p de…ned by
x0 u + 0 = 0 but that the classi…er misclassi…es that case.
The ideas and notation of this development are illustrated in Figure 39 for
a small p = 2 problem.

Figure 39: A toy p = 2 example illustrating the notation used in non-separable


support vector optimization problem statements.

A more convenient version of form (143) is


XN
1 2 yi (x0i + 0 ) + i 1 8i
minimize k k +C i subject to
2 <p 2 for some i 0
i=1
and 0 2 <
(144)

171
A nice development on pages 376-378 of Izenman’s book provides the follow-
ing solution to this problem (144) parallel to the development in Section 13.1.
Generalizing problem (141) is the dual problem
1
maximize 10 0
H subject to 0 C 1 and 0
y=0 (145)
2<N 2
for
H = (yi yj x0i xj ) (146)
N N

The constraint 0 C 1 is known as a "box constraint" and the "feasible


region" prescribed in form (145) is the intersection of a hyperplane de…ned by
0
y = 0 and a "box" in the positive orthant. The C = 1 version of this
reduces to the "hard margin" separable case.
Upon solving problem (145) for opt , the optimal 2 <p is of the form
X opt
opt
= i yi xi (147)
i2SV

for SV the set of indices of support vectors xi which have iopt > 0. The
points with 0 < iopt < C will lie on the edge of the margin (have i = 0) and
the ones with iopt = C have i > 0. Any of the support vectors on the edge
of the margin (with 0 < iopt < C ) may be used to solve for 0 2 < as
opt
0 = yi x0i opt
(148)

and again, apparently for reasons of numerical stability it is common practice


to average values yi x0i ( opt ) for such support vectors in order to evaluate
opt
0( ). And here (as in the "hard margin"/no slack case) the margin is
related to the coe¢ cients as in display (142).
In this process the constant C functions as a regularization/complexity
parameter and large C in form (144) corresponds to small C in form (143).
Identi…cation of a classi…er requires only solution of the dual problem (145) and
then evaluation of the right hand sides of formulas (147) and (148) to produce
linear form (129) and classi…er (130). Figure 40 illustrates two di¤erent support
vector classi…ers for a small p = 2 problem.
Even when a problem is linearly separable, there may be good reason to use
the present formulation with C < 1 (and a correspondingly larger margin and
more support vectors). Small C (large C) corresponds to "low complexity"
in choice of a classi…er and there are many support vectors contributing to the
ultimate form of the classi…er. This makes the exact form of the classi…er less
sensitive to a few key data cases than for large C . (If the problem were SEL
prediction rather than classi…cation, small C would be the "low variance/high
bias" case.) Cross-validation can be used in practice to choose an appropriate
value for C .

172
Figure 40: Two support vector classi…ers for a small p = 2 problem.

13.3 SV Classi…ers and Kernels: Support Vector Machines


The form (129) is (of course and by design) linear in the coordinates of x. A
natural generalization of this development would be to consider forms that are
linear in some (non-linear) functions of the coordinates of x. There is nothing
really new or special to SV classi…ers associated with this possibility if it is
applied by simply de…ning some basis functions hm (x) and considering form
0 10
h1 (x)
B h2 (x) C
B C
g (x) = B .. C + 0
@ . A
hM (x)
for use as a voting function in a classi…er sign(g (x)). However, the fact that in
both linearly separable and linearly non-separable cases, optimal SV classi…ers
depend upon the training input vectors xi only through their inner products
(see again displays (140) and (146)) and experience with computing abstract
inner produces in function spaces using kernel values suggests another way in
which one might employ linear forms of nonlinear functions in classi…cation.

13.3.1 Heuristics
Let K be a non-negative de…nite kernel and consider the possibility of using
functions K (x; x1 ) ; K (x; x2 ) ; : : : ; K (x; xN ) to build new (N -dimensional data-
dependent) feature vectors
0 1
K (x; x1 )
B K (x; x2 ) C
B C
k (x) = B .. C
@ . A
K (x; xN )

173
for any input vector x (including the xi in the training set) and rather than
de…ning inner products for new feature vectors (for input vectors x and z) in
terms of <N inner products
N
X
0
hk (x) ; k (z)i = k (x) k (z) = K (x; xk ) K (z; xk )
k=1

instead consider using the abstract space inner products of corresponding func-
tions
hK (x; ) ; K (z; )iA = K (x; z)
Then, in place of de…nition (140) or (146) de…ne

H = (yi yj K (xi ; xj )) (149)


N N

opt
and let solve either problem (141) or (145). With
N
X
opt opt
= i yi k (xi )
i=1

as in the developments of the previous sections, we replace the <N inner product
of ( opt ) and a feature vector k (x) with
*N + N
X opt X opt
i yi K (xi ; ) ; K (x; ) = i yi hK (xi ; ) ; K (x; )iA
i=1 i=1
A
N
X opt
= i yi K (x; xi )
i=1

Then for any i for which iopt > 0 (an index corresponding to a support feature
vector in this context) we set
N
X
opt opt
0 = yi j yj K (xi ; xj )
j=1

and have an empirical analogue of voting function (129) (for the kernel case)
N
X opt opt
g^ (x) = i yi K (x; xi ) + 0 (150)
i=1

with corresponding classi…er

f^ (x) = sign (^
g (x)) (151)

as an empirical analogue of classi…er (130). It remains to argue that this


classi…er (developed completely heuristically) has any kind of rational basis.

174
13.3.2 A Penalized-Fitting Function-Space Optimization Argument
The heuristic argument for the use of kernels in the SVM context to produce
form (150) and classi…er (151) is clever enough that some authors simply let
it stand on its own as "justi…cation" for using "the kernel trick" of replacing
<N inner products of feature vectors with A inner products of basis functions.
Far more satisfying arguments can be made. One is based on an appeal to
optimality/regularization considerations provided in a 2002 Machine Learning
paper of Lin, Wahba, Zhang, and Lee.
Consider A, an abstract function space38 associated with the non-negative
de…nite kernel K, and the penalized …tting optimization problem involving the
"hinge loss" from Section 1.5.3,
N
X 1 2
minimize (1 yi ( 0 + g (xi )))+ + kgkA (152)
g2A i=1
2
and 0 2 <
Dividing the whole optimization criterion in display (152) (hinge loss plus con-
stant times squared A norm) by N , we see that an empirical version of the
expected hinge loss is involved, and can on the basis of the exposition in Sec-
tion 1.5.3 hope that an element g of A and value 0 will be identi…ed in the
minimization so that 0 + g (x) is close to the voting function for the optimal
0-1 loss classi…er and controls 0-1 loss error rate.
Further, recalling the form (143), the quantity (1 yi (x0i + 0 ))+ is the
fraction of the margin (M ) that input xi violates its cushion around the classi-
…cation boundary hyperplane. (Points on the "right" side of their cushion don’t
get penalized at all. Ones with (1 yi (x0i + 0 ))+ = 1 are on the classi…ca-
tion boundary. Ones with (1 yi (x0i + 0 ))+ > 1 are points misclassi…ed by
the voting function.) The average of such terms is an average fraction (of the
margin) violation of the cushion and the optimization seeks to control this, and
so the loss really is related to the SV classi…cation ideas.
Then, exactly as will be noted in Section 15, an optimizing $g \in \mathcal{A}$ above must be of the form
\[
g(x) = \sum_{j=1}^N \beta_j K(x, x_j) = \beta' k(x)
\]
so the minimization problem is
\[
\underset{\beta \in \Re^N \text{ and } \beta_0 \in \Re}{\text{minimize}} \; \sum_{i=1}^N \left(1 - y_i\left(\beta_0 + \beta' k(x_i)\right)\right)_+ + \frac{\lambda}{2}\left\| \sum_{j=1}^N \beta_j K(\cdot, x_j) \right\|_{\mathcal{A}}^2
\]
$^{38}$To be technically precise, we are talking here about the "Reproducing Kernel Hilbert Space" (RKHS) related to K. This is an abstract function space $\mathcal{A}$ consisting of all linear combinations of slices of the kernel, $K(x,\cdot)$, and limits of such linear combinations.
that is,
\[
\underset{\beta \in \Re^N \text{ and } \beta_0 \in \Re}{\text{minimize}} \; \sum_{i=1}^N \left(1 - y_i\left(\beta_0 + \beta' k(x_i)\right)\right)_+ + \frac{\lambda}{2}\beta' K \beta
\]
for
\[
K = \left(K(x_i, x_j)\right)_{N \times N}
\]
the Gram matrix first defined in display (21).


Now this is equivalent to the optimization problem
\[
\underset{\beta \in \Re^N \text{ and } \beta_0 \in \Re}{\text{minimize}} \; \sum_{i=1}^N \xi_i + \frac{\lambda}{2}\beta' K \beta \quad \text{subject to} \quad y_i\left(\beta' k(x_i) + \beta_0\right) + \xi_i \ge 1 \;\forall i \text{ for some } \xi_i \ge 0 \tag{153}
\]
which for $H = (y_i y_j K(x_i, x_j))_{N \times N}$ as in (149) has dual problem of the form
\[
\underset{\alpha \in \Re^N}{\text{maximize}} \; 1'\alpha - \frac{1}{2\lambda}\alpha' H \alpha \quad \text{subject to} \quad 0 \le \alpha \le 1 \text{ and } \alpha' y = 0 \tag{154}
\]
or
\[
\underset{\alpha \in \Re^N}{\text{maximize}} \; 1'\alpha - \frac{1}{2\lambda^2}\alpha' H \alpha \quad \text{subject to} \quad 0 \le \alpha \le \lambda 1 \text{ and } \alpha' y = 0 \tag{155}
\]

That is, the function space optimization problem (152) has a dual that is the same as that of problem (145) for the choice of $C = \lambda$ and kernel $\frac{1}{\lambda^2}K(x,z)$ produced by the heuristic argument in Section 13.3.1. Then, if $\alpha^{opt}$ is a solution to (154), Lin et al. say that an optimal $\beta \in \Re^N$ is
\[
\frac{1}{\lambda}\,\mathrm{diag}(y_1, \ldots, y_N)\, \alpha^{opt}
\]
this producing coefficients to be applied to the functions $K(\cdot, x_i)$. On the other hand, the heuristic of Section 13.3.1 prescribes that for $\tilde{\alpha}^{opt}$ the solution to problem (155), coefficients in the vector
\[
\mathrm{diag}(y_1, \ldots, y_N)\, \tilde{\alpha}^{opt}
\]
get applied to the functions $\frac{1}{\lambda^2}K(x_i, \cdot)$. Upon recognizing that $\alpha^{opt} = \frac{1}{\lambda}\tilde{\alpha}^{opt}$, it becomes evident that for the choice of $C = \lambda$ and kernel $\frac{1}{\lambda^2}K$, the heuristic in Section (13.3.1) produces a solution to the optimization problem (152).$^{39}$
3 9 Di¤erently put, the "kernel trick" of Section (13.3.1) applied to kernel K with cost para-

meter C solves the present optimization problem applied to kernel (C )2 K with weighting
= C in the problem (152).

13.3.3 A Function-Space-Support-Vector-Classifier Geometry Argument
A different line of argument produces a SVM in a way that connects it to the geometry of support vector classification in $\Re^p$. The basic idea is to recognize that one is mapping input feature vectors to an abstract function space $\mathcal{A}$ via the mapping
\[
T(x)(\cdot) = K(x, \cdot)
\]
and that everything subsequent to this mapping can be done fully honoring the linear space structure. That is, the translation of the support vector classifier argument should be in reference to the geometry of $\mathcal{A}$. What one is really defining is a classifier with inputs in $\mathcal{A}$. "Linear classification" in $\mathcal{A}$ is the analogue of support vector classification in $\Re^p$ if one starts from a geometric motivation like that of the support vector classifier development. One seeks a unit vector (now in $\mathcal{A}$) and a constant so that inner products of the (transformed) data case inputs with the unit vector plus the constant, when multiplied by the $y_i$, maximize a margin subject to some relaxed constraints.
All this is writable in terms of $\mathcal{A}$. That is, one wishes to
\[
\underset{U \in \mathcal{A} \text{ with } \|U\|_{\mathcal{A}} = 1,\; \beta_0 \in \Re}{\text{maximize}} \; M \quad \text{subject to} \quad y_i\left(\langle T(x_i), U\rangle_{\mathcal{A}} + \beta_0\right) \ge M(1 - \xi_i) \;\forall i \text{ for some } \xi_i \ge 0 \text{ with } \sum_{i=1}^N \xi_i \le C
\]
This is equivalent to the problem
\[
\underset{V \in \mathcal{A},\; \beta_0 \in \Re}{\text{minimize}} \; \frac{1}{2}\|V\|_{\mathcal{A}}^2 \quad \text{subject to} \quad y_i\left(\langle T(x_i), V\rangle_{\mathcal{A}} + \beta_0\right) \ge 1 - \xi_i \;\forall i \text{ for some } \xi_i \ge 0 \text{ with } \sum_{i=1}^N \xi_i \le C
\]
Then either because optimization over all of $\mathcal{A}$ looks too hard, or because some "Representer Theorem" says that it is enough to do so, one might back off from optimization over $\mathcal{A}$ to optimization over the subspace of $\mathcal{A}$ spanned by the set of N elements $T(x_i)$. Then writing
\[
V = \sum_{i=1}^N \gamma_i T(x_i)
\]
so that
\[
\frac{1}{2}\|V\|_{\mathcal{A}}^2 = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \gamma_i \gamma_j \langle T(x_i), T(x_j)\rangle_{\mathcal{A}} = \frac{1}{2}\gamma' K \gamma
\]
(again, K is the Gram matrix) the optimization problem becomes
\[
\underset{\gamma \in \Re^N,\; \beta_0 \in \Re}{\text{minimize}} \; \frac{1}{2}\gamma' K \gamma \quad \text{subject to} \quad y_i\left(\gamma' K_i + \beta_0\right) \ge 1 - \xi_i \;\forall i \text{ for some } \xi_i \ge 0 \text{ with } \sum_{i=1}^N \xi_i \le C
\]
where $K_i$ is the ith column of the Gram matrix. For $\gamma^{opt}$ and $\beta_0^{opt}$ solutions to the optimization problem and
\[
V^{opt} = \sum_{i=1}^N \gamma_i^{opt} T(x_i)
\]
the voting function for the linear classifier in $\mathcal{A}$ is (for argument $W \in \mathcal{A}$)
\[
\left\langle W, V^{opt} \right\rangle_{\mathcal{A}} + \beta_0^{opt}
\]
The corresponding voting function for the derived non-linear classifier on $\Re^p$ is
\[
\left\langle T(x), V^{opt} \right\rangle_{\mathcal{A}} + \beta_0^{opt} = \sum_{i=1}^N \gamma_i^{opt} K(x, x_i) + \beta_0^{opt}
\]

and one has something very similar to the heuristic application of the "kernel trick." The question is whether it is exactly equivalent to the use of "the trick." The problem solved by $\gamma^{opt}$ and $\beta_0^{opt}$ is equivalent for some $\lambda^* \ge 0$ to
\[
\underset{\gamma \in \Re^N,\; \beta_0 \in \Re}{\text{minimize}} \; \sum_{i=1}^N \xi_i + \frac{\lambda^*}{2}\gamma' K \gamma \quad \text{subject to} \quad y_i\left(\gamma' K_i + \beta_0\right) \ge 1 - \xi_i \;\forall i \text{ for some } \xi_i \ge 0 \tag{156}
\]
Comparison of display (156) to display (153) and consideration of the argument following statement (153) then shows that there is a choice of C for which, when using kernel $(1/C)^2 K$, the heuristic/"kernel trick" method produces a solution to the present function-space-support-vector-classifier problem. This is the same circumstance as in the penalized fitting function space optimization argument.$^{40}$

13.3.4 Some Perspective on SVMs


The "kernelizing" of the support vector classi…er methodology produces a wide
variety of possible classi…ers that can be tuned (via cross-validation) over choice
of kernel (and any parameters it might have) and C or C. As a toy example of
what can result from the technology, consider the situation portrayed in Figure
2
41 with voting functions based on kernels K (x; z) = exp (x z) .
In view of the development here and in Section 1.5.3, what is pictured are voting functions that are approximations to the optimal 0-1 loss classifier as linear combinations of the N = 20 radial basis functions $\exp(-\lambda(x - x_i)^2)$ plus a constant. The $\lambda = 100$ pictures are understandably more wiggly than the $\lambda = 10$ pictures because of the smaller "bandwidth" of the former basis
4 0 The "kernel trick" of Section (13.3.1) applied to kernel K with cost parameter C solves

the present geometric optimization problem applied to kernel (C )2 K with cost parameter
C .

Figure 41: 4 SVM voting functions for a small p = 1 example with N = 20 cases. Red bars on the rug correspond to y = 1 cases and blue bars correspond to y = -1 cases. Shown are voting functions based on kernels $K(x,z) = \exp(-\lambda(x-z)^2)$. The black bars pointing down indicate support "vectors."

functions. The C = 1000 pictures are closer to being the "hard margin"
situation and have fewer training case errors in evidence.
Remember in all this, that SVMs built on a kernel K will choose voting functions that are linear combinations of the functions $K(x_i, \cdot)$, slices of the kernel at training case inputs. That fact controls what "shapes" are possible for those voting functions. (In this regard, note that the kernel defined by the ordinary Euclidean inner product, $K(x,z) = \langle x, z\rangle$, produces linear voting functions and thus linear decision boundaries in $\Re^p$ and the special case of ordinary support vector classifiers. It is sometimes called the "linear kernel.") Finally, it is important to keep in mind that to the extent that SVMs produce good voting functions, those must be equivalent to approximate likelihood ratios. The discussion of Section 1.5.1 still stands.

13.4 Other Support Vector Methods


Several other issues related to the kind of arguments used in the development
of SV classi…ers are discussed in HTF (and Izenman). One is the matter of
multi-class problems. That is, where G = f1; 2; : : : ; Kg how might one employ
machinery of this kind? There are both heuristic and optimality-based methods
in the literature.

A heuristic "one-versus-all" (OVA) strategy might be the following. Invent 2-class problems (K of them), the kth based on
\[
y_{ki} = \begin{cases} 1 & \text{if } y_i = k \\ -1 & \text{otherwise} \end{cases}
\]
Then for (a single) C and k = 1, 2, ..., K solve the (possibly linearly non-separable) 2-class optimization problems to produce functions $\hat{g}_k(x)$ (that would lead to one-versus-all classifiers $\hat{f}_k(x) = \mathrm{sign}(\hat{g}_k(x))$). A possible overall (OVA) classifier is then
\[
\hat{f}(x) = \arg\max_{k \in G} \hat{g}_k(x)
\]

A second heuristic strategy is to develop a voting scheme based on pairwise comparisons. That is, one might invent $\binom{K}{2}$ problems of classifying class l versus class m for l < m, choose a single C and solve the (possibly linearly non-separable) 2-class optimization problems to produce voting functions $\hat{g}_{lm}(x)$ and corresponding classifiers $\hat{f}_{lm}(x) = \mathrm{sign}(\hat{g}_{lm}(x))$. For m > l define $\hat{f}_{ml}(x) = -\hat{f}_{lm}(x)$ and define an overall "one-versus-one" (OVO) classifier by
\[
\hat{f}(x) = \arg\max_{k \in G} \left( \sum_{m \ne k} \hat{f}_{km}(x) \right)
\]
or, equivalently
\[
\hat{f}(x) = \arg\max_{k \in G} \left( \sum_{m \ne k} I\left[\hat{f}_{km}(x) = 1\right] \right)
\]
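As a concrete illustration of these two heuristics, the sketch below (Python; not from the notes) assembles OVA and OVO classifiers from a user-supplied routine fit_binary(X, y) that is assumed to return a fitted 2-class voting function; any SVM (or other) 2-class fitter could play that role.

```python
import numpy as np
from itertools import combinations

def ova_classifier(X, y, classes, fit_binary):
    # one-versus-all: K voting functions, classify to the class whose g_hat is largest
    voters = {k: fit_binary(X, np.where(y == k, 1, -1)) for k in classes}
    return lambda x: max(classes, key=lambda k: voters[k](x))

def ovo_classifier(X, y, classes, fit_binary):
    # one-versus-one: a voting function for each pair (l, m), l < m, fit on only those cases
    voters = {}
    for l, m in combinations(classes, 2):
        keep = (y == l) | (y == m)
        voters[(l, m)] = fit_binary(X[keep], np.where(y[keep] == l, 1, -1))
    def classify(x):
        # f_lm(x) = sign(g_lm(x)) votes for l when positive, for m otherwise
        wins = {k: 0 for k in classes}
        for (l, m), g in voters.items():
            wins[l if g(x) > 0 else m] += 1
        return max(classes, key=lambda k: wins[k])
    return classify
```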

In addition to these fairly ad hoc methods of extending 2-class SVM tech-


nology to K-class problems, there are developments that directly address the
problem (from an overall optimization point of view). Pages 391-397 of Izen-
man provide a nice summary of a 2004 paper of Lee, Lin, and Wahba in this
direction.
Another type of question related to the support vector material is the extent to which similar methods might be relevant in regression-type prediction problems. As a matter of fact, there are loss functions alternative to squared error or absolute error that lead naturally to the use of the kind of technology needed to produce the SV classifiers. That is, one might consider so-called "$\epsilon$-insensitive" losses for prediction like
\[
L_1^{\epsilon}(\hat{y}, y) = \max\left(0, |y - \hat{y}| - \epsilon\right)
\]
or
\[
L_2^{\epsilon}(\hat{y}, y) = \max\left(0, (y - \hat{y})^2 - \epsilon\right)
\]
and be led to the kind of optimization methods employed in the SVM classification context. See Izenman pages 398-401 in this regard.

14 Prototype and (More on) Nearest Neighbor Methods of Classification
We saw when looking at "linear" methods of classification in Section 12 that these can reduce to classification to the class with fitted mean "closest" in some appropriate sense to an input vector x. A related notion is to represent classes each by several "prototype" vectors of inputs, and to classify to the class with closest prototype. In this section we have these and related nearest neighbor classifiers in view.
So consider a K-class classification problem (where y takes values in G = {1, 2, ..., K}) and suppose that the coordinates of input x have been standardized according to training means and standard deviations.
For each class k = 1, 2, ..., K, represent the class by prototypes
\[
z_{k1}, z_{k2}, \ldots, z_{kR}
\]
belonging to $\Re^p$ and consider a classifier/predictor of the form
\[
f(x) = \arg\min_k \min_l \|x - z_{kl}\|
\]
(that is, one classifies to the class that has a prototype closest to x).
The most obvious question in using such a rule is "How does one choose the
prototypes?" One standard (admittedly ad hoc, but not unreasonable) method
is to use the so-called "K-means (clustering) algorithm" (see Section 17.2.1) one
class at a time. (The "K" in the name of this algorithm has nothing to do with
the number of classes in the present context. In fact, here the "K" naming
the clustering algorithm is our present R, the number of prototypes used per
class. And the point in applying the algorithm is not so much to see exactly
how training vectors aggregate into "homogeneous" groups/clusters as it is to
find a few vectors to represent them.)
For $T_k = \{x_i \text{ with corresponding } y_i = k\}$ an "R-means" algorithm might proceed by

1. randomly selecting R different elements from $T_k$, say
\[
z_{k1}^{(1)}, z_{k2}^{(1)}, \ldots, z_{kR}^{(1)}
\]
2. then for m = 2, 3, ... letting
\[
z_{kl}^{(m)} = \text{the mean of all } x_i \in T_k \text{ with } \left\|x_i - z_{kl}^{(m-1)}\right\| < \left\|x_i - z_{kl'}^{(m-1)}\right\| \text{ for all } l' \ne l
\]
iterating until convergence.
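A minimal computational sketch of this per-class "R-means" prototype construction (Python/numpy; the function and variable names are illustrative, not from the notes) is the following.

```python
import numpy as np

def class_prototypes(X_k, R, n_iter=100, rng=None):
    # X_k: rows are the (standardized) training inputs x_i belonging to class k
    rng = np.random.default_rng(rng)
    # step 1: R distinct randomly chosen members of T_k as starting prototypes
    centers = X_k[rng.choice(len(X_k), size=R, replace=False)]
    for _ in range(n_iter):
        # assign each x_i to its nearest current prototype (squared Euclidean distance)
        d2 = ((X_k[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # step 2: replace each prototype with the mean of the x_i assigned to it
        new_centers = np.array([X_k[nearest == l].mean(axis=0) if np.any(nearest == l)
                                else centers[l] for l in range(R)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```

Prototypes for all K classes would be obtained by calling this once per class on $T_k$; classification then proceeds by the nearest-prototype rule displayed above.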


This way of choosing prototypes for class k ignores the "location" of the other
classes and the eventual use to which the prototypes will be put. A potential
improvement on this is to employ some kind of algorithm (again ad hoc, but

181
reasonable) that moves prototypes in the direction of training input vectors in
their own class and away from training input vectors from other classes. One
such method is known by the name "LVQ"/"learning vector quantization." This
proceeds as follows.
With a set of prototypes (chosen randomly or from an R-means algorithm or some other way)
\[
z_{kl} \quad k = 1, 2, \ldots, K \text{ and } l = 1, 2, \ldots, R
\]
in hand, at each iteration m = 1, 2, ... for some sequence of "learning rates"
\[
\{\epsilon_m\} \text{ with } \epsilon_m \ge 0 \text{ and } \epsilon_m \searrow 0
\]
1. sample an $x_i$ at random from the training set and find k, l minimizing $\|x_i - z_{kl}\|$ (i.e. find the closest prototype $z_{kl}$)
2. if $y_i = k$ (from 1.), update $z_{kl}$ as
\[
z_{kl}^{new} = z_{kl} + \epsilon_m (x_i - z_{kl})
\]
and if $y_i \ne k$ (from 1.), update $z_{kl}$ as
\[
z_{kl}^{new} = z_{kl} - \epsilon_m (x_i - z_{kl})
\]
iterating until convergence.
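The LVQ updates can be written compactly as below (Python/numpy; illustrative names, and the simple 1/m learning-rate schedule is an assumption since the notes only require that $\epsilon_m$ decrease to 0).

```python
import numpy as np

def lvq(X, y, prototypes, proto_labels, n_iter=10000, rng=None):
    # prototypes: (K*R) x p array of starting prototypes; proto_labels: their class labels
    rng = np.random.default_rng(rng)
    Z = prototypes.copy()
    for m in range(1, n_iter + 1):
        eps = 1.0 / m                                 # learning rate epsilon_m decreasing to 0
        i = rng.integers(len(X))                      # sample a training case at random
        j = ((Z - X[i]) ** 2).sum(axis=1).argmin()    # index of the closest prototype
        if proto_labels[j] == y[i]:
            Z[j] += eps * (X[i] - Z[j])               # move toward x_i when classes agree
        else:
            Z[j] -= eps * (X[i] - Z[j])               # move away when they disagree
    return Z
```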


As early as Section 1.3.3, we considered nearest neighbor methods. Consider here again their use in classification problems. As before, define for each x the l-neighborhood
\[
n_l(x) = \text{the set of } l \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \Re^p
\]
A nearest neighbor method is to classify x to the class with the largest representation in $n_l(x)$ (possibly breaking ties at random). That is, define
\[
\hat{f}(x) = \arg\max_k \sum_{x_i \in n_l(x)} I[y_i = k] \tag{157}
\]
l is a complexity parameter that might be chosen by cross-validation. Properly implemented, this kind of classifier can be highly effective in spite of the curse of dimensionality. This often depends upon clever/appropriate/application-specific choice of feature vectors/functions, definition of appropriate "distance" in order to define "closeness" and the neighborhoods, and appropriate local or global dimension reduction.
in order to de…ne "closeness" and the neighborhoods, and appropriate local or
global dimension reduction.
A possibility for "local" dimension reduction is this. At x 2 <p one might
use regular Euclidean distance to …nd, say, 50 neighbors of x in the training
inputs to use to identify an appropriate local distance to employ in actually
de…ning the neighborhood nl (x) to be employed in classi…er (157). The fol-
lowing is a DANN (discriminant adaptive nearest neighbor) (squared) metric at
x 2 <p . Let
0
D2 (z; x) = (z x) Q (z x)

182
for
\[
Q = W^{-1/2}\left(W^{-1/2} B W^{-1/2} + \epsilon I\right)W^{-1/2} = W^{-1/2}\left(B^* + \epsilon I\right)W^{-1/2} \tag{158}
\]
for some $\epsilon > 0$, where W is a pooled within-class sample covariance matrix
\[
W = \sum_{k=1}^K \hat{\pi}_k W_k = \sum_{k=1}^K \hat{\pi}_k \frac{1}{n_k - 1}\sum (x_i - \bar{x}_k)(x_i - \bar{x}_k)'
\]
($\bar{x}_k$ is the average $x_i$ from class k in the 50 used to create the local metric), B is a weighted between-class covariance matrix of sample means
\[
B = \sum_{k=1}^K \hat{\pi}_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})'
\]
($\bar{x}$ is a (probably weighted) average of the $\bar{x}_k$) and
\[
B^* = W^{-1/2} B W^{-1/2}
\]

Notice that in form (158), the "outside" $W^{-1/2}$ factors "sphere" $(z - x)$ differences relative to the within-class covariance structure. $B^*$ is then the between-class covariance matrix of sphered sample means. Without the $\epsilon I$, the distance would discount differences in the directions of the eigenvectors corresponding to small eigenvalues of $B^*$ (allowing the neighborhood defined in terms of D to be severely elongated in those directions). The effect of adding the $\epsilon I$ term is to limit this elongation to some degree, preventing $x_i$ "too far in terms of Euclidean distance from x" from being included in $n_l(x)$.
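A sketch of computing the DANN matrix Q of (158) from the (say 50) Euclidean neighbors of a query point follows (Python/numpy; illustrative, with the symmetric inverse square root obtained by eigendecomposition and $\epsilon = 1$ as an assumed default; it assumes at least two neighbors in each represented class).

```python
import numpy as np

def dann_Q(X_nbr, y_nbr, eps=1.0):
    # X_nbr, y_nbr: inputs and labels of the Euclidean neighbors of the query point
    classes, counts = np.unique(y_nbr, return_counts=True)
    pi_hat = counts / counts.sum()
    xbar = np.array([X_nbr[y_nbr == k].mean(axis=0) for k in classes])
    grand = (pi_hat[:, None] * xbar).sum(axis=0)          # weighted average of class means
    # pooled within-class and weighted between-class covariance matrices
    W = sum(p * np.cov(X_nbr[y_nbr == k], rowvar=False) for p, k in zip(pi_hat, classes))
    B = sum(p * np.outer(m - grand, m - grand) for p, m in zip(pi_hat, xbar))
    # W^{-1/2} via eigendecomposition (W is symmetric nonnegative definite)
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(W.shape[0])) @ W_inv_sqrt

def dann_dist2(z, x, Q):
    # D^2(z, x) = (z - x)' Q (z - x)
    diff = z - x
    return diff @ Q @ diff
```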
A global use of the DANN kind of thinking might be to do the following. At each training input vector $x_i \in \Re^p$, one might again use regular Euclidean distance to find, say, 50 neighbors and compute a weighted between-class-mean covariance matrix $B_i$ as above (for that $x_i$). These might be averaged to produce
\[
\bar{B} = \frac{1}{N}\sum_{i=1}^N B_i
\]
Then for eigenvalues of $\bar{B}$, say $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, with corresponding (unit) eigenvectors $u_1, u_2, \ldots, u_p$, one might do nearest neighbor classification based on the first few features
\[
v_j = u_j' x
\]
and ordinary Euclidean distance. This is a nearest neighbor version of the reduced rank classification idea first met in the discussion of linear classification in Section 12.

15 Reproducing Kernel Hilbert Spaces: Penalized/Regularized and Bayes Prediction
A general framework that uni…es many interesting regularized …tting methods
is that of reproducing kernel Hilbert spaces (RKHSs). There is a very nice 2012
Statistics Surveys paper by Nancy Heckman (related to an older UBC Statistics
Department technical report (#216)) entitled "The Theory and Application of
Penalized Least Squares Methods or Reproducing Kernel Hilbert Spaces Made
Easy," that is the nicest exposition I know about of the connection of this
material to splines. Parts of what follows are borrowed shamelessly from her
paper. There is also some very helpful stu¤ in CFZ Section 3.5 and scattered
through Izenman about RKHSs.41

15.1 RKHSs and p = 1 Cubic Smoothing Splines


To provide motivation for a somewhat more general discussion, consider again the smoothing spline problem. We consider here the function space
\[
\mathcal{A} = \left\{ h : [0,1] \to \Re \mid h \text{ and } h' \text{ are absolutely continuous and } \int_0^1 (h''(x))^2\, dx < \infty \right\}
\]
as a Hilbert space (a linear space with inner product where Cauchy sequences have limits) with inner product
\[
\langle f, g \rangle_{\mathcal{A}} \equiv f(0)g(0) + f'(0)g'(0) + \int_0^1 f''(x) g''(x)\, dx
\]
(and corresponding norm $\|h\|_{\mathcal{A}} = \langle h, h\rangle_{\mathcal{A}}^{1/2}$).
With this definition of inner product and norm, for $x \in [0,1]$ the (linear) functional (a mapping $\mathcal{A} \to \Re$)
\[
F_x[f] \equiv f(x)
\]
is continuous. Thus the so-called "Riesz representation theorem" says that there is an $R_x \in \mathcal{A}$ such that
\[
F_x[f] = \langle R_x, f \rangle_{\mathcal{A}} = f(x) \quad \forall f \in \mathcal{A}
\]

($R_x$ is called "the representer of evaluation at x.") It is in fact possible to verify that for $z \in [0,1]$
\[
R_x(z) = 1 + xz + R_{1x}(z)
\]
for
\[
R_{1x}(z) = xz \min(x,z) - \frac{x+z}{2}\left(\min(x,z)\right)^2 + \frac{1}{3}\left(\min(x,z)\right)^3
\]
does this job.
$^{41}$See also "Penalized Splines and Reproducing Kernels" by Pearce and Wand in The American Statistician (2006) and "Kernel Methods in Machine Learning" by Hofmann, Schölkopf, and Smola in The Annals of Statistics (2008).
Then the function of two variables defined by
\[
R(x, z) \equiv R_x(z)
\]
is called the reproducing kernel for $\mathcal{A}$, and $\mathcal{A}$ is called a reproducing kernel Hilbert space (RKHS) (of functions).
Define a linear differential operator on elements of $\mathcal{A}$ (a map from $\mathcal{A}$ to some appropriate function space) by
\[
L[f](x) = f''(x)
\]
Then the optimization problem solved by the cubic smoothing spline is minimization of
\[
\sum_{i=1}^N \left(y_i - F_{x_i}[h]\right)^2 + \lambda \int_0^1 \left(L[h](x)\right)^2\, dx \tag{159}
\]
over choices of $h \in \mathcal{A}$.
It is possible to show that the minimizer of the quantity (159) is necessarily of the form
\[
h(x) = \theta_0 + \theta_1 x + \sum_{i=1}^N \gamma_i R_{1x_i}(x)
\]
and that for such h, the criterion (159) is of the form
\[
(Y - T\theta - K\gamma)'(Y - T\theta - K\gamma) + \lambda \gamma' K \gamma
\]
for
\[
T = (1, X)_{N \times 2}, \quad \theta = \begin{pmatrix}\theta_0 \\ \theta_1\end{pmatrix}, \quad \text{and} \quad K = \left(R_{1x_i}(x_j)\right)
\]
So this has ultimately produced a matrix calculus problem.

15.2 What is Possible Beginning from Linear Functionals and Linear Differential Operators for p = 1
For constants $d_i > 0$, functionals $F_i$, and a linear differential operator L defined for continuous functions $w_k(x)$ by
\[
L[h](x) = h^{(m)}(x) + \sum_{k=1}^{m-1} w_k(x)\, h^{(k)}(x)
\]
Heckman considers the minimization of
\[
\sum_{i=1}^N d_i \left(y_i - F_i[h]\right)^2 + \lambda \int_a^b \left(L[h](x)\right)^2\, dx \tag{160}
\]
in the space of functions
\[
\mathcal{A} = \left\{ h : [a,b] \to \Re \mid \text{derivatives of } h \text{ up to order } m-1 \text{ are absolutely continuous and } \int_a^b \left(h^{(m)}(x)\right)^2 dx < \infty \right\}
\]
One may adopt an inner product for $\mathcal{A}$ of the form
\[
\langle f, g \rangle_{\mathcal{A}} \equiv \sum_{k=0}^{m-1} f^{(k)}(a)\, g^{(k)}(a) + \int_a^b L[f](x)\, L[g](x)\, dx
\]
and have a RKHS. The assumption is made that the functionals $F_i$ are continuous and linear, and thus that they are representable as $F_i[h] = \langle f_i, h\rangle_{\mathcal{A}}$ for some $f_i \in \mathcal{A}$. An important special case is that where $F_i[h] = h(x_i)$, but other linear functionals have been used, for example $F_i[h] = \int_a^b H_i(x)\, h(x)\, dx$ for known $H_i$.
The form of the reproducing kernel implied by the choice of this inner prod-
uct is derivable as follows. First, there is a linearly independent set of functions
fu1 ; : : : ; um g that is a basis for the subspace of A consisting of those elements
h for which L [h] = 0 (the zero function). Call this subspace A0 . The so-called
Wronskian matrix associated with these functions is then
(j 1)
W (x) = ui (x)
m m

With
0 1
C = W (a) W (a)
let X
R0 (x; z) = Cij ui (x) uj (z)
i;j

Further, there is a so-called Green’s function associated with the operator L,


a function G (x; z) such that for all h 2 A satisfying h(k) (a) = 0 for k =
0; 1; : : : ; m 1
Z b
h (x) = G (x; z) L [h] (z) dz
a
Let Z b
R1 (x; z) = G (x; t) G (z; t) dt
a
The reproducing kernel associated with the inner product and L is then

R (x; z) = R0 (x; z) + R1 (x; z)

As it turns out, P A0 is a RKHS with reproducing kernel R0 under the inner


m 1 (k)
product hf; gi0 = k=0 f (a) g (k) (a). Further, the subspace of A con-
(k)
sisting of those h with h (a) = 0 for k = 0; 1; : : : ; m 1 (call it A1 ) is
a RKHS with reproducing kernel R1 (x; z) under the inner product hf; gi1 =
Rb
a
L [f ] (x) L [g] (x) dx. Every element of A0 is perpendicular to every element
of A1 in A and every h 2 A can be written uniquely as h0 + h1 for an h0 2 A0
and an h1 2 A1 .

These facts can be used to show that a minimizer of quantity (160) exists and is of the form
\[
h(x) = \sum_{k=1}^m \theta_k u_k(x) + \sum_{i=1}^N \gamma_i R_1(x_i, x)
\]
For h(x) of this form, loss plus penalty (160) is of the form
\[
(Y - T\theta - K\gamma)'\, D\, (Y - T\theta - K\gamma) + \lambda \gamma' K \gamma
\]
for
\[
T = \left(F_i[u_j]\right),\quad \theta = \begin{pmatrix}\theta_1 \\ \theta_2 \\ \vdots \\ \theta_m\end{pmatrix},\quad D = \mathrm{diag}(d_1, \ldots, d_N),\quad \text{and} \quad K = \left(F_i[R_1(\cdot, x_j)]\right)
\]
and its optimization is a matrix calculus problem. In the important special case where the $F_i$ are function evaluation at $x_i$, above
\[
F_i[u_j] = u_j(x_i) \quad \text{and} \quad F_i[R_1(\cdot, x_j)] = R_1(x_i, x_j)
\]

15.3 What Is Common Beginning Directly From a Kernel


Another way that it is common to make use of RKHSs is to begin with a kernel
and its implied inner product (rather than deriving one as appropriate to a
particular well-formed optimization problem in function spaces). The 2006
paper of Pearce and Wand in The American Statistician is a readable account
of this thinking that is more complete than what follows here (and provides
some references).
This development begins with C a compact subset of $\Re^p$ and a symmetric kernel function
\[
K : C \times C \to \Re
\]
Ultimately, we will consider as predictors for $x \in C$ linear combinations of sections of the kernel function, $\sum_{i=1}^N b_i K(x, x_i)$ (where the $x_i$ are the input vectors in the training set). But to get there in a rational way, and to incorporate use of a complexity penalty into the fitting, we will restrict attention to those kernels that have nice properties. In particular, we require that K be continuous and non-negative definite.
Then according to what is known as Mercer's Theorem, K may be written in the form
\[
K(z, x) = \sum_{i=1}^{\infty} \gamma_i \phi_i(z) \phi_i(x) \tag{161}
\]
for linearly independent (in $L_2(C)$) functions $\{\phi_i\}$ and constants $\gamma_i \ge 0$, where the $\phi_i$ comprise an orthonormal basis for $L_2(C)$ (so any function $f \in L_2(C)$ has an expansion in terms of the $\phi_i$ as $\sum_{i=1}^{\infty} \langle f, \phi_i\rangle_2 \phi_i$ for $\langle f, h\rangle_2 \equiv \int_C f(x) h(x)\, dx$), and the ones corresponding to positive $\gamma_i$ may be taken to be continuous on C. Further, the $\phi_i$ are "eigenfunctions" of the kernel K corresponding to the "eigenvalues $\gamma_i$" in the sense that in $L_2(C)$
\[
\int \phi_i(z)\, K(z, \cdot)\, dz = \gamma_i \phi_i(\cdot)
\]

Our present interest is in a function space $\mathcal{A}$ (that is a subset of $L_2(C)$ with a different inner product and norm) with members of the form
\[
f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) \quad \text{for } c_i \text{ with } \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i} < \infty \tag{162}
\]
(called the "primal form" of functions in the space). (Notice that all elements of $L_2(C)$ are of the form $f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x)$ with $\sum_{i=1}^{\infty} c_i^2 < \infty$.) More naturally, our interest centers on
\[
f(x) = \sum_{i=1}^{\infty} b_i K(x, z_i) \quad \text{for some set of } z_i \tag{163}
\]
supposing that the series converges appropriately (called the "dual form" of functions in the space). The former is most useful for producing simple proofs, while the second is most natural for application, since how to obtain the $\phi_i$ and corresponding $\gamma_i$ for a given K is not so obvious. Notice that
\[
K(z, x) = \sum_{i=1}^{\infty} \gamma_i \phi_i(z) \phi_i(x) = \sum_{i=1}^{\infty} \left(\gamma_i \phi_i(x)\right) \phi_i(z)
\]
and letting $\gamma_i \phi_i(x) = c_i(x)$, since $\sum_{i=1}^{\infty} c_i^2(x)/\gamma_i = \sum_{i=1}^{\infty} \gamma_i \phi_i^2(x) = K(x, x) < \infty$, the function $K(\cdot, x)$ is of the form (162), so that we can expect functions of the form (163) with absolutely convergent $\sum_{i=1}^{\infty} b_i$ to be of form (162).
In the space of functions (162), we define an inner product (for our Hilbert space)
\[
\left\langle \sum_{i=1}^{\infty} c_i \phi_i,\; \sum_{i=1}^{\infty} d_i \phi_i \right\rangle_{\mathcal{A}} \equiv \sum_{i=1}^{\infty} \frac{c_i d_i}{\gamma_i}
\]
so that
\[
\left\| \sum_{i=1}^{\infty} c_i \phi_i \right\|_{\mathcal{A}}^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i}
\]
Note then that for $f = \sum_{i=1}^{\infty} c_i \phi_i$ belonging to the Hilbert space $\mathcal{A}$,
\[
\langle f, K(\cdot, x) \rangle_{\mathcal{A}} = \sum_{i=1}^{\infty} \frac{c_i \gamma_i \phi_i(x)}{\gamma_i} = \sum_{i=1}^{\infty} c_i \phi_i(x) = f(x)
\]

and so $K(\cdot, x)$ is the representer of evaluation at x. Further,
\[
\langle K(\cdot, z), K(\cdot, x) \rangle_{\mathcal{A}} = K(z, x)
\]
which is the reproducing property of the RKHS.
Notice also that for two linear combinations of slices of the kernel function at some set of (say, M) inputs $\{z_i\}$ (two functions in $\mathcal{A}$ represented in dual form)
\[
f(\cdot) = \sum_{i=1}^M c_i K(\cdot, z_i) \quad \text{and} \quad g(\cdot) = \sum_{i=1}^M d_i K(\cdot, z_i)
\]
the corresponding $\mathcal{A}$ inner product is
\[
\langle f, g \rangle_{\mathcal{A}} = \left\langle \sum_{i=1}^M c_i K(\cdot, z_i),\; \sum_{j=1}^M d_j K(\cdot, z_j) \right\rangle_{\mathcal{A}} = \sum_{i=1}^M \sum_{j=1}^M c_i d_j K(z_i, z_j) = (c_1, \ldots, c_M)\left(K(z_i, z_j)\right)_{\substack{i=1,\ldots,M \\ j=1,\ldots,M}}\begin{pmatrix} d_1 \\ \vdots \\ d_M \end{pmatrix}
\]
This is a kind of $\Re^M$ inner product of the coefficient vectors $c' = (c_1, \ldots, c_M)$ and $d' = (d_1, \ldots, d_M)$ defined by the nonnegative definite matrix $(K(z_i, z_j))$. Further, if a random M-vector Y has covariance matrix $(K(z_i, z_j))$, this is $\mathrm{Cov}(c'Y, d'Y)$. So, in particular, for f of this form $\|f\|_{\mathcal{A}}^2 = \langle f, f\rangle_{\mathcal{A}} = \mathrm{Var}(c'Y)$.
For applying this material to the fitting of training data, for $\lambda > 0$ and a loss function $L(\hat{y}, y) \ge 0$ define an optimization criterion
\[
\underset{f \in \mathcal{A}}{\text{minimize}} \left( \sum_{i=1}^N L(f(x_i), y_i) + \lambda \|f\|_{\mathcal{A}}^2 \right) \tag{164}
\]
As it turns out, an optimizer of this criterion must, for the training vectors $\{x_i\}$, be of the form
\[
\hat{f}(x) = \sum_{i=1}^N b_i K(x, x_i) \tag{165}
\]
and the corresponding $\|\hat{f}\|_{\mathcal{A}}^2$ is then
\[
\langle \hat{f}, \hat{f} \rangle_{\mathcal{A}} = \sum_{i=1}^N \sum_{j=1}^N b_i b_j K(x_i, x_j)
\]
The criterion (164) is thus
\[
\underset{b \in \Re^N}{\text{minimize}} \left( \sum_{i=1}^N L\left(\sum_{j=1}^N b_j K(x_i, x_j),\, y_i\right) + \lambda\, b'\left(K(x_i, x_j)\right) b \right) \tag{166}
\]
With the Gram matrix $K = (K(x_i, x_j))$ and $P = K^-$ a symmetric generalized inverse of K, and defining
\[
L_N(Kb, Y) \equiv \sum_{i=1}^N L\left(\sum_{j=1}^N b_j K(x_i, x_j),\, y_i\right)
\]
the optimization criterion (166) is
\[
\underset{b \in \Re^N}{\text{minimize}} \; L_N(Kb, Y) + \lambda\, b' K b
\]
i.e.
\[
\underset{b \in \Re^N}{\text{minimize}} \; L_N(Kb, Y) + \lambda\, b' K' P K b
\]
i.e.
\[
\underset{v \in C(K)}{\text{minimize}} \; \left( L_N(v, Y) + \lambda\, v' P v \right) \tag{167}
\]
That is, the function space optimization problem (164) reduces to the N-dimensional optimization problem (167). A $v \in C(K)$ (the column space of K) minimizing $L_N(v, Y) + \lambda v' P v$ corresponds to b minimizing $L_N(Kb, Y) + \lambda b' K b$ via
\[
Kb = v \tag{168}
\]
For the particular special case of squared error loss, $L(\hat{y}, y) = (y - \hat{y})^2$, this development has a very explicit punch line. That is,
\[
L_N(Kb, Y) + \lambda\, b' K b = (Y - Kb)'(Y - Kb) + \lambda\, b' K b
\]
Some vector calculus shows that this is minimized over choices of b by
\[
b = (K + \lambda I)^{-1} Y \tag{169}
\]
and corresponding fitted values are
\[
\hat{Y} = v = K(K + \lambda I)^{-1} Y
\]
Then using fact (169) under squared error loss, the solution to problem (164) is from expression (165)
\[
\hat{f}(x) = \sum_{i=1}^N b_i K(x, x_i) \tag{170}
\]
i=1

To better understand the nature of (167), consider the eigen decomposition of K in a case where it is non-singular as
\[
K = U\, \mathrm{diag}(\delta_1, \delta_2, \ldots, \delta_N)\, U'
\]
for eigenvalues $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_N > 0$, where the eigenvector columns of U comprise an orthonormal basis for $\Re^N$. The penalty in form (167) is
\[
\lambda\, v' P v = \lambda\, v' U\, \mathrm{diag}(1/\delta_1, 1/\delta_2, \ldots, 1/\delta_N)\, U' v = \lambda \sum_{j=1}^N \frac{1}{\delta_j} \langle v, u_j \rangle^2
\]
Now (remembering that the $u_j$ comprise an orthonormal basis for $\Re^N$)
\[
v = \sum_{j=1}^N \langle v, u_j \rangle u_j \quad \text{and} \quad \|v\|^2 = \langle v, v \rangle = \sum_{j=1}^N \langle v, u_j \rangle^2
\]
so we see that in choosing a v to optimize $L_N(v, Y) + \lambda v' P v$, we penalize highly those v with large components in the directions of the late eigenvectors of K (the ones corresponding to its small eigenvalues), thereby tending to suppress those features of a potential v.
HTF seem to say that the eigenvalues and eigenvectors of the data-dependent
N N matrix K are somehow related respectively to the constants i and
functions i in the representation (161) of K as a weighted series of products.
That seems hard to understand and (even if true) certainly not obvious.
CFZ provide a result summarizing the most general available version of this development, known as "the representer theorem." It says that if $\Omega : [0, \infty) \to \Re$ is strictly increasing and $L((x_1, y_1, h(x_1)), \ldots, (x_N, y_N, h(x_N))) \ge 0$ is an arbitrary loss function associated with the prediction of each $y_i$ as $h(x_i)$, then an $h \in \mathcal{A}$ minimizing
\[
L\left((x_1, y_1, h(x_1)), (x_2, y_2, h(x_2)), \ldots, (x_N, y_N, h(x_N))\right) + \Omega\left(\|h\|_{\mathcal{A}}\right)
\]
has a representation as
\[
h(x) = \sum_{i=1}^N \alpha_i K(x, x_i)
\]
Further, if $\{\psi_1, \psi_2, \ldots, \psi_M\}$ is a set of real-valued functions and the $N \times M$ matrix $(\psi_j(x_i))$ is of rank M, then for $h_0 \in \mathrm{span}\{\psi_1, \psi_2, \ldots, \psi_M\}$ and $h_1 \in \mathcal{A}$, an $h = h_0 + h_1$ minimizing
\[
L\left((x_1, y_1, h(x_1)), (x_2, y_2, h(x_2)), \ldots, (x_N, y_N, h(x_N))\right) + \Omega\left(\|h_1\|_{\mathcal{A}}\right)
\]
has a representation as
\[
h(x) = \sum_{i=1}^M \beta_i \psi_i(x) + \sum_{i=1}^N \alpha_i K(x, x_i)
\]

The extra generality provided by this theorem for the squared error loss case treated above is that it provides for linear combinations of the functions $\psi_i(x)$ to be unpenalized in fitting. Then for
\[
\Psi = \left(\psi_j(x_i)\right)_{N \times M}
\]
and
\[
R = \left(I - \Psi(\Psi'\Psi)^{-1}\Psi'\right)Y
\]
an optimizing $\beta$ is $\hat{\beta} = (\Psi'\Psi)^{-1}\Psi' Y$, and $\hat{\alpha}$ optimizes
\[
(R - K\alpha)'(R - K\alpha) + \lambda \alpha' K \alpha
\]
and the earlier argument implies that $\hat{\alpha} = (K + \lambda I)^{-1} R$.

15.3.1 Reprise of Some Special Cases


Here we briefly consider special cases of this development, making use of kernel functions introduced as early as Section 1.4.3, beginning with the standard kernel in p dimensions
\[
K(z, x) = \left(1 + \langle z, x \rangle\right)^d
\]
For fixed $x_i$ the basis functions $K(x, x_i)$ are dth order polynomials in the entries of x. So the fitting is in terms of such polynomials. Note that since K(z, x) is relatively simple here, there seems to be a good chance of explicitly deriving a representation (161) and perhaps working out all the details of what is above in a very concrete setting.
Another standard kernel function in p dimensions is
\[
K(z, x) = \exp\left(-\lambda \|z - x\|^2\right)
\]
and the basis functions $K(x, x_i)$ are essentially spherically symmetric normal pdfs with mean vectors $x_i$. (These are "Gaussian radial basis functions" and for p = 2, functions (165) produce prediction surfaces in 3-space that have smooth symmetric "mountains" or "craters" at each $x_i$, of elevation or depth relative to the rest of the surface governed by $b_i$ and extent governed by $\lambda$.)
Of course, Section 1.4.3 provides a number of insights that enable the cre-
ation of a wide variety of kernels beyond the few mentioned here.
Then, for example, the standard development of so-called "support vector classifiers" in a 2-class context with y taking values $\pm 1$ uses some kernel K(z, x) and voting function
\[
g(x) = b_0 + \sum_{i=1}^N b_i K(x, x_i)
\]
in combination with loss
\[
L(g(x), y) = \left[1 - y\, g(x)\right]_+
\]
(the sign of g(x) providing the classification associated with x).

15.3.2 Addendum Regarding the Structures of the Spaces Related
to a Kernel
An ampli…cation of some aspects of the basic description provided above for the
RKHS corresponding to a kernel K : C C ! < is as follows.
From the representation
1
X
K (z; x) = i i (z) i (x)
i=1

write
p
i = i i

so that the kernel is


1
X
K (z; x) = i (z) i (x) (171)
i=1

(Note that considered as functions in L2R (C) the i are orthogonal, but not
2
generally orthonormal, since h i ; i i2 C i i
(x) dx = i which is typically
not 1.) Representation (171) suggests that one think about the inner product
for inputs provided by the kernel in terms of a transform of an input vector
x 2 <p to an in…nite-dimensional feature vector

(x) = ( 1 (x) ; 2 (x) ; : : :)

and then "ordinary <1 inner products" de…ned on those feature vectors.
The function space A has members of the (primal) form
1
X 1 2
X c i
f (x) = ci i (x) for ci with <1
i=1 i=1 i

This is perhaps more naturally


1
X 1
X p 2 1
X
ci i
f (x) = ci i (x) for ci with = c2i < 1
i=1 i=1 i i=1
P1
(Again,
P1 2 all elements of L2 (C) are of the form f (x) = i=1 ci i (x) with
i=1 ci < 1.) The A inner product of two functions of this primal form
has been de…ned as
*1 1
+ 1
X X X ci di
ci i ; di i
i=1 i=1 i=1 i
A
* 1 1
+
X ci X di
= p i ; p i
i=1 i i=1 i
A
c1 c2 d1 d2
= p ; p ;::: ; p ; p ;:::
1 2 1 2 <1

193
So, two
P1elements of P A written in terms of the i (instead of their multiples i )
1
say i=1 ci i and i=1 di i have A inner product that is the ordinary <1
inner product of their vectors of coe¢ cients.
Now consider the function
X1
K ( ; x) = i ( ) i (x)
i=1
P1
For f = i=1 ci i 2 A,
* 1
+ 1
X X
hf; K ( ; x)iA = ci i ; K ( ; x) = ci i (x) = f (x)
i=1 A i=1

and (perhaps more clearly than above) indeed K ( ; x) is the representer of eval-
uation at x in the function space A.

15.4 Gaussian Process "Priors," Bayes Predictors, and RKHSs
The RKHS material has an interesting connection to Bayes prediction. It's our purpose here to show that connection. Consider an application of essentially Bayesian thinking to the development of a predictor based on the use of a Gaussian process as a more or less non-parametric "prior distribution" for the function (of x) E[y|x]. That is, for purposes of developing a predictor, suppose that one assumes that
\[
y = \mu(x) + \epsilon
\]
where
\[
\mu(x) = \nu(x) + \delta(x)
\]
$E\epsilon = 0$, $\mathrm{Var}\,\epsilon = \sigma^2$, the function $\nu(x)$ is known (it could be taken to be identically 0) and plays the role of a prior mean for the function (of x)
\[
\mu(x) = E[y|x]
\]
and (independent of the errors $\epsilon$), $\delta(x)$ is a realization of a mean 0 stationary Gaussian process on $\Re^p$, this Gaussian process describing the prior uncertainty for $\mu(x)$ around $\nu(x)$. More explicitly, the assumption on $\delta(x)$ is that $E\delta(x) = 0$ and $\mathrm{Var}\,\delta(x) = \tau^2$ for all x, and for some appropriate (correlation) function $\rho$, $\mathrm{Cov}(\delta(x), \delta(z)) = \tau^2\rho(x - z)$ for all x and z ($\rho(0) = 1$ and the function of two variables $\rho(x - z)$ must be positive definite). The "Gaussian" assumption is then that for any finite set of elements of $\Re^p$, say $z_1, z_2, \ldots, z_M$, the vector of corresponding values $\delta(z_i)$ is multivariate normal.
There are a number of standard forms that have been suggested for the correlation function $\rho$. The simplest ones are of a product form, i.e. if $\rho_j$ is a valid one-dimensional correlation function, then the product
\[
\rho(x - z) = \prod_{j=1}^p \rho_j(x_j - z_j)
\]
is a valid correlation function for a Gaussian process on $\Re^p$. Standard forms for correlation functions in one dimension are $\rho(t) = \exp(-c t^2)$ and $\rho(t) = \exp(-c|t|)$.$^{42}$ The first produces "smoother" realizations than does the second, and in both cases, the constant c governs how fast realizations vary.
ond, and in both cases, the constant c governs how fast realizations vary.
One may then consider the joint distribution (conditional on the $x_i$ and assuming that for the training values $y_i$ the $\epsilon_i$ are iid, independent of the $\delta(x_i)$) of the training output values and a value of $\mu(x)$. From this, one can find the conditional mean for $\mu(x)$ given the training data. To that end, let
\[
\Gamma_{N \times N} = \left(\tau^2 \rho(x_i - x_j)\right)_{\substack{i=1,2,\ldots,N \\ j=1,2,\ldots,N}}
\]
Then for a single value of x,
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \\ \mu(x) \end{pmatrix} \sim \mathrm{MVN}_{N+1}\left( \begin{pmatrix} \nu(x_1) \\ \nu(x_2) \\ \vdots \\ \nu(x_N) \\ \nu(x) \end{pmatrix},\; \begin{pmatrix} \Gamma + \sigma^2 I & \gamma(x) \\ \gamma(x)' & \tau^2 \end{pmatrix} \right)
\]
for
\[
\gamma(x)_{N \times 1} = \begin{pmatrix} \tau^2\rho(x - x_1) \\ \tau^2\rho(x - x_2) \\ \vdots \\ \tau^2\rho(x - x_N) \end{pmatrix}
\]
Then standard multivariate normal theory says that the conditional mean of $\mu(x)$ given Y is
\[
\hat{f}(x) = \nu(x) + \gamma(x)'\left(\Gamma + \sigma^2 I\right)^{-1} \begin{pmatrix} y_1 - \nu(x_1) \\ y_2 - \nu(x_2) \\ \vdots \\ y_N - \nu(x_N) \end{pmatrix} \tag{172}
\]
Write
\[
w_{N \times 1} = \left(\Gamma + \sigma^2 I\right)^{-1} \begin{pmatrix} y_1 - \nu(x_1) \\ y_2 - \nu(x_2) \\ \vdots \\ y_N - \nu(x_N) \end{pmatrix} \tag{173}
\]
and then note that form (172) implies that
\[
\hat{f}(x) = \nu(x) + \sum_{i=1}^N w_i\, \tau^2 \rho(x - x_i) \tag{174}
\]
$^{42}$See Section 1.4.3 for other p = 1 bounded non-negative definite functions that can be used to create correlation functions.
and we see that this development ultimately produces $\nu(x)$ plus a linear combination of the "basis functions" $\tau^2\rho(x - x_i)$ as a predictor. Remembering that $\tau^2\rho(x - z)$ must be positive definite and seeing the ultimate form of the predictor, we are reminded of the RKHS material.
In fact, consider the case where $\nu(x) \equiv 0$. (If one has some non-zero prior mean for $\mu(x)$, arguably that mean function should be subtracted from the raw training outputs before beginning the development of a predictor. At a minimum, output values should probably be centered before attempting development of a predictor.) Compare displays (173) and (174) to displays (169) and (170) for the $\nu(x) \equiv 0$ case. What is then clear is that the present "Bayes" Gaussian process development of a predictor under squared error loss based on a covariance function $\tau^2\rho(x - z)$ and error variance $\sigma^2$ is equivalent to a RKHS regularized fit of a function to training data based on a kernel $K(x, z) = \tau^2\rho(x - z)$ and penalty weight $\lambda = \sigma^2$.
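The equivalence just described is easy to see numerically. The following Python/numpy sketch (illustrative names, with $\nu(x) \equiv 0$ and the Gaussian correlation form assumed) computes the Gaussian process posterior mean of (172)-(174); it is literally the kernel ridge fit (169)-(170) with kernel $\tau^2\rho(x - z)$ and $\lambda = \sigma^2$.

```python
import numpy as np

def gp_posterior_mean(X, y, tau2, sigma2, c=1.0):
    # Gaussian correlation rho(x - z) = exp(-c ||x - z||^2); prior mean nu(x) = 0
    rho = lambda a, b: np.exp(-c * np.sum((np.asarray(a) - np.asarray(b)) ** 2))
    Gamma = tau2 * np.array([[rho(xi, xj) for xj in X] for xi in X])
    w = np.linalg.solve(Gamma + sigma2 * np.eye(len(X)), y)         # (173)
    # f_hat(x) = sum_i w_i tau^2 rho(x - x_i)                        (174)
    return lambda x: sum(wi * tau2 * rho(x, xi) for wi, xi in zip(w, X))
```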

16 More on Understanding and Predicting Predictor Performance
There are a variety of theoretical and empirical quantities that might be com-
puted to quantify predictor performance. Those that are empirical and reliably
track important theoretical ones might potentially be used to select an e¤ective
predictor. We’ll here consider some of those (theoretical and empirical) mea-
sures. We will do so in the by-now-familiar setting where training data (x1 ; y1 ) ;
(x2 ; y2 ) ; : : : ; (xN ; yN ) are assumed to be iid P and independent of (x; y) that
is also P distributed, and are used to pick a prediction rule f^ (x) to be used
under a loss $L(\hat{y}, y) \ge 0$.
What is very easy to think about and compute is the training error
\[
\mathrm{err} = \frac{1}{N}\sum_{i=1}^N L\left(\hat{f}(x_i), y_i\right)
\]

This typically decreases with increased complexity in the form of f, and is no reliable indicator of predictor performance off the training set. Measures of prediction rule performance off the training set must have a theoretical basis (or be somehow based on data held back from the process of prediction rule development).
General loss function versions of (squared error loss) quantities related to err defined in Section ??, are
\[
\mathrm{Err}(x) \equiv E_T E\left[L\left(\hat{f}(x), y\right) \mid x\right]
\]
\[
\mathrm{Err}_T \equiv E_{(x,y)} L\left(\hat{f}(x), y\right) \tag{175}
\]
and
\[
\mathrm{Err} \equiv E_x \mathrm{Err}(x) = E_T \mathrm{Err}_T \tag{176}
\]

A slightly different and semi-empirical version of this expected prediction error (176) is the "in-sample" test error (7.12) of HTF
\[
\frac{1}{N}\sum_{i=1}^N \mathrm{Err}(x_i)
\]

16.1 Optimism of the Training Error


Typically, err is less than $\mathrm{Err}_T$. Part of the difference in these is potentially due to the fact that $\mathrm{Err}_T$ is an "extra-sample" error, in that the averaging in (175) is potentially over values x outside the set of values in the training data. We might consider instead
\[
\mathrm{Err}_{T\,\mathrm{in}} = \frac{1}{N}\sum_{i=1}^N E^{y_i} L\left(\hat{f}(x_i), y_i\right) \tag{177}
\]
where the expectations indicated in form (177) are over $y_i \sim P_{y|x=x_i}$ (the entire training sample used to choose $\hat{f}$, both inputs and outputs, is being held constant in the averaging in display (177)). The difference
\[
\mathrm{op} = \mathrm{Err}_{T\,\mathrm{in}} - \mathrm{err}
\]
is called the "optimism of the training error." HTF use the notation
\[
\omega = E^Y \mathrm{op} = E^Y\left(\mathrm{Err}_{T\,\mathrm{in}} - \mathrm{err}\right) \tag{178}
\]
where the averaging indicated by $E^Y$ is over the outputs in the training set (using the conditionally independent $y_i$'s, $y_i \sim P_{y|x=x_i}$). HTF say that for many losses
\[
\omega = \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}^Y(\hat{y}_i, y_i) \tag{179}
\]
For example, consider the case of squared error loss. There

! = EY (ErrT in err)
N
X N
1 2 1 X 2
= EY EY yi f^ (xi ) EY yi f^ (xi )
N i=1
N i=1
N
X
2
= EY yi f^ (xi ) EY EY yi f^ (xi )
N i=1
N
2 X Y ^
= E f (xi ) (yi E [yjx = xi ])
N i=1
N
2 X
= CovY (b
yi ; yi )
N i=1

We note that in this context, assuming that given the $x_i$ in the training data the outputs are uncorrelated and have constant variance $\sigma^2$, by relationship (57)
\[
\omega = \frac{2\sigma^2}{N}\, \mathrm{df}\left(\hat{Y}\right)
\]

16.2 Cp , AIC and BIC


The fact that
\[
\mathrm{Err}_{T\,\mathrm{in}} = \mathrm{err} + \mathrm{op}
\]
suggests the making of estimates of $\omega = E^Y \mathrm{op}$ and the use of
\[
\mathrm{err} + \hat{\omega} \tag{180}
\]
as a guide in model selection. This idea produces consideration of the model selection criteria $C_p$/AIC and BIC.

16.2.1 Cp and AIC


For the situation of least squares fitting with p predictors or basis functions and squared error loss,
\[
\mathrm{df}\left(\hat{Y}\right) = p = \mathrm{tr}\left(X\left(X'X\right)^{-1}X'\right)
\]
so that $\sum_{i=1}^N \mathrm{Cov}^Y(\hat{y}_i, y_i) = p\sigma^2$. Then, if $\hat{\sigma}^2$ is an estimated error variance based on a low-bias/high-number-of-predictors fit, a version of quantity (180) suitable for this context is Mallows' $C_p$
\[
C_p \equiv \mathrm{err} + \frac{2p\hat{\sigma}^2}{N}
\]
In a more general setting, if one can appropriately evaluate or estimate $\sum_{i=1}^N \mathrm{Cov}^Y(\hat{y}_i, y_i) = \mathrm{df}(\hat{Y})\sigma^2$, a general version of quantity (180) becomes the Akaike information criterion
\[
\mathrm{AIC} = \mathrm{err} + \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}^Y(\hat{y}_i, y_i) = \mathrm{err} + \frac{2}{N}\, \mathrm{df}\left(\hat{Y}\right)\sigma^2
\]

16.2.2 BIC
For situations where fitting is done by maximum likelihood, the Bayesian Information Criterion of Schwarz is an alternative to AIC. That is, where the joint distribution P produces density $P(y|\theta, x)$ for the conditional distribution of y|x and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, a (maximized) log-likelihood is
\[
\mathrm{loglik} = \sum_{i=1}^N \log P\left(y_i | \hat{\theta}, x_i\right)
\]

and the so-called Bayesian information criterion
\[
\mathrm{BIC} = -2\, \mathrm{loglik} + (\log N)\, \mathrm{df}\left(\hat{Y}\right)
\]
For y|x normal with variance $\sigma^2$, up to a constant, this is
\[
\mathrm{BIC} = \frac{N}{\sigma^2}\left(\mathrm{err} + \frac{\log N}{N}\, \mathrm{df}\left(\hat{Y}\right)\sigma^2\right)
\]
and after switching 2 for $\log N$, BIC is a multiple of AIC. The replacement of 2 with $\log N$ means that when used to guide model/predictor selections, BIC will typically favor simpler models/predictors than will AIC.
The Bayesian origins of BIC can be developed as follows. Suppose (as in
Section 11.1) that M models are under consideration, the mth of which has
parameter vector m and corresponding density for training data

fm (T j m)

with prior density for m


gm ( m)

and prior probability for model m

(m)

With this structure, the posterior distribution of the model index is


Z
(mjT ) / (m) fm (T j m ) gm ( m ) d m

Under 0-1 loss and uniform ( ), one wants to choose model m maximizing
Z
fm (T j m ) gm ( m ) d m = fm (T ) = the mth marginal of T

The so-called Laplace approximation says that


dm
log fm (T ) log fm T j c
m log N + O (1)
2
where dm is the real dimension of m . Assuming that the marginal of x
doesn’t change model-to-model or parameter-to-parameter, log fm T j c m is
loglik +CN , where CN is a function of only the input values in the training set.
Then

2 log fm (T ) 2 loglik + (log N ) dm + O (1) 2CN


= BIC + O (1) 2CN

and (at least approximately) choosing m to maximize fm (T ) is choosing m to


minimize BIC.

16.3 Cross-Validation Estimation of Err
K-fold cross-validation is described in Section 1.3.6. One hopes that
\[
CV\left(\hat{f}\right) = \frac{1}{N}\sum_{i=1}^N L\left(\hat{f}^{-k(i)}(x_i), y_i\right)
\]
estimates Err. In predictor selection, say where predictor $\hat{f}_{\lambda}$ has a complexity parameter $\lambda$, it is common to look at
\[
CV\left(\hat{f}_{\lambda}\right)
\]
as a function of $\lambda$, try to optimize, and then refit (with that $\lambda$) to the whole training set.
K-fold cross-validation can be expected to estimate Err for
\[
\text{``}N\text{''} = \left(1 - \frac{1}{K}\right)N
\]
The question of how cross-validation might be expected to do is thus related to how Err changes with N (the size of the training sample). The statistical folklore is that typically Err decreases monotonically in N, approaching some limiting value as N goes to infinity. The "early" (small N) part of the "Err vs N curve" is steep and the "late" part (large N) is relatively flat. If (1 - 1/K)N is large enough that at such a size of training dataset the curve is flat, then the effectiveness of cross-validation is limited only by the noise inherent in estimating it, and not by the fact that training sets of size (1 - 1/K)N are not of size N. Operationally, K = 5 or 10 seems standard, though as discussed in Section ?? there is recent evidence in favor of using K = N, i.e., LOOCV.
in Section ?? there is recent evidence in favor of using K = N , i.e., LOOCV.
HTF say that for many linear …tting methods (that produce Yb = M Y )
including least squares projection and cubic smoothing splines, the N = K
(leave one out) cross-validation error is (for f^i produced by training on T
f(xi ; yi )g)

N N
!2
1 X 2 1 X yi f^ (xi )
CV f^ = yi f^i (xi ) =
N i=1 N i=1 1 Mii

(for Mii the ith diagonal element of M ). The so-called generalized cross-
validation approximation to this is the much more easily computed
N
!2
1 X yi f^ (xi )
GCV f^ =
N i=1 1 tr (M ) =N
err
= 2
(1 tr (M ) =N )

It is worth noting (per HTF Exercise 7.7) that since $1/(1-x)^2 \approx 1 + 2x$ for x near 0,
\[
GCV\left(\hat{f}\right) = \frac{1}{N}\sum_{i=1}^N \left(\frac{y_i - \hat{f}(x_i)}{1 - \mathrm{tr}(M)/N}\right)^2 \approx \frac{1}{N}\sum_{i=1}^N \left(y_i - \hat{f}(x_i)\right)^2 + \frac{2}{N}\,\mathrm{tr}(M)\left(\frac{1}{N}\sum_{i=1}^N \left(y_i - \hat{f}(x_i)\right)^2\right)
\]

which is close to AIC, the difference being that here $\sigma^2$ is being estimated based on the model being fit, as opposed to being estimated based on a low-bias/large model.
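For linear fitters $\hat{Y} = MY$, both quantities are a few lines of code; a Python/numpy sketch (illustrative, not from the notes) of the leave-one-out shortcut and GCV is:

```python
import numpy as np

def loocv_and_gcv(M, y):
    # M: the N x N linear smoother matrix, with fitted values f_hat = M @ y
    f_hat = M @ y
    resid = y - f_hat
    loocv = np.mean((resid / (1.0 - np.diag(M))) ** 2)            # leave-one-out CV error
    gcv = np.mean((resid / (1.0 - np.trace(M) / len(y))) ** 2)    # generalized CV
    return loocv, gcv
```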

16.4 Bootstrap Estimation of Err


Suppose that the values of the input vectors in the training set are unique. One might make B bootstrap samples of N (random samples with replacement of size N) from the training set T, say $T^1, T^2, \ldots, T^B$, and train on these bootstrap samples to produce predictors, say
\[
\text{predictor } \hat{f}^b \text{ based on } T^b
\]
Let $C_i$ be the set of indices $b = 1, 2, \ldots, B$ for which $(x_i, y_i) \notin T^b$. A possible bootstrap estimate of Err is then
\[
\widehat{\mathrm{Err}}^{(1)} \equiv \frac{1}{N}\sum_{i=1}^N \left[ \frac{1}{|C_i|}\sum_{b \in C_i} L\left(\hat{f}^b(x_i), y_i\right) \right]
\]
It's not completely clear what to make of this. For one thing, the $T^b$ rarely have N distinct elements. In fact, the expected number of distinct cases in a bootstrap sample for N of any appreciable size is about .632N. So roughly speaking, we might expect $\widehat{\mathrm{Err}}^{(1)}$ to estimate Err at .632N, not at N. So unless Err as a function of training set size is fairly flat to the right of .632N, one might expect substantial positive bias in it as an estimate of Err (at N).
HTF argue for
\[
\widehat{\mathrm{Err}}^{(.632)} \equiv .368\, \mathrm{err} + .632\, \widehat{\mathrm{Err}}^{(1)}
\]
as a first order correction on the biased bootstrap estimate, but admit that this is not perfect either, and propose a more complicated fix (that they call $\widehat{\mathrm{Err}}^{(.632+)}$) for classification problems.
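A sketch of computing $\widehat{\mathrm{Err}}^{(1)}$ and $\widehat{\mathrm{Err}}^{(.632)}$ is below (Python; illustrative, with fit(X, y) assumed to return a predictor function and loss a user-supplied loss).

```python
import numpy as np

def bootstrap_err(X, y, fit, loss, B=100, rng=None):
    rng = np.random.default_rng(rng)
    N = len(y)
    per_case = [[] for _ in range(N)]
    for _ in range(B):
        idx = rng.integers(N, size=N)                 # a bootstrap sample T^b
        f_b = fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(N), idx)         # cases with (x_i, y_i) not in T^b
        for i in out:
            per_case[i].append(loss(f_b(X[i]), y[i]))
    err1 = np.mean([np.mean(l) for l in per_case if l])                  # Err-hat^(1)
    f_full = fit(X, y)
    err_train = np.mean([loss(f_full(X[i]), y[i]) for i in range(N)])    # training error err
    return err1, 0.368 * err_train + 0.632 * err1                        # Err-hat^(.632)
```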

Part V
Unsupervised Learning Methods
17 Some Methods of Unsupervised Learning
As we said in Section 1.1, "supervised learning" is basically prediction of y
belonging to < or some …nite index set from a p-dimensional x with coordinates
each individually in < or some …nite index set, using training data pairs

(x1 ; y1 ) ; (x2 ; y2 ) ; : : : ; (xN ; yN )

to create an e¤ective prediction rule

yb = fb(x)

This is one kind of discovery and exploitation of structure in the training data.
As we also said in Section 1.1, "unsupervised learning" is discovery and quantification of structure in
\[
\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}
\]

without reference to some particular coordinate of a p-dimensional x as an


object of prediction. There are a number of versions of this problem in Ch 14
of HTF that we will outline here.

17.1 Association Rules/Market Basket Analysis


Suppose that one is presented with a database representing N transactions, each of which may or may not include each one of items
\[
s_1, s_2, \ldots, s_p
\]
so that one could think of x taking values in $\{0,1\}^p$, $x_j = 1$ indicating presence of item j in the transaction. For two disjoint sets of items
\[
S_1 = \{s_{11}, s_{12}, \ldots, s_{1k_1}\} \quad \text{and} \quad S_2 = \{s_{21}, s_{22}, \ldots, s_{2k_2}\}
\]

consider transactions that

1. include all items in S1 ,


2. include all items in S2 , or
3. include all items in S = S1 [ S2 .

In applications of this formalism to "market-basket analysis" it is common
to call S; S1 ; and S2 item sets and the statement

"the transaction includes all of both item set S1 and item set S2 "

a conjunctive rule. It is then common to further talk about association


rules of the form
S1 =) S2 (181)
and to consider quantitative measures associated with them. In framework
(181), S1 is called the antecedent and S2 is called the consequent in the rule.
Define indicator variables
\[
I_{ij} = I[\text{transaction } i \text{ includes all of item set } S_j]
\]
for $i = 1, \ldots, N$ and $j = 1, 2$. For the association rule $S_1 \Rightarrow S_2$,

1. the support of the rule (also the support of the item set S) is
\[
\frac{1}{N}\sum_{i=1}^N I_{i1} I_{i2}
\]
(the relative frequency with which the full item set is seen in the database/training cases),

2. the confidence or predictability of the rule is
\[
\frac{\sum_{i=1}^N I_{i1} I_{i2}}{\sum_{i=1}^N I_{i1}}
\]
(the relative frequency with which the full item set S is seen in the training cases that exhibit the smaller item set $S_1$),

3. the "expected confidence" of the rule is
\[
\frac{1}{N}\sum_{i=1}^N I_{i2}
\]
(the relative frequency with which item set $S_2$ is seen in the training cases), and

4. the lift of the rule is
\[
\frac{\text{confidence}}{\text{expected confidence}} = \frac{N\sum_{i=1}^N I_{i1} I_{i2}}{\sum_{i=1}^N I_{i1}\, \sum_{i=1}^N I_{i2}}
\]
(a measure of association).
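For a 0-1 transaction matrix X these four quantities are one-liners; the sketch below (Python/numpy, illustrative names) computes them for item sets given as lists of column indices.

```python
import numpy as np

def rule_measures(X, S1, S2):
    # X: N x p array of 0/1 indicators; S1, S2: disjoint lists of item (column) indices
    I1 = X[:, S1].all(axis=1)            # I_{i1}: transaction i contains all of S1
    I2 = X[:, S2].all(axis=1)            # I_{i2}: transaction i contains all of S2
    support = np.mean(I1 & I2)
    confidence = (I1 & I2).sum() / I1.sum()
    expected_confidence = np.mean(I2)
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift
```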

If one thinks of the cases in the training set as a random sample from some
distribution on item sets (equivalently, a distribution for x), lets I1 stand for
the event that all items in S1 are in the set, I2 stand for the event that all items
in S2 are in the set, and I stand for the event that all items in S = S1 [ S2 are
in the set, then

1. the support of the rule is an estimate of P (I),


2. the con…dence is an estimate of P (I2 jI1 ),

3. the expected con…dence is an estimate of P (I2 ),


4. and the lift is an estimate of the ratio P (I1 and I2 ) = (P (I1 ) P (I2 )).

The basic thinking about association rules seems to be that usually (but
perhaps not always) one wants rules with large support (so that the estimates
can be reasonably expected to be reliable). Further, one then wants large
con…dence or lift, as these indicate that the corresponding rule will be useful in
terms of understanding how the coordinates of x (presence or absence of various
items) are related in the database/training data. Apparently, standard practice
is to identify a large number of promising item sets and association rules, and
make a database of association rules that can be queried in searches like:

"Find all rules in which YYY is the consequent that have con…dence
over 70% and support more than 1%."

Basic questions that we have to this point not addressed are where one gets
appropriate item sets S and how one uses them to produce (S1 and S2 and)
corresponding association rules. In answer to the second of these questions,
one might say "consider all 2jSj 2 association rules that can be associated with
a given item set." But what then are "interesting" item sets S or how does
one …nd a potentially useful set of such? We proceed to brie‡y consider these
issues.

17.1.1 The "Apriori Algorithm" and Use of its Output


One standard way of generating item sets (to process into association rules)
is to use the so-called "apriori algorithm." This produces all item sets S of
support at least t. (These can then be examined to …nd potentially interesting
association rules by breaking them into two pieces S1 and S2 ).
This operates as follows.

1. Pass through all p items
\[
s_1, s_2, \ldots, s_p
\]
identifying those $s_j$ that individually have support/prevalence
\[
\frac{1}{N}\#\{i \mid x_{ij} = 1\}
\]
at least t and place them in the set
\[
S_1^t = \{\text{item sets of size 1 with support at least } t\}
\]

2. For each $s_j \in S_1^t$ check to see which two-element item sets
\[
\{s_j, s_{j'}\}_{j' \ne j \text{ and } s_{j'} \in S_1^t}
\]
have support/prevalence
\[
\frac{1}{N}\#\{i \mid x_{ij} x_{ij'} = 1\}
\]
at least t and place them in the set
\[
S_2^t = \{\text{item sets of size 2 with support at least } t\}
\]
$\vdots$

m. For each $\overbrace{\{s_j, s_{j'}, \ldots\}}^{m-1 \text{ entries}} \in S_{m-1}^t$ check to see which m-element item sets
\[
\{s_j, s_{j'}, \ldots\} \cup \{s_{j^*}\} \quad \text{for } j^* \notin \{j, j', \ldots\} \text{ and } s_{j^*} \in S_1^t
\]
have support/prevalence
\[
\frac{1}{N}\#\{i \mid x_{ij} x_{ij'} \cdots x_{ij^*} = 1\}
\]
at least t and place them in the set
\[
S_m^t = \{\text{item sets of size } m \text{ with support at least } t\}
\]
This algorithm terminates when at some stage m the set $S_m^t$ is empty. Then a sensible set of item sets (to consider for making association rules) is $S^t = \cup_m S_m^t$, the set of all item sets with prevalence in the training data of at least t. Apparently for commercial databases of "typical size," unless t is very small it is feasible to use this algorithm to find $S^t$. It is also possible to use a variant of the apriori algorithm to find all association rules based on item sets in $S^t$ with confidence at least c. This then produces a database of association rules that can be queried by a user wishing to identify useful structure in the database/training dataset.
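A deliberately naive sketch of the level-wise search just described is given below in Python; it generates candidate item sets of size m by extending the frequent sets of size m - 1 with single frequent items, exactly as in steps 1 through m above. (Real implementations add candidate-pruning refinements not shown here; the names are illustrative.)

```python
import numpy as np

def apriori(X, t):
    # X: N x p array of 0/1 indicators; t: minimum support; returns all frequent item sets
    N, p = X.shape
    support = lambda items: np.mean(X[:, list(items)].all(axis=1))
    frequent_singletons = [frozenset([j]) for j in range(p) if support([j]) >= t]
    all_frequent, current = list(frequent_singletons), list(frequent_singletons)
    while current:
        # candidate sets one item larger than the current frequent sets
        candidates = {s | single for s in current for single in frequent_singletons
                      if not single <= s}
        current = [s for s in candidates if support(s) >= t]
        all_frequent.extend(current)
    return all_frequent
```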
In a more statistical vein, one can adopt from $S^t$ some consequent of interest $S = \{s_1, s_2, \ldots, s_l\}$ and consider modeling of the binary variable
\[
I[\text{all items in } S \text{ are in a transaction}] = \prod_{j \text{ s.t. } s_j \in S} x_j
\]
on the basis of some non-overlapping set of variables related to an antecedent $S^*$ (disjoint from S, belonging to $S^t$). For example, a natural possibility is to use logistic regression based on the set of variables $x_j$ with $s_j \in S^*$ to look for items (or sets of items if products of these indicators are employed) that are associated with "large" (or "increased") probabilities of the consequent.

17.2 Clustering
Typically (but not always) the object in "clustering" is to find natural groups of rows or columns of
\[
\underset{N \times p}{X} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}
\]
(in some contexts one may want to somehow find homogeneous "blocks" in a properly rearranged X). Sometimes all columns of X represent values of continuous variables (so that ordinary arithmetic applied to all its elements is meaningful). But sometimes some columns correspond to ordinal or even categorical variables. In light of all this, we will let $x_i$, $i = 1, 2, \ldots, r$, stand for "items" to be clustered (that might be rows or columns of X) with entries that need not necessarily be continuous variables.
In developing and describing clustering methods, it is often useful to have a dissimilarity measure d(x, z) that (at least for the items to be clustered and perhaps for other possible items) quantifies how "unalike" items are. This measure is usually chosen to satisfy

1. $d(x, z) \ge 0 \;\; \forall x, z$,

2. $d(x, x) = 0 \;\; \forall x$, and

3. $d(x, z) = d(z, x) \;\; \forall x, z$.

It may be chosen to further satisfy

4. $d(x, z) \le d(x, w) + d(z, w) \;\; \forall x, z, \text{ and } w$, or

4'. $d(x, z) \le \max[d(x, w), d(z, w)] \;\; \forall x, z, \text{ and } w$.

Where 1-4 hold, d is a "metric." Where 1-3 hold and the stronger condition 4' holds, d is an "ultrametric."
In a case where one is clustering rows of X and each column of X contains values of a continuous variable, a squared Euclidean distance is a natural choice for a dissimilarity measure
\[
d(x_i, x_{i'}) = \|x_i - x_{i'}\|^2 = \sum_{j=1}^p (x_{ij} - x_{i'j})^2
\]
In a case where one is clustering columns of X and each column of X contains values of a continuous variable, with $r_{jj'}$ the sample correlation between values in columns j and j', a plausible dissimilarity measure is
\[
d(x_j, x_{j'}) = 1 - |r_{jj'}|
\]

When dissimilarities between r items are organized into a (non-negative
symmetric) r r matrix

D = (dij ) = (d (xi ; xj ))

with 0s down its diagonal, the terminology "proximity matrix" is often used.
For some clustering algorithms and for some purposes, the proximity matrix
encodes all one needs to know about the items to do clustering. One seeks a
partition of the index set f1; 2; : : : ; rg into subsets such that the dij for indices
within a subset are small (and the dij for indices i and j from di¤erent subsets
are large).

17.2.1 Partitioning Methods ("Centroid"-Based Methods)


By far the most commonly used clustering methods are based on partitioning
related to "centroids," particularly the so called "K-means" clustering algorithm
for the rows of X in cases where the columns contain values of continuous
variables xj (for which arithmetic averaging makes sense).43
The algorithm begins with some set of K distinct "centers" $c_1^0, c_2^0, \ldots, c_K^0$. They might, for example, be a random selection of the rows of X (subject to the constraint that they are distinct). One then assigns each $x_i$ to that center $c_{k^0(i)}^0$ minimizing
\[
d\left(x_i, c_l^0\right)
\]
over choice of l (creating K clusters around the centers) and replaces all of the $c_k^0$ with the corresponding cluster means
\[
c_k^1 = \frac{1}{\#\text{ of } i \text{ with } k^0(i) = k}\sum I\left[k^0(i) = k\right] x_i
\]
At stage m, with all $c_k^{m-1}$ available, one then assigns each $x_i$ to that center $c_{k^{m-1}(i)}^{m-1}$ minimizing
\[
d\left(x_i, c_l^{m-1}\right)
\]
over choice of l (creating K clusters around the centers) and replaces all of the $c_k^{m-1}$ with the corresponding cluster means
\[
c_k^m = \frac{1}{\#\text{ of } i \text{ with } k^{m-1}(i) = k}\sum I\left[k^{m-1}(i) = k\right] x_i
\]
This iteration goes on to convergence. One compares multiple random starts for a given K (and then minimum values found for each K) in terms of
\[
\text{Total Within-Cluster Dissimilarity}(K) = \sum_{k=1}^K \sum_{x_i \text{ in cluster } k} d(x_i, c_k)
\]
for $c_1, c_2, \ldots, c_K$ the final means produced by the iterations.$^{44}$ One may then consider the (monotone) sequence of Total Within-Cluster Dissimilarities and try to identify a value K beyond which there seem to be diminishing returns for increased K.
$^{43}$In this context, a natural choice of d(x, z) is $\|x - z\|^2$. A fancier option might be built on squared Mahalanobis distance, $(x - z)'Q(x - z)$ for some non-negative definite Q.
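The K-means iteration with multiple random starts, compared on Total Within-Cluster Dissimilarity, might be coded as in this Python/numpy sketch (illustrative; squared Euclidean distance is used for d).

```python
import numpy as np

def kmeans(X, K, n_starts=10, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    best = (np.inf, None, None)
    for _ in range(n_starts):
        centers = X[rng.choice(len(X), size=K, replace=False)]   # distinct starting centers
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            new = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(K)])
            if np.allclose(new, centers):
                break
            centers = new
        # Total Within-Cluster Dissimilarity(K) for this start, using the final centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        total = d2.min(axis=1).sum()
        if total < best[0]:
            best = (total, centers, d2.argmin(axis=1))
    return best   # (criterion value, final centers, cluster assignments)
```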
A more general version of this algorithm (that might be termed a K-medoid algorithm) doesn't require that the entries of the $\mathbf{x}_i$ be values of continuous variables, but (since it is then unclear that one can even evaluate, let alone find a general minimizer of, $d(\mathbf{x}_i, \cdot)$) restricts the "centers" to be original items. This algorithm begins with some set of $K$ distinct "medoids" $\mathbf{c}_1^0, \mathbf{c}_2^0, \ldots, \mathbf{c}_K^0$ that are a random selection from the $r$ items $\mathbf{x}_i$ (subject to the constraint that they are distinct). One then assigns each $\mathbf{x}_i$ to that medoid $\mathbf{c}_{k^0(i)}^0$ minimizing
\[ d\left( \mathbf{x}_i, \mathbf{c}_l^0 \right) \]
over choice of $l$ (creating $K$ clusters associated with the medoids) and replaces all of the $\mathbf{c}_k^0$ with $\mathbf{c}_k^1$, the corresponding minimizers over the $\mathbf{x}_{i'}$ belonging to cluster $k$ of the sums
\[ \sum_{i \text{ with } k^0(i)=k} d\left( \mathbf{x}_i, \mathbf{x}_{i'} \right) \]
At stage $m$, with all the $\mathbf{c}_k^{m-1}$ available, one then assigns each $\mathbf{x}_i$ to that medoid $\mathbf{c}_{k^{m-1}(i)}^{m-1}$ minimizing
\[ d\left( \mathbf{x}_i, \mathbf{c}_l^{m-1} \right) \]
over choice of $l$ (creating $K$ clusters around the medoids) and replaces all of the $\mathbf{c}_k^{m-1}$ with $\mathbf{c}_k^m$, the corresponding minimizers over the $\mathbf{x}_{i'}$ belonging to cluster $k$ of the sums
\[ \sum_{i \text{ with } k^{m-1}(i)=k} d\left( \mathbf{x}_i, \mathbf{x}_{i'} \right) \]
This iteration goes on to convergence. One compares multiple random starts for a given $K$ (and then minimum values found for each $K$) in terms of
\[ \sum_{k=1}^K \sum_{\mathbf{x}_i \text{ in cluster } k} d\left( \mathbf{x}_i, \mathbf{c}_k \right) \]
for $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K$ the final medoids produced by the iterations.
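A corresponding minimal sketch of the K-medoid iteration, working directly from an $r \times r$ proximity matrix (again only an illustration; the function name and defaults are this sketch's own):

import numpy as np

def kmedoids(D, K, max_iter=100, rng=None):
    """K-medoid clustering from an r x r dissimilarity matrix D.
    Returns (medoid indices, cluster labels, total dissimilarity)."""
    D = np.asarray(D)
    r = D.shape[0]
    rng = np.random.default_rng(rng)
    medoids = rng.choice(r, size=K, replace=False)
    for _ in range(max_iter):
        # assign each item to its closest current medoid
        labels = D[:, medoids].argmin(axis=1)
        # within each cluster, the new medoid minimizes the sum of
        # dissimilarities to the other items in that cluster
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(labels == k)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    total = D[np.arange(r), medoids[labels]].sum()
    return medoids, labels, total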

17.2.2 Hierarchical Methods


To apply a hierarchical clustering method, one must first choose a method of using dissimilarities for items to define dissimilarities for clusters. Three common (and somewhat obvious) possibilities in this regard are as follows. For $C_1$ and $C_2$ different elements of a partition of the set of items, or equivalently of their $r$ indices, one might define the dissimilarity of $C_1$ and $C_2$ as

1. $D(C_1, C_2) = \min\{ d_{ij} \mid i \in C_1 \text{ and } j \in C_2 \}$ (this is the "single linkage" or "nearest neighbor" choice),

2. $D(C_1, C_2) = \max\{ d_{ij} \mid i \in C_1 \text{ and } j \in C_2 \}$ (this is the "complete linkage" choice), or

3. $D(C_1, C_2) = \frac{1}{\#C_1 \#C_2} \sum_{i \in C_1,\, j \in C_2} d_{ij}$ (this is the "average linkage" choice).

Footnote 44: For a squared Euclidean distance $d$, this is a total squared distance of the $\mathbf{x}_i$s to their corresponding cluster means.

There are both agglomerative/bottom-up methods and divisive/top-down methods of hierarchical clustering. An agglomerative hierarchical clustering algorithm operates as follows. One begins with every item $\mathbf{x}_i$, $i = 1, 2, \ldots, r$, functioning as a singleton cluster. Then one finds the minimum $d_{ij}$ for $i \ne j$ and puts the corresponding two items into a single cluster (of size 2). Then, when one is at a stage where there are $m$ clusters, one finds the two clusters with minimum dissimilarity and merges them to make a single cluster, leaving $m-1$ clusters overall. This continues until there is only a single cluster. The sequence of $r$ different clusterings (with $r$ through 1 clusters) serves as a menu of potentially interesting solutions to the clustering problem. These are often displayed in the form of a dendrogram, where cutting the dendrogram at a given level picks out one of the (increasingly coarse as the level rises) clusterings. Those items clustered together "deep" in the tree/dendrogram are presumably interpreted to be potentially "more alike" than ones clustered together only at a high level.
A divisive hierarchical algorithm operates as follows. Starting with a single "cluster" consisting of all items, one finds the maximum $d_{ij}$ and uses the two corresponding items as seeds for two clusters. One then assigns each $\mathbf{x}_l$ for $l \ne i$ and $l \ne j$ to the cluster represented by $\mathbf{x}_i$ if
\[ d(\mathbf{x}_i, \mathbf{x}_l) < d(\mathbf{x}_j, \mathbf{x}_l) \]
and to the cluster represented by $\mathbf{x}_j$ otherwise. When one is at a stage where there are $m$ clusters, one identifies the cluster with the largest $d_{ij}$ corresponding to a pair of elements in the cluster, splitting it using the method applied to split the original "single large cluster" (to produce an $(m+1)$-cluster clustering). This, like the agglomerative algorithm, produces a sequence of $r$ different clusterings (with 1 through $r$ clusters) that serves as a menu of potentially interesting solutions to the clustering problem. And like the sequence produced by the agglomerative algorithm, this sequence can be represented using a dendrogram.

One may modify either the agglomerative or divisive algorithms by fixing a threshold $t > 0$ for use in deciding whether or not to merge two clusters or to split a cluster. The agglomerative version would terminate when all pairs of existing clusters have dissimilarities more than $t$. The divisive version would terminate when all dissimilarities for pairs of items in all clusters are below $t$. Fairly obviously, employing a threshold has the potential to shorten the menu of clusterings produced by either of the methods to include fewer than $r$ clusterings. (Obviously, thresholding the agglomerative method cuts off the top of the corresponding full dendrogram, and thresholding the divisive method cuts off the bottom of the corresponding full dendrogram.)
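For practical use, SciPy's standard hierarchical clustering routines implement exactly this agglomerative menu-of-clusterings/dendrogram idea; a brief usage sketch with made-up data (an illustration, not code from the notes):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.randn(30, 4)                        # 30 items in R^4
d = pdist(X)                                      # condensed vector of pairwise dissimilarities
Z = linkage(d, method="average")                  # "single", "complete", or "average" linkage
labels4 = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into 4 clusters
# dendrogram(Z) draws the full tree when a plotting backend is available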

17.2.3 (Mixture) Model-Based Methods
A completely different approach to clustering into $K$ clusters is based on use of mixture models. That is, for purposes of producing a clustering, one might consider acting as if items $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$ are realizations of $r$ iid random vectors with parametric marginal density
\[ q(\mathbf{x} \mid \boldsymbol{\pi}, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K) = \sum_{k=1}^K \pi_k\, p(\mathbf{x} \mid \boldsymbol{\theta}_k) \tag{182} \]
for probabilities $\pi_k > 0$ with $\sum_{k=1}^K \pi_k = 1$, a fixed parametric density $p(\mathbf{x} \mid \boldsymbol{\theta})$, and parameters $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K$. (Without further restrictions the family of mixture distributions specified by density (182) is not identifiable, but we'll ignore that fact for the moment.)

A useful way to think about this formalism is in terms of a $K$-class classification model where values of $y$ are latent/unobserved/completely fictitious. This produces density (182) as the marginal density of $\mathbf{x}$. Further, in the model including a latent $y$,
\[ P[y = k \mid \mathbf{x}] = \frac{\pi_k\, p(\mathbf{x} \mid \boldsymbol{\theta}_k)}{\sum_{k'=1}^K \pi_{k'}\, p(\mathbf{x} \mid \boldsymbol{\theta}_{k'})} \]
is the (Bayes posterior) probability that $\mathbf{x}$ was generated by component $k$ of the mixture. It then would make sense to define cluster $k$ to be the set of $\mathbf{x}_i$ for which
\[ k = \arg\max_l \frac{\pi_l\, p(\mathbf{x}_i \mid \boldsymbol{\theta}_l)}{\sum_{k'=1}^K \pi_{k'}\, p(\mathbf{x}_i \mid \boldsymbol{\theta}_{k'})} = \arg\max_l\ \pi_l\, p(\mathbf{x}_i \mid \boldsymbol{\theta}_l) \]
This is the set of $\mathbf{x}_i$ that would be classified to class $k$ by the optimal (Bayes) classifier.

In practice, $\boldsymbol{\pi}, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K$ must be estimated and the estimates used in place of parameters in defining clusters. That is, an implementable clustering method is to define cluster $k$ (say, $C_k$) to be
\[ C_k = \left\{ \mathbf{x}_i \,\middle|\, k = \arg\max_l\ \hat{\pi}_l\, p\!\left( \mathbf{x}_i \mid \hat{\boldsymbol{\theta}}_l \right) \right\} \tag{183} \]
Given the lack of identifiability in the unrestricted mixture model, it might appear that prescription (183) could be problematic. But such is not really the case. While the likelihood
\[ L(\boldsymbol{\pi}, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K) = \prod_{i=1}^r q(\mathbf{x}_i \mid \boldsymbol{\pi}, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K) \]
will have multiple maxima, using any maximizer as an estimate of the parameter vector will produce the same set of clusters (183). It is common to employ the "EM algorithm" in the maximization of $L(\boldsymbol{\pi}, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$ (the finding of one of many maximizers) and to include details of that algorithm in expositions of model-based clustering. However, strictly speaking, that algorithm is not intrinsic to the basic notion here, namely the use of the clusters in display (183).
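As a concrete sketch of prescription (183) under Gaussian components, one can lean on scikit-learn's EM-based GaussianMixture (a usage illustration with synthetic data, not anything specific to these notes):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2),             # component 1
               np.random.randn(100, 2) + [4, 4]])   # component 2

gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
labels = gm.predict(X)            # k maximizing estimated pi_k p(x_i | theta_k), i.e. display (183)
posteriors = gm.predict_proba(X)  # estimated P[y = k | x_i]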

17.2.4 Biclustering
An interesting and often useful variant of the clustering problem is one in which a doubly indexed set of observations $x_{ij}$ for $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$ (that might be thought of as laid out in an $I \times J$ two-way array or table) needs to be simultaneously put into $R$ (row) clusters over index $i$ and $C$ (column) clusters over index $j$ in such a way that the $R \cdot C$ cells are each homogeneous. Figure 42 portrays an $I = 6$ by $J = 12$ toy example with values of 72 univariate $x_{ij}$ portrayed in "heat map" fashion. The object of simple biclustering is to regroup/rearrange rows and columns to make groups producing homogeneous "cells." We'll use the notation $r(i)$ for the row cluster index for data row $i$ and $c(j)$ for the column cluster index for data column $j$.

Figure 42: A toy $6 \times 12$ dataset clustered into $R = 2$ row clusters and $C = 3$ column clusters. (From Li, Reisner, Pham, Olafsson and Vardeman.) Values of the $x_{ij}$s are portrayed in heat-map fashion.

An Alternating Shuffling Algorithm An "alternating shuffling" algorithm of Li et al. for finding $R$ good sets of rows and simultaneously $C$ good sets of columns is based on a series of $R \times C$ matrices of means $\mathbf{M} = (m_{rc})$, and $R$ row vectors of length $J$ and $C$ column vectors of length $I$ with entries from rows and columns of $\mathbf{M}$.

1. One begins with some clustering of rows of the data matrix into $R$ clusters and columns of the data matrix into $C$ clusters, and computes for each $(r,c)$ "cell" a sample mean of the $x_{ij}$s with $r(i) = r$ and $c(j) = c$ (with row $i$ in row cluster $r$ and column $j$ in column cluster $c$), creating an initial matrix $\mathbf{M}$.

2. For each $r = 1, 2, \ldots, R$ one makes a new ($J$-dimensional) row vector "center" $\mathbf{v}_r$ with $j$th entry $m_{r c(j)}$ and re-clusters all rows in "K-means" fashion (assigning each row of values $x_{ij}$ to the closest center using squared Euclidean $\Re^J$ distance). With this new row clustering one recomputes the matrix of means $\mathbf{M}$.

3. For each $c = 1, 2, \ldots, C$ one makes a new ($I$-dimensional) column vector "center" $\mathbf{w}_c$ with $i$th entry $m_{r(i) c}$ and re-clusters all columns in "K-means" fashion (assigning each column of values $x_{ij}$ to the closest center using squared Euclidean $\Re^I$ distance). With this new column clustering one recomputes the matrix of means $\mathbf{M}$.

4. If $\sum_{i,j} \left( x_{ij} - m_{r(i)c(j)} \right)^2$ is small and/or has ceased to decline with iterations, the algorithm terminates. Otherwise it returns to step 2.

Various "tweaks" are applied to this algorithm to deal with the eventuality
that row or column clusters go empty. Multiple random starts are employed in
the search of a good biclustering. The issue of what R and C should be used
involves weighing complexity (large numbers of clusters) against a small value
of the cell inhomogeneity criterion of step 4. All of this said, the algorithm
is simple and e¤ective, and appropriate modi…cation of it allows the direct
handling of even cases where not every cell of the I J table is full.
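A minimal numpy sketch of the alternating shuffling iteration above (an illustration under squared Euclidean distance; the helper names are this sketch's own, and empty cells are handled crudely by falling back to the grand mean):

import numpy as np

def cell_means(X, r, c, R, C):
    """R x C matrix of cell means m_rc (grand mean for empty cells)."""
    M = np.full((R, C), X.mean())
    for a in range(R):
        for b in range(C):
            cell = X[np.ix_(r == a, c == b)]
            if cell.size:
                M[a, b] = cell.mean()
    return M

def bicluster(X, R, C, max_iter=50, rng=None):
    """Alternating row/column re-clustering of an I x J data matrix X."""
    rng = np.random.default_rng(rng)
    I, J = X.shape
    r = rng.integers(R, size=I)              # initial row cluster labels r(i)
    c = rng.integers(C, size=J)              # initial column cluster labels c(j)
    for _ in range(max_iter):
        # step 2: re-cluster rows; row "center" a has jth entry m_{a c(j)}
        M = cell_means(X, r, c, R, C)
        r_new = ((X[:, None, :] - M[:, c][None, :, :]) ** 2).sum(2).argmin(1)
        # step 3: re-cluster columns; column "center" b has ith entry m_{r(i) b}
        M = cell_means(X, r_new, c, R, C)
        c_new = ((X.T[:, None, :] - M[r_new].T[None, :, :]) ** 2).sum(2).argmin(1)
        if np.array_equal(r_new, r) and np.array_equal(c_new, c):
            break
        r, c = r_new, c_new
    M = cell_means(X, r, c, R, C)
    criterion = ((X - M[r][:, c]) ** 2).sum()    # step 4 inhomogeneity criterion
    return r, c, M, criterion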

Chakraborty's Bayes Biclustering The dissertation of Abhishek Chakraborty takes a Bayes modeling and analysis approach to biclustering univariate observations $x_{ij}$. To the notation above, add model parameters $\mu_{rc}$ for $r = 1, \ldots, R$ and $c = 1, \ldots, C$, and $\sigma^2 > 0$, and adopt a data model that, given these parameters, the $I \cdot J$ observations $x_{ij}$ are independent with
\[ x_{ij} \sim \mathrm{N}\left( \mu_{r(i)c(j)}, \sigma^2 \right) \]
Chakraborty's Bayes analysis then sets priors of independence for the vector $\mathbf{r} = (r(1), r(2), \ldots, r(I))$, the vector $\mathbf{c} = (c(1), c(2), \ldots, c(J))$, and the $R \times C$ means $\mu_{rc}$.
A useful prior distribution for the means $\mu_{rc}$ is one of iid $\mathrm{N}(0, \tau^2)$ variables for a parameter $\tau^2 > 0$. Useful priors for $\mathbf{r}$ and $\mathbf{c}$ are based on "Polya urn schemes." Take the case of $\mathbf{r}$. Let
\[ n_q(\mathbf{r}) = \#\left[ r(i) = q \text{ for } i = 1, 2, \ldots, I \right] \]
and for an $\alpha > 0$ consider the distribution with pmf
\[ g(\mathbf{r}) = \left( \prod_{i=1}^I \frac{1}{\alpha + i - 1} \right) \prod_{q \text{ s.t. } n_q(\mathbf{r}) > 0} \frac{\alpha}{I}\left( \frac{\alpha}{I} + 1 \right) \cdots \left( \frac{\alpha}{I} + n_q(\mathbf{r}) - 1 \right) \]
This symmetric distribution has the conditional distribution
\[ g\left( r(I) = r \mid r(1), r(2), \ldots, r(I-1) \right) = \frac{\frac{\alpha}{I} + \#\left[ r(i) = r \text{ for } i = 1, 2, \ldots, I-1 \right]}{\alpha + I - 1} \]
The case of a prior $h(\mathbf{c})$ is completely analogous. The parameters $\sigma^2$, $\tau^2$, and $\alpha$ are treated as tuning parameters for the analysis.
This probability structure admits very simple Gibbs MCMC sampling and provides iterates from the posterior distribution over all of the means and (more importantly) over the biclustering specified by the pair $(\mathbf{r}, \mathbf{c})$. For a given pair $(\mathbf{r}, \mathbf{c})$, rows $i$ and $i'$ with $r(i) = r(i')$ are clustered together, and columns $j$ and $j'$ with $c(j) = c(j')$ are clustered together. Observations $x_{ij}$ and $x_{i'j'}$ with both $r(i) = r(i')$ and $c(j) = c(j')$ are in the same "cell" of the two-way clustering. The MCMC provides (through simple relative frequencies for iterates $(\mathbf{r}, \mathbf{c})^j$) approximate posterior probabilities that each pair of rows, each pair of columns, and each pair of observations belong together in a clustering.
There are various ways to make use of the iterates representing the posterior distribution. One is to carry along with the MCMC iterates $(\mathbf{r}, \mathbf{c})^j$ iterates of the means matrix $\mathbf{M}$ (from the Li et al. algorithm) and identify an iterate with minimum $\sum_{i,j} \left( x_{ij} - m_{r(i)c(j)} \right)^2$, using that iterate to represent the posterior distribution. Another (preferable) option is to identify a "central" iterate as follows. For two pairs $(\mathbf{r}, \mathbf{c})$ and $(\mathbf{r}^*, \mathbf{c}^*)$ one measure of their total disagreement in clustering of the $x_{ij}$s is
\[
\begin{aligned}
L\left( (\mathbf{r}, \mathbf{c}), (\mathbf{r}^*, \mathbf{c}^*) \right) = &\sum_{(i,j),(i',j')} I\left[ r(i) = r(i') \text{ and } c(j) = c(j') \right] I\left[ r^*(i) \ne r^*(i') \text{ or } c^*(j) \ne c^*(j') \right] \\
&+ \sum_{(i,j),(i',j')} I\left[ r(i) \ne r(i') \text{ or } c(j) \ne c(j') \right] I\left[ r^*(i) = r^*(i') \text{ and } c^*(j) = c^*(j') \right]
\end{aligned}
\]
the total number of pairs of $x_{ij}$s clustered together by only one of the two associated biclusterings. For fixed $(\mathbf{r}, \mathbf{c})$ one might take
\[ \bar{L}\left( (\mathbf{r}, \mathbf{c}) \right) = \overline{ L\left( (\mathbf{r}, \mathbf{c}), (\mathbf{r}, \mathbf{c})^j \right) } \]
to be the arithmetic average across MCMC iterates of disagreement between clusterings of the $x_{ij}$s prescribed by $(\mathbf{r}, \mathbf{c})$ and by the iterates. An $(\mathbf{r}, \mathbf{c})$ minimizing this is a kind of central biclustering for representing the posterior, and while exact optimization of $\bar{L}((\mathbf{r}, \mathbf{c}))$ is computationally too hard, simply picking an iterate with smallest $\bar{L}\left( (\mathbf{r}, \mathbf{c})^j \right)$ seems to be an effective way to represent the posterior and provide a single practically useful biclustering.
It is worth pointing out several things about this methodology. First, $\alpha$ is a kind of "prior sample size" and controls the distribution of the number of non-empty row (and column) clusters. Small $\alpha$ goes with posterior distributions for $\mathbf{r}$ (or $\mathbf{c}$) concentrated on possible values with relatively few implied row (or column) clusters. Large $\alpha$ amounts to a prior for $\mathbf{r}$ (or $\mathbf{c}$) with iid uniform coordinates, typically giving large weight to $\mathbf{r}$ (or $\mathbf{c}$) values with many implied row (or column) clusters. Second, in this development, it is quite natural and effective to use values of $R$ and $C$ that are only loose upper bounds for seemingly appropriate numbers of row and column clusters, and let the analysis more or less sort out what numbers are genuinely plausible (in terms of the posterior distributions of non-zero $n_q(\mathbf{r})$ and $n_q(\mathbf{c})$). Finally, it is possible to allow some $x_{ij}$s to be unobserved in this development. In "missing at random" contexts, unobserved $x_{ij}$s can simply be treated as latent or auxiliary in the MCMC. And for other contexts, modeling of censoring mechanisms provides Bayes analyses where missingness is informative about the value of an unobserved $x_{ij}$.

17.2.5 Self-Organizing Maps


For items $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$ belonging to $\Re^p$, the object here is to find $L \cdot M$ cluster centers/prototypes that adequately represent the items, where one wishes to think of those cluster centers/prototypes as indexed on an $L \times M$ regular grid in 2 dimensions (that one might take to be $\{1, 2, \ldots, L\} \times \{1, 2, \ldots, M\}$), with cluster centers/prototypes whose index vectors are close on the grid being close in $\Re^p$. (There could, of course, be 3-dimensional versions of this, and so on.) The object is both production of the set of centers/prototypes and assignment of data points to centers/prototypes. It thus amounts to some kind of modified/constrained $K = L \cdot M$ group clustering problem and simultaneous discovery of low-dimensional (typically 2-dimensional) structure in the items. This is illustrated in cartoon fashion in Figure 43.

Figure 43: Cartoon of a Self-Organizing-Map assignment of points $\mathbf{x}$ in $\Re^p$ to cluster centers, themselves mapped to points on a grid in $\Re^2$.

One will typically begin with standardization of the $p$ coordinate variables $x_j$ (so that $\sum_i \mathbf{x}_i = \mathbf{0}$ and the sample variance of each set of values $\{x_{ij}\}_{i=1,2,\ldots,r}$ is 1). This puts all of the $x_j$ on the same scale and doesn't allow one coordinate of an $\mathbf{x}_i$ to dominate a Euclidean norm. Standard treatment of this topic seems to be driven by two somewhat ad hoc algorithms of Kohonen. Here we'll first describe those algorithms and then discuss a Bayes approach to the problem due to Zhou.

Kohonen's Algorithms One begins with some set of initial cluster centers $\left\{ \mathbf{z}_{lm}^0 \right\}_{l=1,\ldots,L \text{ and } m=1,\ldots,M}$. This might be a random selection (without replacement or the possibility of duplication) from the set of items. It might be a set of grid points in the 2-dimensional plane in $\Re^p$ defined by the first two principal components of the items $\{\mathbf{x}_i\}_{i=1,\ldots,r}$. And there are surely other sensible possibilities. Then define neighborhoods on the $L \times M$ grid, $N(l,m)$, that are subsets of the grid "close" in some kind of distance (like regular Euclidean distance) to the various elements of the $L \times M$ grid. $N(l,m)$ could be all of the grid, $(l,m)$ alone, all grid points $(l',m')$ within some constant 2-dimensional Euclidean distance of $(l,m)$, etc. Then define a weighting function on $\Re^p$, say $w(\|\mathbf{x}\|)$, so that $w(0) = 1$ and $w(\|\mathbf{x}\|) \ge 0$ is monotone non-increasing in $\|\mathbf{x}\|$. For some schedule of non-increasing positive constants $1 > \epsilon_1 \ge \epsilon_2 \ge \epsilon_3 \ge \cdots$, the SOM algorithms define iteratively sets of cluster centers/prototypes $\left\{ \mathbf{z}_{lm}^j \right\}$ for $j = 1, 2, \ldots$.

At iteration $j$, an "online" version of SOM selects (randomly or perhaps in turn from an initially randomly set ordering of the items) an item $\mathbf{x}^j$ and

1. identifies the center/prototype $\mathbf{z}_{lm}^{j-1}$ closest to $\mathbf{x}^j$ in $\Re^p$, call it $\mathbf{b}^j$, with corresponding grid coordinates $(l,m)^j$ (Izenman calls $\mathbf{b}^j$ the "BMU" or best-matching-unit),

2. adjusts those $\mathbf{z}_{lm}^{j-1}$ with index vectors belonging to $N\left( (l,m)^j \right)$ (close to the BMU index vector on the 2-dimensional grid) toward $\mathbf{x}^j$ by the prescription
\[ \mathbf{z}_{lm}^j = \mathbf{z}_{lm}^{j-1} + \epsilon_j\, w\left( \left\| \mathbf{z}_{lm}^{j-1} - \mathbf{b}^j \right\| \right) \left( \mathbf{x}^j - \mathbf{z}_{lm}^{j-1} \right) \]
(adjusting those centers different from the BMU potentially less dramatically than the BMU), and

3. for those $\mathbf{z}_{lm}^{j-1}$ with index pairs $(l,m)$ not belonging to $N\left( (l,m)^j \right)$ sets
\[ \mathbf{z}_{lm}^j = \mathbf{z}_{lm}^{j-1} \]
iterating to convergence.
At iteration $j$, a "batch" version of SOM updates all centers/prototypes $\left\{ \mathbf{z}_{lm}^{j-1} \right\}$ to $\left\{ \mathbf{z}_{lm}^j \right\}$ as follows. For each $\mathbf{z}_{lm}^{j-1}$, let $\mathcal{X}_{lm}^{j-1}$ be the set of items for which the closest element of $\left\{ \mathbf{z}_{lm}^{j-1} \right\}$ has index pair $(l,m)$. Then update $\mathbf{z}_{lm}^{j-1}$ as some kind of (weighted) average of the elements of $\cup_{(l,m)' \in N(l,m)} \mathcal{X}_{(l,m)'}^{j-1}$ (the set of $\mathbf{x}_i$ closest to prototypes with labels that are 2-dimensional grid neighbors of $(l,m)$). A natural form of this is to set (with $\bar{\mathbf{x}}_{(l,m)}^{j-1}$ the obvious sample mean of the elements of $\mathcal{X}_{lm}^{j-1}$)
\[ \mathbf{z}_{lm}^j = \frac{\sum_{(l,m)' \in N(l,m)} w\left( \left\| \mathbf{z}_{lm}^{j-1} - \mathbf{z}_{(l,m)'}^{j-1} \right\| \right) \bar{\mathbf{x}}_{(l,m)'}^{j-1}}{\sum_{(l,m)' \in N(l,m)} w\left( \left\| \mathbf{z}_{lm}^{j-1} - \mathbf{z}_{(l,m)'}^{j-1} \right\| \right)} \]

It is fairly obvious that even if these algorithms converge, different starting sets $\left\{ \mathbf{z}_{lm}^0 \right\}$ will produce different limits (symmetries alone mean, for example, that the choices $\mathbf{z}_{lm}^0 = \mathbf{u}_{lm}$ and $\mathbf{z}_{lm}^0 = \mathbf{u}_{L-l,M-m}$ produce what might look like different limits, but are really completely equivalent). Beyond this, what is provided by the 2-dimensional layout of indices of prototypes is not immediately obvious. It seems to be fairly common to compare an error sum of squares for a SOM to that of a $K = L \cdot M$ means clustering and to declare victory if the SOM sum is not much worse than the K-means value.
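A minimal numpy sketch of an "online" SOM of this general type (illustrative choices throughout: a Gaussian weighting in terms of grid distance to the BMU and a simple decreasing learning-rate schedule, rather than the exact prescription above):

import numpy as np

def som(X, L, M, n_sweeps=20, radius=1.5, rng=None):
    """Online SOM sketch: returns an (L, M, p) array of prototypes and the grid."""
    rng = np.random.default_rng(rng)
    r, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)                            # standardize coordinates
    grid = np.array([(l, m) for l in range(L) for m in range(M)], dtype=float)
    Z = Xs[rng.choice(r, size=L * M, replace=False)].copy()    # initial prototypes
    for sweep in range(n_sweeps):
        eps = 0.5 * (1.0 - sweep / n_sweeps) + 0.01            # decreasing learning rate
        for i in rng.permutation(r):
            x = Xs[i]
            bmu = ((Z - x) ** 2).sum(1).argmin()               # best-matching unit
            gdist2 = ((grid - grid[bmu]) ** 2).sum(1)          # squared grid distances to BMU
            w = np.exp(-gdist2 / (2 * radius ** 2))            # neighborhood weights on the grid
            Z += eps * w[:, None] * (x - Z)                    # move neighbors toward x
    return Z.reshape(L, M, p), grid.reshape(L, M, 2)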

Zhou's Bayesian SOM Dissertation work of Rick Zhou takes a principled Bayesian modeling and decision-theoretic approach to the SOM objective. The following is an overview of his methodology.

To develop a useful and "generative" model for $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r$ belonging to $\Re^p$, begin by defining $p$ (one for each dimension of the data vectors) 0-mean Gaussian spatial processes
\[ g_1(u,v), g_2(u,v), \ldots, g_p(u,v) \]
and set
\[ \mathbf{g}(u,v) = \begin{pmatrix} g_1(u,v) \\ \vdots \\ g_p(u,v) \end{pmatrix} \]
$\mathbf{g}(u,v)$ then defines a continuous random map $\Re^2 \to \Re^p$. For the $L \cdot M$ points $\boldsymbol{\lambda} = (l,m)$ on an integer grid in $\Re^2$ take $\mathbf{g}(l,m)$ as the center of a data-generating mechanism in $\Re^p$. Then assume that $\mathbf{x}_1, \ldots, \mathbf{x}_r$ are iid as follows. First, one of the $L \cdot M$ fixed points $\boldsymbol{\lambda} = (l,m)$ on the grid of interest is chosen at random, and then conditioned on this choice
\[ \mathbf{x} \sim \mathrm{MVN}_p\left( \mathbf{g}(\boldsymbol{\lambda}), \boldsymbol{\Sigma}_{\boldsymbol{\lambda}} \right) \]
Upon supplying suitable (values of or) prior distributions for the parameters of the $p$ Gaussian processes and priors for the covariance matrices $\boldsymbol{\Sigma}_{l,m}$, MCMC will, for observable $\mathbf{x}_1, \ldots, \mathbf{x}_r$ and corresponding latent $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_r$, produce samples from a posterior distribution over all of
\[ \boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2, \ldots, \boldsymbol{\lambda}_r, \]
\[ g_j(\boldsymbol{\lambda}) \text{ for all points } \boldsymbol{\lambda} \text{ in the grid and } j = 1, 2, \ldots, p, \text{ and} \]
\[ \boldsymbol{\Sigma}_{\boldsymbol{\lambda}} \text{ for all points } \boldsymbol{\lambda} \text{ in the grid} \]

What are of most interest are the grid points for the $r$ cases, $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_r$. Two cases $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ belong to the same cluster if $\boldsymbol{\lambda}_i = \boldsymbol{\lambda}_{i'}$. The MCMC provides relative frequencies that approximate posterior probabilities that case $i$ and case $i'$ belong together, $P[\boldsymbol{\lambda}_i = \boldsymbol{\lambda}_{i'}]$. That is, one obtains an estimate $\hat{\mathbf{C}}$ of the matrix
\[ \underset{r \times r}{\mathbf{C}} = \left( P[\boldsymbol{\lambda}_i = \boldsymbol{\lambda}_{i'}] \right)_{\substack{i=1,2,\ldots,r \\ i'=1,2,\ldots,r}} \]
through MCMC relative frequencies. What one is then led to seek as a final work product is an assignment of data points to grid points that

1. is consistent with $\mathbf{C}$, and

2. (at least locally) more or less preserves relative distances between clusters in $\Re^p$ in terms of distances between corresponding grid points in $\Re^2$.
For a potential assignment $\boldsymbol{\gamma}$ of data points to grid points (that maps $\{1, 2, \ldots, r\}$ to the set of indices $(l,m)$ in the grid) we consider two types of penalties, one for inconsistency with $\mathbf{C}$ and another for failure to preserve distances. First consider disagreement with $\mathbf{C}$. A measure of disparity between partitions of $\{1, 2, \ldots, r\}$ corresponding to $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_r$ and to $\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r$ is, for $a > 0$ and $b > 0$,
\[ L\left( (\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_r), (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r) \right) = \sum_{i<i'} a\, I\left[ \boldsymbol{\lambda}_i = \boldsymbol{\lambda}_{i'} \text{ and } \boldsymbol{\gamma}_i \ne \boldsymbol{\gamma}_{i'} \right] + \sum_{i<i'} b\, I\left[ \boldsymbol{\lambda}_i \ne \boldsymbol{\lambda}_{i'} \text{ and } \boldsymbol{\gamma}_i = \boldsymbol{\gamma}_{i'} \right] \]
The average of this with respect to the posterior distribution is
\[ a \sum_{i<i'} c_{i,i'} + (a+b) \sum_{i<i'} I\left[ \boldsymbol{\gamma}_i = \boldsymbol{\gamma}_{i'} \right] \left( \frac{b}{a+b} - c_{i,i'} \right) \]
so a plausible penalty for inconsistency with $\mathbf{C}$ is
\[ R_1\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); \mathbf{C}, \kappa \right) = \frac{1}{r(r-1)} \sum_{i<i'} I\left[ \boldsymbol{\gamma}_i = \boldsymbol{\gamma}_{i'} \right] \left( \kappa - c_{i,i'} \right) \]
In the penalty $R_1\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); \mathbf{C}, \kappa \right)$ the parameter $\kappa \in (0,1)$ determines what kinds of partitions of $\{1, 2, \ldots, r\}$ are most heavily penalized. Large $\kappa$ tends to heavily penalize $(\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r)$ prescribing large clusters, and small $\kappa$ tends to heavily penalize $(\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r)$ with small clusters.
Consider then penalizing failure to preserve distances. Define maximum distances
\[ M_{\mathrm{grid}} = \max_{\boldsymbol{\lambda}, \boldsymbol{\lambda}' \text{ on the grid}} \left\| \boldsymbol{\lambda} - \boldsymbol{\lambda}' \right\| \quad \text{and} \quad M_{\mathrm{data}} = \max_{i,i'} \left\| \mathbf{x}_i - \mathbf{x}_{i'} \right\| \]
And define for $K \in \{1, 2, \ldots, r\}$ the sets $\mathcal{N}_K$ consisting of those pairs $i$ and $i'$ such that at least one of the points $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ is in the $K$-nearest neighborhood of the other. Then a "local multi-dimensional scaling" type penalty (Footnote 45: see Section 17.3) to apply to a potential assignment $\boldsymbol{\gamma}$ of data points to grid points is
\[ R_2\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); K, \tau \right) = \frac{1}{K^2} \left\{ \sum_{\substack{i<i' \\ \text{s.t. } (i,i') \in \mathcal{N}_K}} \left( \frac{\left\| \mathbf{x}_i - \mathbf{x}_{i'} \right\|}{M_{\mathrm{data}}} - \frac{\left\| \boldsymbol{\gamma}_i - \boldsymbol{\gamma}_{i'} \right\|}{M_{\mathrm{grid}}} \right)^2 - \tau \sum_{\substack{i<i' \\ \text{s.t. } (i,i') \notin \mathcal{N}_K}} \frac{\left\| \boldsymbol{\gamma}_i - \boldsymbol{\gamma}_{i'} \right\|}{M_{\mathrm{grid}}} \right\} \]
for a $\tau > 0$. (The first term penalizes failure to preserve local relative distances and the second encourages separation of mappings to points on the grid that are not neighbors in the $\Re^p$ dataset.)
So, in looking for a map that is consistent with the posterior distribution and preserves local relative distances, a risk/figure of merit is, for $\rho > 0$,
\[ R\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); \hat{\mathbf{C}}, \kappa, K, \tau, \rho \right) = R_1\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); \hat{\mathbf{C}}, \kappa \right) + \rho\, R_2\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); K, \tau \right) \]
Exact optimization of $R\left( (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r); \hat{\mathbf{C}}, \kappa, K, \tau, \rho \right)$ by choice of $(\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_r)$ is in general an NP-hard problem and is thus rarely possible. What is possible, and seems to work remarkably well, is to make a long MCMC run (making one's estimate $\hat{\mathbf{C}}$ reliable) and then look for an MCMC iterate $\boldsymbol{\lambda}_1^j, \ldots, \boldsymbol{\lambda}_r^j$ with the best value of $R\left( (\boldsymbol{\lambda}_1^j, \ldots, \boldsymbol{\lambda}_r^j); \hat{\mathbf{C}}, \kappa, K, \tau, \rho \right)$. The dissertation of Zhou provides substantial examples of the effectiveness of this strategy. The Bayes model behind the MCMC simply tends to concentrate the posterior (and thus make iterates) in a manner consistent with the clustering and distance preservation goals of SOM.
The famous "Wines" dataset has p = 13 chemical characteristics of r = 178
wine samples from 3 di¤erent cultivars (59 (red) samples. 71 (blue) samples, and
48 (violet) of the three types indexed 1-59, 60-130, and 131-178 respectively).
Figure 44 is a graphical (grey-scale) representation of C^ and a corresponding
j j
best iterate 1; : : : ; r .

Figure 44: Bayes SOM representation of clustering of chemical characteristics


vectors for r = 178 wine samples from 3 di¤erent cultivars. (From the PhD
dissertation of Zhou.)

17.3 Multi-Dimensional Scaling


This material begins (as in Section 17.2) with dissimilarities among $N$ items, $d_{ij}$, that might be collected in an $N \times N$ proximity matrix $\mathbf{D} = (d_{ij})$. (These might, but do not necessarily, come from Euclidean distances among $N$ data vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ in $\Re^p$.) The object of multi-dimensional scaling is to (to the extent possible) represent the $N$ items as points $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N$ in $\Re^q$ with
\[ \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \approx d_{ij} \]

This is phrased precisely in terms of one of several optimization problems, where


one seeks to minimize a "stress function" S (z 1 ; z 2 ; : : : ; z N ).
The least squares (or Kruskal-Shepard) stress function (optimization crite-
rion) is
\[ S_{LS}(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \sum_{i<j} \left( d_{ij} - \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \right)^2 \]

This criterion treats errors in reproducing big dissimilarities exactly like it treats errors in reproducing small ones. A different point of view would make faithfulness to small dissimilarities more important than the exact reproduction of big ones. The so-called Sammon mapping criterion
\[ S_{SM}(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \sum_{i<j} \frac{\left( d_{ij} - \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \right)^2}{d_{ij}} \]
reflects this point of view.


Another approach to MDS that emphasizes the importance of small dissimilarities is discussed in HTF under the name of "local multi-dimensional scaling." Here one begins, for fixed $k$, with the symmetric set of index pairs
\[ \mathcal{N}_k = \left\{ (i,j) \,\middle|\, \text{the number of } j' \text{ with } d_{ij'} < d_{ij} \text{ is less than } k \text{ or the number of } i' \text{ with } d_{i'j} < d_{ij} \text{ is less than } k \right\} \]
(an index pair is in the set if one of the items is in the $k$-nearest-neighbor neighborhood of the other). Then a stress function that emphasizes the matching of small dissimilarities and not large ones is (for some choice of $\tau > 0$)
\[ S_L(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \sum_{i<j \text{ and } (i,j) \in \mathcal{N}_k} \left( d_{ij} - \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \right)^2 - \tau \sum_{i<j \text{ and } (i,j) \notin \mathcal{N}_k} \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \]

Another version of MDS begins with similarities $s_{ij}$ (rather than with dissimilarities $d_{ij}$). (One important special case of similarities derives from vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ in $\Re^p$ through centered inner products $s_{ij} = \left\langle \mathbf{x}_i - \bar{\mathbf{x}}, \mathbf{x}_j - \bar{\mathbf{x}} \right\rangle$.) A "classical scaling" criterion is
\[ S_C(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \sum_{i<j} \left( s_{ij} - \left\langle \mathbf{z}_i - \bar{\mathbf{z}}, \mathbf{z}_j - \bar{\mathbf{z}} \right\rangle \right)^2 \]
HTF claim that if in fact similarities are centered inner products, classical scaling is exactly equivalent to principal components analysis.

The four scaling criteria above are all "metric" scaling criteria in that the distances $\|\mathbf{z}_i - \mathbf{z}_j\|$ are meant to approximate the $d_{ij}$ directly. An alternative is to attempt minimization of a non-metric stress function like
\[ S_{NM}(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \frac{\sum_{i<j} \left( \theta(d_{ij}) - \left\| \mathbf{z}_i - \mathbf{z}_j \right\| \right)^2}{\sum_{i<j} \left\| \mathbf{z}_i - \mathbf{z}_j \right\|^2} \]
over vectors $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N$ and increasing functions $\theta(\cdot)$. $\theta(\cdot)$ will preserve/enforce the natural ordering of the dissimilarities without attaching importance to their precise values. Iterative algorithms for optimization of this stress function alternate between isotonic regression to choose $\theta(\cdot)$ and gradient descent to choose the $\mathbf{z}_i$.

In general, if one can produce a small value of stress in MDS, one has discovered a $q$-dimensional representation of the $N$ items, and for small $q$, this is a form of "simple structure."

17.4 More on Principal Components and Related Ideas


Here we extend the principal components ideas first raised in Section 2.4 based on the SVD ideas of Section 2.3, still with the motivation of using them as a means for identifying simple structure in an $N$-case $p$-variable dataset, where, as earlier, we write
\[ \underset{N \times p}{\mathbf{X}} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_N' \end{pmatrix} \]

17.4.1 "Sparse" Principal Components


In standard principal components analysis, the $\mathbf{v}_j$ are sometimes called "loadings" because (in light of the fact that $\mathbf{z}_j = \mathbf{X}\mathbf{v}_j$) they specify what linear combinations of the variables $x_j$ are used in making the various principal component vectors. If the $\mathbf{v}_j$ were "sparse" (had lots of 0s in them) interpretation of these loadings would be easier. So people have made proposals of alternative methods of defining "principal components" that will tend to produce sparse results. One due to Zou is as follows.

One might call a $\mathbf{v} \in \Re^p$ a first sparse principal component "direction" (see Footnote 46) if it is part of a minimizer (over choices of $\mathbf{v} \in \Re^p$ and $\boldsymbol{\theta} \in \Re^p$ with $\|\boldsymbol{\theta}\| = 1$) of the criterion
\[ \sum_{i=1}^N \left\| \mathbf{x}_i - \boldsymbol{\theta} \mathbf{x}_i' \mathbf{v} \right\|^2 + \lambda \|\mathbf{v}\|^2 + \lambda_1 \|\mathbf{v}\|_1 \tag{184} \]
for $\|\cdot\|_1$ the $l_1$ norm on $\Re^p$ and constants $\lambda \ge 0$ and $\lambda_1 \ge 0$. The last term in this expression is analogous to the lasso penalty on a vector of regression coefficients as considered in Section 3.1.2, and produces the same kind of tendency to "0 out" entries that we saw in that context. If $\lambda_1 = 0$, $\mathbf{v}$ is proportional to the ordinary first principal component direction. In fact, if $\lambda = \lambda_1 = 0$ and $N > p$, $\mathbf{v} =$ the ordinary first principal component direction is the optimizer.

Footnote 46: We put quotes on "direction" because in this formulation $\mathbf{v}$ will typically not be a unit vector.
For multiple components, an analogue of the first case is a set of $K$ vectors $\mathbf{v}_k \in \Re^p$ organized into a $p \times K$ matrix $\mathbf{V}$ that is part of a minimizer (over choices of $p \times K$ matrices $\mathbf{V}$ and $p \times K$ matrices $\boldsymbol{\Theta}$ with $\boldsymbol{\Theta}'\boldsymbol{\Theta} = \mathbf{I}$) of the criterion
\[ \sum_{i=1}^N \left\| \mathbf{x}_i - \boldsymbol{\Theta} \mathbf{V}' \mathbf{x}_i \right\|^2 + \lambda \sum_{k=1}^K \|\mathbf{v}_k\|^2 + \sum_{k=1}^K \lambda_{1k} \|\mathbf{v}_k\|_1 \tag{185} \]
for constants $\lambda \ge 0$ and $\lambda_{1k} \ge 0$. Zou has apparently provided effective algorithms for optimizing criteria (184) or (185).
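For a sense of what such sparse loadings look like in practice, scikit-learn's SparsePCA optimizes a closely related elastic-net-penalized reconstruction criterion (not literally criterion (184) or (185) themselves); a usage sketch with synthetic data:

import numpy as np
from sklearn.decomposition import PCA, SparsePCA

X = np.random.randn(200, 10)
X = X - X.mean(axis=0)                       # center, as assumed throughout

pca = PCA(n_components=3).fit(X)
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

print(np.round(pca.components_, 2))          # dense loadings
print(np.round(spca.components_, 2))         # many entries driven exactly to 0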

17.4.2 Non-negative Matrix Factorization


There are contexts (for example, when data are counts) where it may not make intuitive sense to center inherently non-negative variables, so that $\mathbf{X}$ is naturally non-negative, and one might want to find non-negative matrices $\mathbf{W}$ and $\mathbf{H}$ such that
\[ \underset{N \times p}{\mathbf{X}} \approx \underset{N \times r}{\mathbf{W}}\; \underset{r \times p}{\mathbf{H}} \]
Here the emphasis might be on the columns of $\mathbf{W}$ as representing "positive components" of the (positive) $\mathbf{X}$, just as the columns of the matrix $\mathbf{U}\mathbf{D}$ in SVDs provide the principal components of $\mathbf{X}$. Various optimization criteria could be set to guide the choice of $\mathbf{W}$ and $\mathbf{H}$. One might try to minimize
\[ \sum_{i=1}^N \sum_{j=1}^p \left( x_{ij} - (\mathbf{W}\mathbf{H})_{ij} \right)^2 \]
or maximize
\[ \sum_{i=1}^N \sum_{j=1}^p \left[ x_{ij} \ln\left( (\mathbf{W}\mathbf{H})_{ij} \right) - (\mathbf{W}\mathbf{H})_{ij} \right] \]
over non-negative choices of $\mathbf{W}$ and $\mathbf{H}$, and various algorithms for doing these have been proposed. (Notice that the second of these criteria is an extension of a loglikelihood for independent Poisson variables with means the entries of $\mathbf{W}\mathbf{H}$ to cases where the $x_{ij}$ need only be non-negative, not necessarily integer.)
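For example, scikit-learn's NMF can minimize either a squared-error or a Kullback-Leibler/Poisson-type criterion of the kinds just described; a usage sketch with synthetic count data (an illustration, not code from the notes):

import numpy as np
from sklearn.decomposition import NMF

X = np.random.poisson(lam=3.0, size=(100, 20)).astype(float)   # non-negative "count" data

# squared-error criterion
model_f = NMF(n_components=5, init="nndsvda", random_state=0)
W = model_f.fit_transform(X)          # N x r
H = model_f.components_               # r x p

# Poisson/KL-type criterion
model_kl = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
               init="nndsvda", random_state=0)
W_kl = model_kl.fit_transform(X)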
While at first blush this enterprise seems sensible, there is a lack of uniqueness in a factorization producing a product $\mathbf{W}\mathbf{H}$, and therefore how to interpret the columns of one of the many possible $\mathbf{W}$s is not clear. (An easy way to see the lack of uniqueness is this. Suppose that all entries of the product $\mathbf{W}\mathbf{H}$ are positive. Then for $\mathbf{E}$ a small enough (but not $\mathbf{0}$) matrix, all entries of $\mathbf{W}^* \equiv \mathbf{W}(\mathbf{I} + \mathbf{E}) \ne \mathbf{W}$ and $\mathbf{H}^* \equiv (\mathbf{I} + \mathbf{E})^{-1}\mathbf{H} \ne \mathbf{H}$ are positive, and $\mathbf{W}^*\mathbf{H}^* = \mathbf{W}\mathbf{H}$.) Lacking some natural further restriction on the factors $\mathbf{W}$ and $\mathbf{H}$ (beyond non-negativity) it seems the practical usefulness of this basic idea is also lacking.
17.4.3 Archetypal Analysis
Another approach to finding an interpretable factorization of $\mathbf{X}$ was provided by Cutler and Breiman in their "archetypal analysis." Again one means to write
\[ \underset{N \times p}{\mathbf{X}} \approx \underset{N \times r}{\mathbf{W}}\; \underset{r \times p}{\mathbf{H}} \]
for appropriate $\mathbf{W}$ and $\mathbf{H}$. But here two restrictions are imposed, namely that

1. the rows of $\mathbf{W}$ are probability vectors (so that the approximation to $\mathbf{X}$ is in terms of convex combinations/weighted averages of the rows of $\mathbf{H}$), and

2. $\underset{r \times p}{\mathbf{H}} = \underset{r \times N}{\mathbf{B}}\; \underset{N \times p}{\mathbf{X}}$ where the rows of $\mathbf{B}$ are probability vectors (so that the rows of $\mathbf{H}$ are in turn convex combinations/weighted averages of the rows of $\mathbf{X}$).
The $r$ rows of $\mathbf{H} = \mathbf{B}\mathbf{X}$ are the "prototypes" (?archetypes?) used to represent the data matrix $\mathbf{X}$. With this notation and these restrictions, (stochastic matrices) $\mathbf{W}$ and $\mathbf{B}$ are chosen to minimize
\[ \left\| \mathbf{X} - \mathbf{W}\mathbf{B}\mathbf{X} \right\|^2 \]
It's clearly possible to rearrange the rows of a minimizing $\mathbf{B}$ and make corresponding changes in $\mathbf{W}$ without changing $\|\mathbf{X} - \mathbf{W}\mathbf{B}\mathbf{X}\|^2$. So strictly speaking, the optimization problem has multiple solutions. But in terms of the set of rows of $\mathbf{H}$ (a set of prototypes of size $r$) it's possible that this optimization problem often has a unique solution. (Symmetries induced in the set of $N$ rows of $\mathbf{X}$ can be used to produce examples where it's clear that genuinely different sets of prototypes produce the same minimal value of $\|\mathbf{X} - \mathbf{W}\mathbf{B}\mathbf{X}\|^2$. But it seems likely that real datasets will usually lack such symmetries and lead to a single optimizing set of prototypes.)

Emphasis in this version of the "approximate $\mathbf{X}$" problem is on the set of prototypes as "representative data cases." This has to be taken with a grain of salt, since they are nearly always near the "edges" of the dataset. This should be no surprise, as line segments between extreme cases in $\Re^p$ can be made to run close to cases in the "middle" of the dataset, while line segments between interior cases in the dataset will never be made to run close to extreme cases.

17.4.4 Independent Component Analysis


We begin by supposing (as before) that $\mathbf{X}$ has been centered. For simplicity, suppose also that it is of full rank (rank $p$). With the SVD as before,
\[ \underset{N \times p}{\mathbf{X}} = \underset{N \times p}{\mathbf{U}}\; \underset{p \times p}{\mathbf{D}}\; \underset{p \times p}{\mathbf{V}'} \]
we consider the "sphered" version of the data matrix
\[ \mathbf{X}^* = \sqrt{N}\, \mathbf{X}\mathbf{V}\mathbf{D}^{-1} \]
so that the sample covariance matrix of the data is
\[ \frac{1}{N} \mathbf{X}^{*\prime}\mathbf{X}^* = \mathbf{I} \]
Note that the columns of $\mathbf{X}^*$ are then scaled principal components of the (centered) data matrix, and we operate with and on $\mathbf{X}^*$. (For simplicity of notation, we'll henceforth drop the "$*$" on $\mathbf{X}$.) This methodology seems to be an attempt to find latent probabilistic structure in terms of independent variables to account for the principal components.
In particular, in its linear form, ICA attempts to model the $N$ (transposed) rows of $\mathbf{X}$ as iid of the form
\[ \underset{p \times 1}{\mathbf{x}_i} = \underset{p \times p}{\mathbf{A}}\; \underset{p \times 1}{\mathbf{s}_i} \tag{186} \]
for iid vectors $\mathbf{s}_i$, where the (marginal) distribution of the vectors $\mathbf{s}_i$ is one of independence of the $p$ coordinates/components and the matrix $\mathbf{A}$ is an unknown parameter. Consistent with our sphering of the data matrix, we'll assume that $\operatorname{Cov}\mathbf{x} = \mathbf{I}$ and, without any loss of generality, assume that the covariance matrix for $\mathbf{s}$ is not only diagonal but that $\operatorname{Cov}\mathbf{s} = \mathbf{I}$. Since then $\mathbf{I} = \operatorname{Cov}\mathbf{x} = \mathbf{A}(\operatorname{Cov}\mathbf{s})\mathbf{A}' = \mathbf{A}\mathbf{A}'$, $\mathbf{A}$ must be orthogonal, and so
\[ \mathbf{A}'\mathbf{x} = \mathbf{s} \]
Obviously, if one can estimate $\mathbf{A}$ with an orthogonal $\hat{\mathbf{A}}$, then $\hat{\mathbf{s}}_i \equiv \hat{\mathbf{A}}'\mathbf{x}_i$ serves as an estimate of what vector of independent components led to the $i$th row of $\mathbf{X}$, and indeed
\[ \hat{\mathbf{S}} \equiv \mathbf{X}\hat{\mathbf{A}} \]
has columns that provide predictions of the $N$ (row) $p$-vectors $\mathbf{s}_i'$, and we might thus call those the "independent components" of $\mathbf{X}$ (just as we term the columns of $\mathbf{X}\mathbf{V}$ the principal components of $\mathbf{X}$). There is a bit of arbitrariness in the representation (186) because the ordering of the coordinates of $\mathbf{s}$ and the corresponding columns of $\mathbf{A}$ is arbitrary. But this is no serious concern.
So then, the question is what one might use as a method to estimate $\mathbf{A}$ in display (186). There are several possibilities. The one discussed in HTF is related to entropy and Kullback-Leibler distance. If one assumes that an ($m$-dimensional) random vector $\mathbf{Y}$ has a density $p$ with marginal densities $p_1, p_2, \ldots, p_m$, then an "independence version" of the distribution of $\mathbf{Y}$ has density $\prod_{j=1}^m p_j$ and the (non-negative) K-L divergence of the distribution of $\mathbf{Y}$ from its independence version is
\[
\begin{aligned}
\mathrm{KL}\left( p, \prod_{j=1}^m p_j \right) &= \int p(\mathbf{y}) \ln\left( \frac{p(\mathbf{y})}{\prod_{j=1}^m p_j(y_j)} \right) d\mathbf{y} \\
&= \int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y} - \sum_{j=1}^m \int p(\mathbf{y}) \ln p_j(y_j)\, d\mathbf{y} \\
&= \int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y} - \sum_{j=1}^m \int p_j(y_j) \ln p_j(y_j)\, dy_j \\
&= \sum_{j=1}^m H(Y_j) - H(\mathbf{Y})
\end{aligned}
\]

for H the entropy function for a random argument. Since entropy is an inverse
measure of information for a distribution, this K-L divergence is a di¤erence in
the information carried by Y (jointly) and the sum across the components of
their individual information contents. If it is small, one might loosely interpret
the components of Y as approximately independent.
If one then thinks of $\mathbf{s}$ as random and of the form $\mathbf{A}'\mathbf{x}$ for random $\mathbf{x}$, it is perhaps sensible to seek an orthogonal $\mathbf{A}$ to minimize (for $\mathbf{a}_j$ the $j$th column of $\mathbf{A}$)
\[ \sum_{j=1}^p H(s_j) - H(\mathbf{s}) = \sum_{j=1}^p H(\mathbf{a}_j'\mathbf{x}) - H(\mathbf{A}'\mathbf{x}) = \sum_{j=1}^p H(\mathbf{a}_j'\mathbf{x}) - H(\mathbf{x}) - \ln|\det \mathbf{A}| = \sum_{j=1}^p H(\mathbf{a}_j'\mathbf{x}) - H(\mathbf{x}) \]
As it turns out, this is equivalent (for orthogonal $\mathbf{A}$) to maximization of
\[ C(\mathbf{A}) = \sum_{j=1}^p \left[ H(z) - H(\mathbf{a}_j'\mathbf{x}) \right] \tag{187} \]

for $z$ standard normal. Then a common approximation is apparently
\[ H(z) - H(\mathbf{a}_j'\mathbf{x}) \approx \left[ EG(z) - EG(\mathbf{a}_j'\mathbf{x}) \right]^2 \]
for $G(u) \equiv \frac{1}{c}\ln\cosh(cu)$ for a $c \in [1,2]$. Then, criterion (187) has the empirical approximation
\[ \hat{C}(\mathbf{A}) = \sum_{j=1}^p \left( EG(z) - \frac{1}{N}\sum_{i=1}^N G(\mathbf{a}_j'\mathbf{x}_i) \right)^2 \]
where, as usual, $\mathbf{x}_i'$ is the $i$th row of $\mathbf{X}$. $\hat{\mathbf{A}}$ can be taken to be an optimizer of $\hat{C}(\mathbf{A})$.

Ultimately, this development produces a rotation matrix that makes the $p$ entries of the rotated and scaled principal component score vectors "look as independent as possible." This is thought of as resolution of a data matrix into its "independent sources" and as a technique for "blind source separation."

17.4.5 Principal Curves and Surfaces


In the context of Section 2.4, the line in $\Re^p$ defined by $\{c\mathbf{v}_1 \mid c \in \Re\}$ serves as a best straight-line representative of the dataset. Similarly, the "plane" in $\Re^p$ defined by $\{c_1\mathbf{v}_1 + c_2\mathbf{v}_2 \mid c_1 \in \Re, c_2 \in \Re\}$ serves as a best "planar" representative of the dataset. The notions of "principal curve" and "principal surface" are attempts to generalize these ideas to 1-dimensional and 2-dimensional structures in $\Re^p$ that may not be "straight" or "flat" but rather have some curvature.

A parametric curve in $\Re^p$ is represented by a vector-valued function
\[ \mathbf{h}(t) = \begin{pmatrix} h_1(t) \\ h_2(t) \\ \vdots \\ h_p(t) \end{pmatrix} \]
defined on some interval $[0,T]$, where we assume that the coordinate functions $h_j(t)$ are smooth. With
\[ \mathbf{h}'(t) = \begin{pmatrix} h_1'(t) \\ h_2'(t) \\ \vdots \\ h_p'(t) \end{pmatrix} \]
the "velocity vector" for the curve, $\|\mathbf{h}'(t)\|$ is then the "speed" for the curve and the arc length (distance) along $\mathbf{h}(t)$ from $t = 0$ to $t = t^0$ is
\[ L_{\mathbf{h}}(t^0) = \int_0^{t^0} \left\| \mathbf{h}'(t) \right\| dt \]

In order to set an unambiguous representation of a curve, it will be useful to assume that it is parameterized so that it has unit speed, i.e. that $L_{\mathbf{h}}(t^0) = t^0$ for all $t^0 \in [0,T]$. Notice that if it does not, for $(L_{\mathbf{h}})^{-1}$ the inverse of the arc length function, the parametric curve
\[ \mathbf{g}(\lambda) = \mathbf{h}\left( (L_{\mathbf{h}})^{-1}(\lambda) \right) \text{ for } \lambda \in [0, L_{\mathbf{h}}(T)] \tag{188} \]
does have unit speed and traces out the same set of points in $\Re^p$ that are traced out by $\mathbf{h}(t)$. So there is no loss of generality in assuming that the parametric curves we consider here are parameterized by arc length, and we'll henceforth write $\mathbf{h}(\lambda)$.

Then, for a unit-speed parametric curve $\mathbf{h}(\lambda)$ and point $\mathbf{x} \in \Re^p$, we'll define the projection index
\[ \lambda_{\mathbf{h}}(\mathbf{x}) = \sup\left\{ \lambda \,\middle|\, \left\| \mathbf{x} - \mathbf{h}(\lambda) \right\| = \inf_{\lambda'} \left\| \mathbf{x} - \mathbf{h}(\lambda') \right\| \right\} \tag{189} \]
This is roughly the last arc length for which the distance from $\mathbf{x}$ to the curve is minimum. If one thinks of $\mathbf{x}$ as random, the "reconstruction error"
\[ E\left\| \mathbf{x} - \mathbf{h}\left( \lambda_{\mathbf{h}}(\mathbf{x}) \right) \right\|^2 \]
(the expected squared distance between $\mathbf{x}$ and the curve) might be thought of as a measure of how well the curve represents the distribution. Of course, for a dataset containing $N$ cases $\mathbf{x}_i$, an empirical analogue of this is
\[ \frac{1}{N}\sum_{i=1}^N \left\| \mathbf{x}_i - \mathbf{h}\left( \lambda_{\mathbf{h}}(\mathbf{x}_i) \right) \right\|^2 \tag{190} \]

and a "good" curve representing the dataset should have a small value of this
empirical reconstruction error. Notice however, that this can’t be the only
consideration. If it was, there would sure be no real di¢ culty in running a
very wiggly (and perhaps very long) curve through every element of a dataset
to produce a curve with 0 empirical reconstruction error. This suggests that
with 0 00 1
h1 ( )
B h002 ( ) C
00 B C
h ( )=B .. C
@ . A
00
hp ( )
the curve’s "acceleration vector," there must be some kind of control exercised
on the curvature, h00 ( ) , in the search for a good curve. We’ll note below
where this control is implicitly applied in standard algorithms for producing
principal curves for a dataset.
Returning for a moment to the case where we think of x as random, we’ll say
that h( ) is a principal curve for the distribution of x if it satis…es a so-called
self-consistency property, namely that

h ( ) = E [xj h (x) = ] (191)

This provides motivation for an iterative 2-step algorithm to produce a "principal curve" for a dataset.

Begin iteration with $\mathbf{h}^0(\lambda)$ the ordinary first principal component line for the dataset. Specifically, do something like the following. For some choice of $T > 2\max_{i=1,\ldots,N} |\langle \mathbf{x}_i, \mathbf{v}_1 \rangle|$ begin with
\[ \mathbf{h}^0(\lambda) = \left( \lambda - \frac{T}{2} \right) \mathbf{v}_1 \]
to create a unit-speed curve that extends past the dataset in both directions along the first principal component direction in $\Re^p$. Then project the $\mathbf{x}_i$ onto the line to get $N$ values
\[ \lambda_i^1 = \lambda_{\mathbf{h}^0}(\mathbf{x}_i) = \langle \mathbf{x}_i, \mathbf{v}_1 \rangle + \frac{T}{2} \]
and, in light of the criterion (191), more or less average the $\mathbf{x}_i$ with corresponding $\lambda_{\mathbf{h}^0}(\mathbf{x}_i)$ near $\lambda$ to get $\mathbf{h}^1(\lambda)$. A specific possible version of this is to consider, for each coordinate $j$, the $N$ pairs
\[ \left( \lambda_i^1, x_{ij} \right) \]
and to take as $h_j^1(\lambda)$ a function on $[0,T]$ that is a 1-dimensional cubic smoothing spline. One may then assemble these into a vector function to create $\mathbf{h}^1(\lambda)$. NOTICE that implicit in this prescription is control over the second derivatives of the component functions through the stiffness parameter/weight in the smoothing spline optimization. Notice also that for this prescription the unit-speed property of $\mathbf{h}^0(\lambda)$ will not carry over to $\mathbf{h}^1(\lambda)$, and it seems that one must use the idea (188) to assure that $\mathbf{h}^1(\lambda)$ is parameterized in terms of arc length on $\left[ 0, \int_0^T \left\| \mathbf{h}^{1\prime}(\lambda) \right\| d\lambda \right]$.
With iterate $\mathbf{h}^{m-1}(\lambda)$ in hand, one projects the $\mathbf{x}_i$ onto the curve to get $N$ values
\[ \lambda_i^m = \lambda_{\mathbf{h}^{m-1}}(\mathbf{x}_i) \]
and for each coordinate $j$ considers the $N$ pairs
\[ \left\{ \left( \lambda_i^m, x_{ij} \right) \right\} \]
Fitting a 1-dimensional cubic smoothing spline to these produces $h_j^m(\lambda)$, and these functions are assembled into a vector to create $\mathbf{h}^m(\lambda)$ (which may need some adjustment via relationship (188) to assure that the curve is parameterized in terms of arc length). One iterates until the empirical reconstruction error (190) converges, and one takes the corresponding $\mathbf{h}^m(\lambda)$ to be a principal curve. Notice that which stiffness parameter is applied in the smoothing steps will govern what one gets for such a curve.
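A rough numpy/SciPy sketch of this project-then-smooth iteration (a simplified illustration that uses UnivariateSpline as the 1-dimensional smoother and handles the arc-length reparameterization only approximately; names and defaults are this sketch's own):

import numpy as np
from scipy.interpolate import UnivariateSpline

def principal_curve(X, n_iter=10, smooth=None):
    """Hastie-Stuetzle-style principal curve sketch for an N x p data matrix X."""
    X = X - X.mean(axis=0)
    # start from the first principal component line
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    lam = X @ Vt[0]                                   # projection indices on h^0
    for _ in range(n_iter):
        order = np.argsort(lam)
        lam_s, X_s = lam[order], X[order]
        lam_s = lam_s + 1e-9 * np.arange(len(lam_s))  # spline needs strictly increasing abscissas
        # smooth each coordinate against the current projection indices
        splines = [UnivariateSpline(lam_s, X_s[:, j], s=smooth) for j in range(X.shape[1])]
        curve = np.column_stack([f(lam_s) for f in splines])
        # approximate arc-length reparameterization of the fitted points
        seg = np.sqrt((np.diff(curve, axis=0) ** 2).sum(axis=1))
        arclen = np.concatenate([[0.0], np.cumsum(seg)])
        # new projection index for each data point: arc length of its closest fitted point
        d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
        lam = arclen[d2.argmin(axis=1)]
    return lam, curve, arclen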
There have been efforts to extend principal curves technology to the creation of 2-dimensional principal surfaces. Parts of the extension are more or less clear. A parametric surface in $\Re^p$ is represented by a vector-valued function
\[ \mathbf{h}(\mathbf{t}) = \begin{pmatrix} h_1(\mathbf{t}) \\ h_2(\mathbf{t}) \\ \vdots \\ h_p(\mathbf{t}) \end{pmatrix} \]
for $\mathbf{t} \in S \subset \Re^2$. A projection index parallel to form (189) and a self-consistency property parallel to that in display (191) can clearly be defined for the surface case, and thin plate splines can replace 1-dimensional cubic smoothing splines for producing iterates of coordinate functions. But ideas of unit speed don't have obvious translations to $\Re^2$ and methods here seem fundamentally more complicated than what is required for the 1-dimensional case.

17.5 (Original) Google PageRanks


This might be thought of as of some general interest beyond the particular appli-
cation of ranking Web pages, if one abstracts the general notion of summarizing
features of a directed graph with N nodes (N Web pages in the motivating
application) where edges point from some nodes to other nodes (there are links
on some Web pages to other Web pages). The basic idea is that one wishes to
rank the nodes (Web pages) by some measure of importance.
If $i \ne j$ define
\[ L_{ij} = \begin{cases} 1 & \text{if there is a directed edge pointing from node } j \text{ to node } i \\ 0 & \text{otherwise} \end{cases} \]
and define
\[ c_j = \sum_{i=1}^N L_{ij} = \text{the number of directed edges pointed away from node } j \]
(There is the question of how we are going to define $L_{jj}$. We may either declare that there is an implicit edge pointed from each node $j$ to itself and adopt the convention that $L_{jj} = 1$, or we may declare that all $L_{jj} = 0$.)
A node (Web page) might be more important if many other (particularly, important) nodes have edges (links) pointing to it. The Google PageRanks $r_i > 0$ are chosen to satisfy (see Footnote 47)
\[ r_i = (1-d) + d \sum_j \frac{L_{ij}}{c_j} r_j \tag{192} \]
for some $d \in (0,1)$ (producing minimum rank $(1-d)$). (Apparently, a standard choice is $d = .85$.) The question is how one can identify the $r_i$ (see Footnote 48).

Footnote 47: There is the question of what $\frac{L_{ij}}{c_j}$ should mean in case $c_j = 0$. We'll presume that the meaning is that the ratio is 0.

Footnote 48: These are, of course, simply $N$ linear equations in the $N$ unknowns $r_i$, and for small $N$ one might ignore special structure and simply solve these numerically with a generic solver. In what follows we exploit some special structure.

Without loss of generality, with
\[ \mathbf{r} = \begin{pmatrix} r_1 \\ \vdots \\ r_N \end{pmatrix} \]
we'll assume that $\mathbf{r}'\mathbf{1} = N$ so that the average rank is 1. Then, for
\[ d_j = \begin{cases} \frac{1}{c_j} & \text{if } c_j \ne 0 \\ 0 & \text{if } c_j = 0 \end{cases} \]
define the $N \times N$ diagonal matrix $\mathbf{D} = \operatorname{diag}(d_1, d_2, \ldots, d_N)$. (Clearly, if we use the $L_{jj} = 1$ convention, then $d_j = 1/c_j$ for all $j$.) Then in matrix form, the $N$ equations (192) are (for $\mathbf{L} = (L_{ij})$)
\[ \mathbf{r} = (1-d)\mathbf{1} + d\mathbf{L}\mathbf{D}\mathbf{r} = \left( (1-d)\frac{1}{N}\mathbf{1}\mathbf{1}' + d\mathbf{L}\mathbf{D} \right)\mathbf{r} \]
(using the assumption that $\mathbf{r}'\mathbf{1} = N$). Let
\[ \mathbf{T} = (1-d)\frac{1}{N}\mathbf{1}\mathbf{1}' + d\mathbf{L}\mathbf{D} \]
so that $\mathbf{r}'\mathbf{T}' = \mathbf{r}'$.
Note all entries of $\mathbf{T}$ are non-negative and that
\[ \mathbf{T}'\mathbf{1} = \left( (1-d)\frac{1}{N}\mathbf{1}\mathbf{1}' + d\mathbf{L}\mathbf{D} \right)'\mathbf{1} = (1-d)\frac{1}{N}\mathbf{1}\mathbf{1}'\mathbf{1} + d\mathbf{D}\mathbf{L}'\mathbf{1} = (1-d)\mathbf{1} + d\mathbf{D}\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{pmatrix} \]
so that if all $c_j > 0$, $\mathbf{T}'\mathbf{1} = \mathbf{1}$. We have this condition as long as we either limit application to sets of nodes (Web pages) where each node has an outgoing edge (an outgoing link) or we decide to count every node as pointing to itself (every page as linking to itself) using the $L_{jj} = 1$ convention. Henceforth suppose that indeed all $c_j > 0$.
Under this assumption, $\mathbf{T}'$ is a stochastic matrix (with rows that are probability vectors), the transition matrix for an irreducible aperiodic finite-state Markov Chain. Defining the probability vector
\[ \mathbf{p} = \frac{1}{N}\mathbf{r} \]
it then follows that since $\mathbf{p}'\mathbf{T}' = \mathbf{p}'$ the PageRank vector is $N$ times the stationary probability vector for the Markov Chain. This stationary probability vector can then be found as the limit of any row of
\[ \left( \mathbf{T}' \right)^n \]
as $n \to \infty$.
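A small numpy sketch of this computation by power iteration (using the $L_{jj} = 1$ convention so that every $c_j > 0$; an illustration, not production code):

import numpy as np

def pagerank(L, d=0.85, n_iter=100):
    """PageRanks from an N x N link matrix L with L[i, j] = 1 when node j points to node i."""
    N = L.shape[0]
    L = L.astype(float).copy()
    np.fill_diagonal(L, 1.0)                        # L_jj = 1 convention guarantees c_j > 0
    c = L.sum(axis=0)                               # out-degrees c_j
    T = (1 - d) * np.ones((N, N)) / N + d * L / c   # T = (1-d) 11'/N + d L D
    p = np.ones(N) / N                              # probability vector, p = r / N
    for _ in range(n_iter):
        p = T @ p                                   # power iteration toward the stationary vector
        p = p / p.sum()
    return N * p                                    # r = N p

# tiny example: node 0 is pointed to by nodes 1 and 2; node 1 by node 2
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(pagerank(L))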

Part VI
Miscellanea
18 Graphs as Representing Independence Relationships in Multivariate Distributions
The most coherent approaches to statistical machine learning are ultimately
based on probability models for the generation of all of (x; y) and additionally
"all" other "relevant" unobserved/latent variables. Even for small N , such
multivariate distributions are in general impossibly complicated and impossible
to detail on the basis of a training set. Only by making simplifying assumptions
can progress be made. Graphs are often called upon to organize and represent
useful simplifying assumptions about conditional independence between various
of the variables to be jointly modeled, and their use is sometimes treated as an
important part of machine learning.
Random quantities $X$ and $Y$ are conditionally independent given $Z$, written
\[ X \perp Y \mid Z \]
provided densities factor as
\[ f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z)\, f_{Y|Z}(y|z) \]
A basic result about conditional independence is that
\[ X \perp Y \mid Z \iff f_{X|Y,Z}(x|y,z) = f_{X|Z}(x|z) \]
Conditional independence (like ordinary independence) has some important/useful properties/implications. Among these are

1. $X \perp Y \mid Z \Rightarrow Y \perp X \mid Z$,

2. $X \perp Y \mid Z$ and $U = h(X) \Rightarrow U \perp Y \mid Z$,

3. $X \perp Y \mid Z$ and $U = h(X) \Rightarrow X \perp Y \mid (Z, U)$,

4. $X \perp Y \mid Z$ and $X \perp W \mid (Y, Z) \Rightarrow X \perp (W, Y) \mid Z$, and

5. $X \perp Y \mid Z$ and $X \perp Z \mid Y \Rightarrow X \perp (Y, Z)$.

A possibly more natural (but equivalent) version of property 3. is
\[ X \perp Y \mid Z \text{ and } U = h(X) \Rightarrow Y \perp (X, U) \mid Z \]

A main goal of this material is representing aspects of large joint distributions
in ways that allow one to "see" conditional independence relationships in graphs
representing them and to construct correspondingly simple joint and conditional
densities for variables. In this section we will provide a brief introduction to
the simplest ideas in this enterprise. More on the topics can be found in
books by Murphy, by Wasserman, and by Lauritzen. We’ll …rst consider what
relationships are typically represented using directed graphs, and then what
relationships are represented using undirected graphs.

18.1 Some Considerations for Directed Graphical Models


A directed graph (that might potentially represent some aspects of the joint
distribution of (X; Y; Z; : : :)) consists of nodes (or vertices) X; Y; Z; : : : and ar-
rows (or edges) pointing between some of them. A corresponding probability
model for (X; Y; Z; : : :) can variously be known as a directed graphical model,
a Bayes network (though there is nothing intrinsically Bayesian about this ma-
terial), a belief network (though there need be nothing subjective about what
it represents), or a causal network (though again, there is nothing inherently
causal about what it represents).
For a graph with nodes/vertices X; Y; Z; : : :

1. if an arrow points from X to Y we will say that X is a parent of Y and


that Y is a child of X,
2. a sequence of arrows beginning at X and ending at Y will be called a
directed path from X to Y ,
3. if X = Y or there is a directed path from X to Y , we will say that X is
an ancestor of Y and Y is a descendent of X,
4. a directed path that starts and ends at the same vertex is called a cycle,
and

5. a directed graph is acyclic if it has no cycles.

As a matter of notation/shorthand an acyclic directed graph is usually called a


DAG (a directed acyclic graph) although the corresponding word order is not
really as good as that corresponding to the unpronounceable acronym "ADG."

Figure 45: A DAG.

In Figure 45, X is a parent of Y and an ancestor of W . There is a directed
path from X to W . Y is a child of both X and Z.
For a vector of random quantities (and vertices) $\mathbf{X} = (X_1, X_2, \ldots, X_k)$ and a distribution $P$ for $\mathbf{X}$, it is said that a DAG $\mathcal{G}$ represents $P$ (or $P$ is Markov to $\mathcal{G}$) if and only if densities satisfy
\[ p_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^k p(x_i \mid \text{parents}_i) \tag{193} \]
where
\[ \text{parents}_i = \{ \text{parents of } X_i \text{ in the DAG } \mathcal{G} \} \]
So a joint distribution $P$ for $(X, Y, Z, W)$ is represented by the DAG pictured in Figure 45 if and only if
\[ p_{X,Y,Z,W}(x,y,z,w) = p_X(x)\, p_Z(z)\, p_{Y|X,Z}(y|x,z)\, p_{W|Y}(w|y) \tag{194} \]

A condition equivalent to the Markov condition can be stated in terms of conditional independence relationships. That is, let $\tilde{X}_i$ stand for the set of all vertices $X_1, X_2, \ldots, X_k$ in a DAG $\mathcal{G}$ except for the parents and descendents of $X_i$. Then
\[ P \text{ is represented by } \mathcal{G} \iff \text{for every vertex } X_i,\ X_i \perp \tilde{X}_i \mid \text{parents}_i \tag{195} \]
So, for example, if a joint distribution $P$ for $(X, Y, Z, W)$ is represented by the DAG pictured in Figure 45 it follows that
\[ X \perp Z \quad \text{and} \quad W \perp (X, Z) \mid Y \]

Condition (195) provides a way to simply identify some conditional independence relationships implied by a DAG representation of a joint distribution $P$. Upon introducing some more concepts and machinery (concerning "connectedness" and "separatedness" of vertices of a DAG) other conditional independence relationships that will always hold for $P$ represented by $\mathcal{G}$ can be identified. We refer the interested reader to the books of Murphy, Wasserman, and Lauritzen for more details. Rather than going further into the probabilistic implications of various types of structure possessed by a DAG $\mathcal{G}$ representing a distribution $P$, we will here simply consider in broad terms the practical implications and difficulties associated with adopting a directed graphical model.

In the first place, a directed graphical model is less complicated than a general distribution for the same set of variables, and thus potentially requires less data to accurately characterize. As a simple toy example, consider a jointly discrete distribution for $(X, Y, Z, W)$ taking values in the simple set $\{1,2,3\}^4$. A general joint distribution for $(X, Y, Z, W)$ requires the specification of $3^4 - 1 = 80$ probabilities (the 81st coming from the fact that the values $p_{X,Y,Z,W}(x,y,z,w)$ must sum to 1). On the other hand, if $P$ is represented by the graph $\mathcal{G}$ in Figure 45, then form (194) holds and only 2 values are needed to specify $p_X(x)$, 2 values are needed to specify $p_Z(z)$, 2 values are needed to specify each of 9 different conditional pmfs $p_{Y|X,Z}(y|x,z)$, and finally 2 values are needed to specify each of 3 conditional pmfs $p_{W|Y}(w|y)$. That is, there are $2 + 2 + 2(9) + 2(3) = 28$ probabilities to be specified under form (194), far fewer than in general.
probabilities to be speci…ed under form (194), far fewer than in general.
None of this touches the obvious questions of what forms of DAG are ap-
propriate (and why they are so) in particular applications and lead to e¤ective
methods of translating a training set into appropriate estimates for the factors
p (xi jparentsi ) in the expression (193). The question of how to infer the fac-
tors of the product form from training data is particularly perplexing for models
that include latent/hidden/unobserved nodes. Researchers who value this kind
of modeling must obviously produce tractable and believable DAGs and cor-
responding forms for conditional distributions that lead to e¤ective specialized
…tting methods for the kinds of training data they expect to encounter.

18.2 Some Considerations for Undirected Graphical Models
An undirected graph (that might potentially represent some aspects of the
joint distribution of X = (X1 ; X2 ; : : : ; Xk )) consists of nodes (or vertices )
X1 ; X2 ; : : : ; Xk and edges connecting some of them. A corresponding proba-
bility model for X is variously known as an undirected graphical model, a
Markov random …eld, and a Markov network.
Some of the terminology introduced for directed graphs carries over to undi-
rected graphs. And there are also some important additional concepts. For a
graph with nodes/vertices X; Y; Z; : : :

1. two vertices X and Y are said to be adjacent if there is an edge between


them, here symbolized as X Y ,
2. a sequence of vertices fX1 ; X2 ; : : : ; Xn g is a path if Xi Xi+1 for each i,
3. if A; B; and C are disjoint sets of vertices, C separates A and B provided
every path from a vertex X 2 A to a vertex Y 2 B contains an element
of C,

4. a clique is a set of vertices of a graph that are all adjacent to each other,
and
5. a clique is maximal if it is not possible to add another vertex to it and
still have a clique.

Item 3. could be equivalently stated as "C separates A and B provided upon


removing the vertices in C from the graph there is no path from a vertex in A
to vertex in B."
The simple example below can be used to illustrate some of this terminology.
In Figure 46 fX1 ; X3 g and fX4 g are separated by fX2 g, fX3 g and fX4 g are
separated by fX2 g, fX1 ; X2 g is a clique, and fX1 ; X2 ; X3 g is a maximal clique.

Figure 46: An undirected graph.

For a vector of random quantities (and vertices) $\mathbf{X} = (X_1, X_2, \ldots, X_k)$, for each $i$ and $j$ let $\mathbf{X}_{-ij}$ stand for all elements of $\{X_1, X_2, \ldots, X_k\}$ except elements $i$ and $j$. For $P$ the distribution of $\mathbf{X}$, we may associate with $P$ a pairwise Markov graph $\mathcal{G}$ by
\[ \text{failing to connect } X_i \text{ and } X_j \text{ with an edge if and only if } X_i \perp X_j \mid \mathbf{X}_{-ij} \]
A pairwise Markov graph for $P$ can be made by considering only $\binom{k}{2}$ pairwise conditional independence questions. But as it turns out, many other conditional independence relationships can be read from it. That is, it turns out that if $\mathcal{G}$ is a pairwise Markov graph for $P$, then for non-overlapping sets of vertices $A$, $B$, and $C$ and corresponding subvectors of $\mathbf{X}$, respectively $\mathbf{X}_A$, $\mathbf{X}_B$, and $\mathbf{X}_C$,
\[ C \text{ separates } A \text{ and } B \Rightarrow \mathbf{X}_A \perp \mathbf{X}_B \mid \mathbf{X}_C \tag{196} \]
If, for example, Figure 47 is a pairwise Markov graph for a distribution $P$ for $X_1, X_2, \ldots, X_5$, we may conclude from implication (196) that
\[ (X_1, X_2, X_5) \perp (X_3, X_4) \quad \text{and} \quad X_2 \perp X_5 \mid X_1 \]
Figure 47: A pairwise Markov undirected graph for P .

Condition (196) says that graph separation implies conditional independence. An apparently stronger relationship would be the equivalence of the graphical and probabilistic conditions. For $P$ a joint distribution for $X_1, X_2, \ldots, X_k$ and $\mathcal{G}$ an undirected graph, we will say that $P$ is globally $\mathcal{G}$ Markov provided for all non-overlapping sets of vertices $A$, $B$, and $C$
\[ C \text{ separates } A \text{ and } B \iff \mathbf{X}_A \perp \mathbf{X}_B \mid \mathbf{X}_C \]
Then as it turns out,
\[ P \text{ is globally } \mathcal{G} \text{ Markov} \iff \mathcal{G} \text{ is a pairwise Markov graph associated with } P \]
so that separation on a pairwise Markov graph is equivalent to conditional independence.
An important question is "What forms are possible for densities when $P$ is globally $\mathcal{G}$ Markov?" An answer is provided by the famous Hammersley-Clifford Theorem. This promises that if the joint pmf $p_{\mathbf{X}}(\mathbf{x}) > 0$ for all $\mathbf{x}$ and $\{C_1, C_2, \ldots, C_m\}$ is the set of all maximal cliques for a pairwise Markov graph $\mathcal{G}$ associated with $P$, then
\[ p_{\mathbf{X}}(\mathbf{x}) \propto \prod_{i=1}^m \psi_i\left( \mathbf{x}_{C_i} \right) \tag{197} \]
for some functions $\psi_i(\cdot) > 0$. A potentially more natural but less parsimonious representation is that (if again the joint pmf $p_{\mathbf{X}}(\mathbf{x}) > 0$ for all $\mathbf{x}$)
\[ p_{\mathbf{X}}(\mathbf{x}) \propto \prod_{i<j \text{ s.t. } X_i \sim X_j} \psi_{ij}(x_i, x_j) \tag{198} \]
for some functions $\psi_{ij}(\cdot) > 0$.


As for the directed case, the independence relationships implied by an asso-
ciated Markov (undirected) graph enforce simplicity on a joint distribution P .
Take, for example, the situation represented by Figure 46, supposing X takes
4
values in the small set f1; 2; 3g . In general, a pmf pX (x) for this sample space
is speci…ed by 34 1 = 80 probabilities. But the set of maximal cliques for the
undirected graph in Figure 46 is ffX1 ; X2 ; X3 g ; fX2 ; X4 gg so that form (197)
promises that no more than 33 + 32 = 36 values are needed to specify P that is
globally or pairwise G Markov for the undirected graph in Figure 46. Or, since
there are 4 edges on the graph in Figure 46, form (198) promises that no more
than 4 32 = 36 values are needed to specify P .
The same issues raised regarding the practical use of directed graphical models arise for undirected graphical models. What forms are appropriate for what kinds of problems? How does one infer the nature of the ψ(·) appearing in form (197) or (198) from training data, especially when some of the variables involved are latent/hidden/unobserved? Again, researchers who value this kind of modeling must obviously produce tractable and believable/well-motivated forms G and corresponding forms for the ψ(·) that lead to effective specialized fitting methods for the kinds of training data they expect to encounter.

18.2.1 Restricted Boltzmann Machines


One particular version of undirected graphical modeling that has seen recent interest in machine learning applications is that of Restricted Boltzmann Machines. We will here consider the simplest of these models, where all variables are binary.49 In this kind of model, nodes are arranged in two layers and there are edges only between nodes on different layers, not between nodes in the same layer. One layer of nodes is called the "hidden layer" and the other is called the "visible layer." Typically the nodes in the visible layer correspond to (digital versions of) variables that are (at least at some cost) empirically observable, while the variables corresponding to hidden nodes are completely latent/unobservable and somehow represent some stochastic physical or mental mechanism. In addition, it is convenient in some contexts to think of visible nodes as being of two types, say belonging to a set V1 or a set V2. For example, in a prediction context, the nodes in V1 might encode "x"/inputs and the nodes in V2 might encode y/outputs. We'll use the naming conventions indicated in Figure 48.

49 For some purposes 0/1 coding is most convenient and for others -1/1 coding of variables is most helpful. What follows can be read with either in mind. The class of models produced does not depend on this choice, only the interpretation of parameters of those models.

Figure 48: An undirected graph corresponding to a restricted Boltzmann machine with l + m + n total nodes. l nodes are hidden and the m + n visible nodes break into the two sets V1 and V2.

Then for l + m + n real parameters η_k and l(m + n) real parameters β_ij (for 1 ≤ i ≤ l and l + 1 ≤ j ≤ l + m + n) the functions

ψ_ij(h_i, v_j | θ) = exp( (η_i/(m + n)) h_i + (η_j/l) v_j + β_ij h_i v_j )

can be used in form (198) to produce a pmf for (h, v) for which Figure 48 provides a pairwise Markov graph. For this form

p(h, v | θ) ∝ exp( Σ_{i=1}^{l} η_i h_i + Σ_{j=l+1}^{l+m+n} η_j v_j + Σ_{i=1}^{l} Σ_{j=l+1}^{l+m+n} β_ij h_i v_j )

and thus

p(h, v | θ) = exp( Σ_{i=1}^{l} η_i h_i + Σ_{j=l+1}^{l+m+n} η_j v_j + Σ_{i=1}^{l} Σ_{j=l+1}^{l+m+n} β_ij h_i v_j ) / Σ_{(h̃,ṽ)} exp( Σ_{i=1}^{l} η_i h̃_i + Σ_{j=l+1}^{l+m+n} η_j ṽ_j + Σ_{i=1}^{l} Σ_{j=l+1}^{l+m+n} β_ij h̃_i ṽ_j )     (199)

Let the normalizing constant that is the denominator on the right of display (199) be called Z(θ) and note the obvious fact that for these models

ln(p(h, v | θ)) = Σ_{i=1}^{l} η_i h_i + Σ_{j=l+1}^{l+m+n} η_j v_j + Σ_{i=1}^{l} Σ_{j=l+1}^{l+m+n} β_ij h_i v_j − ln(Z(θ))     (200)

Observe that such a model can have as many as

l + m + n + l(n + m) = l + (l + 1)(n + m)

non-zero real parameters.


Because for typical applications 2l+n+m can be very large (and thus compu-
tation of ( ) can be prohibitive) it is not common to simulate (h; v) directly.
But since di¤erences in the log probabilities in display (200) for the two possible
values of an hi or vj are very simple, conditional probabilities for a single hi
or vj are very easy to …nd, Gibbs sampling algorithms can be used to gener-
ate observations (h; v) from the pmf (199) for …xed . Similarly, it is easy to
hold …xed part of (h; v) (for example corresponding to nodes in V1 ) and simulate
from conditional distributions via Gibbs sampling (for …xed ), thereby enabling
approximately optimal prediction of the rest of (h; v) (and, in particular, the
nodes corresponding to V2 ).
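To make the conditional simplicity concrete, here is a hedged R sketch of Gibbs scans for a 0/1-coded RBM written in the η/β notation of display (200); the function and object names are mine, not anything standard. Each full conditional is logistic because the differences of log probabilities in (200) are linear in the single coordinate being updated.

# Gibbs sampling for a binary (0/1) RBM with pmf as in (199)/(200).
# eta_h (length l) and eta_v (length m+n) are the node parameters and
# B is the l x (m+n) matrix of edge parameters beta_ij (my names).
sigmoid <- function(z) 1 / (1 + exp(-z))

gibbs_scan <- function(h, v, eta_h, eta_v, B) {
  p_h <- sigmoid(eta_h + as.vector(B %*% v))     # P(h_i = 1 | v)
  h   <- rbinom(length(h), 1, p_h)
  p_v <- sigmoid(eta_v + as.vector(t(B) %*% h))  # P(v_j = 1 | h)
  v   <- rbinom(length(v), 1, p_v)
  list(h = h, v = v)
}

# toy use: l = 3 hidden and m + n = 5 visible nodes, made-up parameters
set.seed(2)
l <- 3; mn <- 5
eta_h <- rnorm(l); eta_v <- rnorm(mn); B <- matrix(rnorm(l * mn), l, mn)
state <- list(h = rbinom(l, 1, .5), v = rbinom(mn, 1, .5))
for (t in 1:1000) state <- gibbs_scan(state$h, state$v, eta_h, eta_v, B)
state$v   # an (approximate) draw of the visible vector for these parameters

Holding the coordinates of v that correspond to V1 fixed and updating only the remaining coordinates gives the prediction scheme described above.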
This all looks attractive/promising, but there are three fundamental difficulties related to these models, namely

1. the matter of fitting a vector of coefficients θ is problematic,

2. the fact that many (?"most"?) Boltzmann machines are nearly degenerate, concentrating their probability on relatively few different vectors, and

3. (related to 2.) many (?"most"?) Boltzmann machines further have the unpleasant property that change of a value of a single entry of (h, v) can cause wide swings in the probability (199).

The first of these issues is well-recognized. The second and third seem far less well-appreciated and make these models often less than ideal for representing observed real variation in v.
What will be available as training data for an RBM is some set of (potentially incomplete) vectors of values for visible nodes, say v_i for i = 1, ..., N (that one will typically assume are independent and from some appropriate marginal distribution for visible vectors derived via summation from the overall joint distribution of values associated with all nodes, visible and hidden). Notice now that even in a hypothetical case where one has "data" consisting of complete (h, v) pairs, the existence of the unpleasant normalizing constant Z(θ) would typically make optimization of a likelihood ∏_{i=1}^{N} p(h_i, v_i | θ) or loglikelihood Σ_{i=1}^{N} ln p(h_i, v_i | θ) problematic. But the fact that one must sum out over (at least) all hidden nodes in order to get contributions to a likelihood or
loglikelihood makes the problem even more computationally difficult. That is, if an ith training case provides a complete visible vector v_i, the corresponding likelihood term is the marginal of that visible configuration

p(v_i | θ) = Σ_{h̃} p(h̃, v_i | θ)

And the computational situation becomes even more unpleasant if an ith training case provides, for example, only values for variables corresponding to nodes in V1 (say v_{1i}), since the corresponding likelihood term is the marginal of only that visible configuration

p(v_{1i} | θ) = Σ_{h̃ and ṽ_2} p(h̃, (v_{1i}, ṽ_2) | θ)

Substantial effort in computer science circles has gone into the search for "learning" algorithms aimed at finding parameter vectors θ that produce large values of loglikelihoods based on N training cases (each term based on some marginal of p(h, v | θ) corresponding to a set of visible nodes). These seem to be mostly based on approximate stochastic gradient descent ideas and approximations to appropriate expectations based on short Gibbs sampling runs. Hinton's notion of "contrastive divergence" appears to be central to the most well known of these. Work of Kaplan et al. calls into question even the possibility of completely rational means of fitting Boltzmann machines by appeal to any standard statistical principles.
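As a rough sketch of what a contrastive-divergence-style step looks like (a CD-1 update for a single training vector, reusing sigmoid() from the Gibbs sketch above), the following is my own illustration of the general idea rather than code from any particular reference; real implementations add minibatching, momentum, and learning-rate schedules.

# One CD-1 style update for the 0/1 RBM parameterization used above.
# v0 is an observed visible vector; eps is a small learning rate.
cd1_update <- function(v0, eta_h, eta_v, B, eps = 0.05) {
  p_h0 <- sigmoid(eta_h + as.vector(B %*% v0))          # "positive phase"
  h0   <- rbinom(length(p_h0), 1, p_h0)
  v1   <- rbinom(length(v0), 1, sigmoid(eta_v + as.vector(t(B) %*% h0)))
  p_h1 <- sigmoid(eta_h + as.vector(B %*% v1))          # "negative phase"
  list(eta_h = eta_h + eps * (p_h0 - p_h1),
       eta_v = eta_v + eps * (v0 - v1),
       B     = B + eps * (outer(p_h0, v0) - outer(p_h1, v1)))
}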
Beyond the fitting problem is the nature of many fitted Boltzmann Machines. These seem to typically seriously under-represent the kind of variability seen in the v_i's in training sets and have marginals p(v | θ) that are highly sensitive to small (one-coordinate) changes in v. For some theory in this direction, see again work of Kaplan et al. These issues seem to be real drawbacks to the use of RBMs in data modeling.
Some of what is termed "deep learning" seems to be based on the notion of generalizing RBMs to more than a single hidden layer and potentially employs visible layers on both the top and the bottom of an undirected graph. A cartoon of a deep learning network is given in Figure 49. The fundamental feature of the graph architecture here is that there are edges only between nodes in successive layers.
If the problem of how to fit parameters for an RBM is a practically difficult/?impossible? one, fitting a deep network model (based on some training cases consisting of values for variables associated with some of the visible nodes) is clearly going to be doubly problematic. What seems to be currently popular is some kind of "greedy"/"forward" sequential fitting of one set of parameters connecting two successive layers at a time, followed by generation of simulated values for variables corresponding to nodes in the "newest" layer and treating those as "data" for fitting a next set of parameters connecting two layers (and so on). But a deep learning network like that portrayed in Figure 49 compounds
Figure 49: An hypothetical "deep" generalization of a RBM.

the kinds of fitting issues raised for RBMs, and principled methods seem lacking. Further, degeneracy and instability issues like 2. and 3. above are also manifest.
If the fitting issues for a deep network were solvable for a given architecture and form of training data, some interesting possibilities for using a good "deep" model have been raised. For one, in a classification/prediction context, one might treat the bottom visible layer as associated with an input vector x, and the top layer as also visible and associated with an output y (or a vector of outputs). In theory, training cases could consist of complete (x, y) pairs, but they could also involve some incomplete cases, where y is missing, or part of x is missing, or ... so on. (Once the training issue is handled, simulation of the top layer variables from inputs of interest used as bottom layer values again enables classification/prediction.)
As an interesting second possibility for using deep structures, one could consider architectures where the top and bottom layers are both visible and encode essentially (or even exactly50) the same information and between them are several hidden layers with one having very few nodes (like, say, two nodes). If one can fit such a thing to training cases consisting of sets of values corresponding to visible nodes, one can simulate (again via Gibbs sampling) for a fixed set of "visible values" corresponding values at the few nodes serving as the narrow layer of the network. The vector of estimated marginal probabilities of a latent value 1 at those nodes might then in turn serve as a kind of pair of "generalized principal component" values for the set of input visible values. (Notice that in theory these are possible to compute even for incomplete sets of visible values.)

50 I am not altogether sure what are the practical implications of essentially turning the kind of "tower" in Figure 49 into a "band" or flat bracelet that connects back on itself, the top hidden layer having edges directly to the bottom visible layer, but I see no obvious prohibition of this architecture.

19 Special Bayes Methods for Statistical Learning
19.1 Relevance Vector Machines
It is a theme that runs through much of modern prediction practice that shrinkage and smoothing and control of complexity of predictors is an essential part of finding effective versions of them. This is often accomplished by shrinking fitted parameter vectors of models toward regions of a parameter space corresponding to relatively simple sub-models.51 Some versions of partially Bayes prediction methodology meant to enforce such parameter-induced simplicity/sparsity seem to go under the name "relevance vector machines." (Presumably this is about identifying a few "relevant" parameters of a model that should not be "zeroed out.") Here we consider some ideas in this direction for parameter vectors β that provide coefficients for linear forms upon which SEL or 0-1 loss predictors are to be built.
Suppose that for p predictors and N cases one creates from the p inputs an N × q feature matrix H using q real-valued functions h_j(x). This could be the original N × p data matrix X in the case that q = p, an N × N Gram matrix K for a kernel K (made with every h_j(x) = K(x, x_j)) in the case q = N, or simply a matrix of values for some set of basis functions h_1, ..., h_q (each mapping ℝ^p → ℝ). The idea is that a predictor of the output y is to be built on the product (h_1(x), ..., h_q(x)) β for a q-vector of parameters β, and that sparsity in the parameter vector β (few entries of any appreciable size) corresponds to simplicity of the final predictor. The search for effective "relevance vector machines" is the search for prior distributions for β that encode sparsity.52
Probably the simplest and most popular priors for this problem are so-called spike and slab priors. These are priors for β that for a large number B, a small positive constant π, and δ_0 the point mass distribution at 0 make its entries iid, with each β_j having the distribution

π N(0, B²) + (1 − π) δ_0   or   π U(−B, B) + (1 − π) δ_0
This kind of prior distribution is obviously symmetric in β_j and puts much of its prior probability on hyperplanes where some of the entries of β are 0. Corresponding posterior distributions based on a training set typically concentrate on and near those hyperplanes.

51 The support vector machine idea is an instance of this, where a relatively few "support vectors" of N possible training vectors are represented in formulas for optimal linear voting functions, because all others have coefficients that are "zeroed out" in the fitting process (this ultimately corresponds to putting parameter vectors on 0-coordinate hyperplanes in an (N + 1)-dimensional parameter space).
52 The fact that a posterior density is proportional to the product of the likelihood function and a prior density implies that the way to get posterior sparsity is to use prior distributions that put most of their mass on or near "simple sub-model" parts of the parameter space.
The lasso development of Section 3.1.2 and the notion that penalization in normal data models is strongly related to use of priors whose log densities are proportional to the penalty functions suggests another possibility. That is the use of priors for β that for a single λ > 0 and a single exponent 0 < r ≤ 1 make its entries iid with each β_j having density proportional to

exp(−λ |β_j|^r)

(The r = 1 case is that of independent doubly exponential priors for the entries of β.) For large λ this symmetric prior again concentrates much of its mass near hyperplanes where some of the entries of β are 0.
A third sparsity-inducing prior for β employs a set of q additional hyperparameters τ_1, ..., τ_q and conditional on these makes the entries of β independent with β_j ∼ N(0, 1/τ_j²) (the variance is 1/τ_j²). Then using independent proper hyper-priors

τ_j ∼ Γ(a, b) for small positive a and b

or independent Jeffreys improper hyper-priors with

ln(1/τ_j²) ∼ U(ℝ)

or proper approximations to the Jeffreys priors like

ln(1/τ_j²) ∼ U(−B, B) for large B

often posteriors for many of the τ_j have large mass far from 0 and correspondingly encode large concentration of mass for many of the β_j near 0.
The third possibility above has some corresponding analytical results that can be used to approximate posteriors, but the most direct way of processing a training set and prior distribution to produce usable posterior results is through the use of standard Bayes MCMC software. In SEL prediction problems, one can for example combine a prior distribution for β with a likelihood derived from a model for independent

y_i ∼ N( β_0 + (h_1(x_i), ..., h_q(x_i)) β, σ² )

(and appropriate priors for β_0 and σ²). In 2-class classification problems (with 0-1 coding) one can combine a prior distribution for β with a likelihood derived from a model for independent

y_i ∼ Bernoulli( 1 / (1 + exp(−(β_0 + (h_1(x_i), ..., h_q(x_i)) β))) )

(and an appropriate prior for β_0). In both cases, standard Bayes software can be used to identify a (typically sparse) high-posterior-density parameter vector β̂ and then a sensible SEL predictor

f̂(x) = β̂_0 + (h_1(x), ..., h_q(x)) β̂

in the first case and a sensible 0-1 loss classifier

f̂(x) = I[ β̂_0 + (h_1(x), ..., h_q(x)) β̂ > 0 ]

in the second.
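For the SEL case, a minimal base-R Gibbs sampler for the third prior above might look like the following sketch. It assumes y and the columns of H have been centered (so the intercept is dropped), works with the precision λ_j = 1/Var(β_j) so that a Gamma(a, b) hyper-prior gives conjugate updates, and uses an inverse gamma prior for σ²; all of these are convenience choices of mine rather than a prescription from the notes.

# Gibbs sampler for y = H beta + eps, eps ~ N(0, sigma^2 I),
# beta_j | lambda_j ~ N(0, 1/lambda_j), lambda_j ~ Gamma(a, b),
# 1/sigma^2 ~ Gamma(c, d).  All updates are standard conjugate draws.
rvm_gibbs <- function(y, H, n_iter = 2000, a = .01, b = .01, c = .01, d = .01) {
  N <- nrow(H); q <- ncol(H)
  beta <- rep(0, q); lambda <- rep(1, q); sig2 <- var(y)
  keep <- matrix(NA, n_iter, q)
  for (t in 1:n_iter) {
    # beta | rest: multivariate normal with ridge-like precision
    V <- chol2inv(chol(crossprod(H) / sig2 + diag(lambda, q)))
    m <- V %*% crossprod(H, y) / sig2
    beta <- as.vector(m + t(chol(V)) %*% rnorm(q))
    # lambda_j | beta_j: conjugate gamma update
    lambda <- rgamma(q, shape = a + 0.5, rate = b + beta^2 / 2)
    # sigma^2 | rest: inverse gamma update
    resid <- y - H %*% beta
    sig2 <- 1 / rgamma(1, shape = c + N / 2, rate = d + sum(resid^2) / 2)
    keep[t, ] <- beta
  }
  keep   # posterior draws of beta; coefficients whose draws sit near 0 are "irrelevant"
}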

19.2 Dirichlet and Data-Derived Priors for Prediction Based on Normal Mixture Models
A basic reality of high-dimensional (large p) prediction is that it is rare to encounter a problem where a single simple relationship between input x and output y holds across a large input space. What one might hope for, however, is to find regions in ℝ^p where different simple relationships hold and to more or less tie those relationships together (across the relevant part of ℝ^p) probabilistically. Work of Lanker, Ryan, Culp, Vardeman, and Morris has been built on this idea and (multivariate) normal mixture models. This section outlines some of that.
For

(y, x) ∼ MVN_{p+1}(μ, Σ)

E[y|x] is a very simple linear function of x. So if a K-vector of (mixture) probabilities π = (π_1, π_2, ..., π_K) (of course with Σ_{k=1}^{K} π_k = 1) and K means μ_k and covariance matrices Σ_k together specify a mixture distribution and

(y, x) ∼ Σ_{k=1}^{K} π_k MVN_{p+1}(μ_k, Σ_k)     (201)

then for p(x | μ_k, Σ_k) the kth (marginal) component density of x and E_k[y|x] the kth (linear) conditional mean function,

E[y|x] = Σ_{k=1}^{K} ( π_k p(x | μ_k, Σ_k) / Σ_{m=1}^{K} π_m p(x | μ_m, Σ_m) ) E_k[y|x]     (202)

where the multiplier of conditional mean k (call it π_k(x)) is the conditional probability that x is from the corresponding component of the mixture. Armed with the mixture probabilities, means, and covariance matrices, formula (202) provides a "locally" (where some π_k(x) is essentially 1) simple (linear) SEL predictor for y. Of course, one doesn't know the parameter values needed to compute the predictor except through a training set T.
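To see how display (202) is used once mixture parameters are in hand, here is a short R sketch (mine, with invented parameter values) that computes the predictor, using the mvtnorm package for the component densities of x and taking the first coordinate of each component mean/covariance to be y.

library(mvtnorm)  # for dmvnorm()

# E[y|x] of display (202) for a (p+1)-dimensional normal mixture whose
# first coordinate is y.  pi_k, mu[[k]], Sigma[[k]] are mixture weights,
# mean vectors, and covariance matrices (here simply made up).
mixture_predictor <- function(x, pi_k, mu, Sigma) {
  K <- length(pi_k)
  w <- numeric(K); Eyx <- numeric(K)
  for (k in 1:K) {
    mu_y <- mu[[k]][1];  mu_x <- mu[[k]][-1]
    S_yx <- Sigma[[k]][1, -1, drop = FALSE]
    S_xx <- Sigma[[k]][-1, -1, drop = FALSE]
    w[k]   <- pi_k[k] * dmvnorm(x, mean = mu_x, sigma = S_xx)
    Eyx[k] <- mu_y + as.numeric(S_yx %*% solve(S_xx, x - mu_x))
  }
  sum((w / sum(w)) * Eyx)   # the "locally linear" SEL predictor
}

# toy use with p = 2 and K = 2 components (parameter values made up)
mu    <- list(c(0, -1, -1), c(3, 1, 1))
Sigma <- list(diag(3), diag(3) + .3)
mixture_predictor(c(1, 1), pi_k = c(.5, .5), mu = mu, Sigma = Sigma)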
It turns out that it often does not work well in practice to simply estimate the parameters of the mixture distribution (201) from a training set and plug those estimates into form (202) to make a SEL predictor for y. But what Lanker et al. have found to be effective is based on a Bayes model and use of latent "component identity" variables. That is, for a training set53 (x_1, y_1), ..., (x_N, y_N) invent N corresponding latent variables k_1, ..., k_N each taking values in {1, 2, ..., K} iid with marginal distribution specified by π. If conditioned on k_i each

(y_i, x_i) ∼ MVN_{p+1}(μ_{k_i}, Σ_{k_i})

the training set has N observations that are iid according to the mixture distribution (201). Then for a prior distribution

π ∼ Dirichlet(a_1, ..., a_K)     (203)

(all a_k's the same works well in practice) and independent priors

μ_k ∼ iid g_1(·)  and  Σ_k ∼ iid g_2(·)     (204)

MCMC algorithms for sampling from the posterior distribution of the parameters of the mixture and the latent variables are easy to find. Iterates of the parameter vectors from such an algorithm produce iterates of the functional form (202), and averaging across iterates can produce a workable predictor. This works by essentially picking out "the right" regions and linear functions where linearity of prediction is warranted and by identifying "the right" number of components (less than or equal to K) to assign appreciable entries in π. But it seems that there are three requirements for the forms g_1(·) and g_2(·) in order for this program to be effective. These are that 1) some kind of conjugacy is needed in order to make Gibbs sampling applicable and the method practical, 2) "locations" for g_1(·) need to be "right" and "scales" for g_2(·) need to be flexible, and 3) neither "flat"/uninformative nor very "sharp"/informative distributions work well for forms g_1(·) and g_2(·).

53 Without loss of generality, suppose that every x_j and y has been standardized.
For purposes of making an effective predictor (not for purposes of a philosophically "proper" Bayes analysis) it proves effective to employ a g_1(·) derived from the training set. One can effectively use a multivariate density estimate based on the observed vectors in the training set and a spherical normal kernel

g_1(μ) = (1/N) Σ_{i=1}^{N} φ_{p+1}( μ | (y_i, x_i), γ² I )

(with φ_{p+1}(· | m, V) the MVN_{p+1}(m, V) density) for an appropriate bandwidth γ. (One can simulate from this prior by picking a training case at random and adding to it a MVN_{p+1}(0, γ² I) random perturbation.)
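Simulation from this data-derived location prior is exactly as simple as the parenthetical suggests; a tiny R sketch (in which train_yx and gamma are placeholders of mine):

# draw one mu from g1: pick a training case at random and perturb it
# spherically.  train_yx is the N x (p+1) matrix of (standardized)
# (y, x) training vectors and gamma is the kernel bandwidth.
draw_g1 <- function(train_yx, gamma) {
  i <- sample(nrow(train_yx), 1)
  train_yx[i, ] + rnorm(ncol(train_yx), 0, gamma)
}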
Further (still for purposes of making an effective predictor) it proves effective to employ for g_2(·) an equally weighted mixture of inverse Wishart densities with corresponding "means" c² I and minimum (namely p + 3) degrees of freedom, for c ∈ {.01, .02, ..., 1.00}. This mixture prior allows different scales for different components of form (201) and gives a group of training cases i with common k_i maximum effect on what values of Σ_{k_i} have large posterior probability (tending to make Σ_{k_i} look like the group sample covariance matrix).
The product of g_1(·) and g_2(·) is then a mixture of joint densities conjugate in the one-sample multivariate normal problem and is thus easy to handle in Gibbs sampling. Gibbs updating of π is similarly easy using the k_i's, and Gibbs updates of those are easily handled because they are discrete. In all, the data-dependent "prior" distribution specified in displays (203) and (204) is computationally attractive and leads to effective SEL prediction.
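The "similarly easy" Gibbs update for π is just a Dirichlet draw with the prior parameters incremented by component counts; a minimal R sketch (mine), using the gamma-ratio construction of a Dirichlet draw:

# Gibbs update of the mixture weight vector pi given latent labels k_i:
# pi | k ~ Dirichlet(a_1 + n_1, ..., a_K + n_K) with n_k = #{i : k_i = k}
update_pi <- function(k, K, a) {
  n <- tabulate(k, nbins = K)
  g <- rgamma(K, shape = a + n, rate = 1)  # normalized gammas give a Dirichlet draw
  g / sum(g)
}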

19.3 Bayes Mixture Analyses for Binary Vectors


The foregoing material on Dirichlet priors for multivariate normal mixtures has an interesting parallel in a class of symmetric component models for (binary) vectors x ∈ {0,1}^p. We here briefly discuss recent work of Chakraborty and Vardeman in this direction.
This begins from the exponential family of distributions on {0,1}^p defined for parameters θ ∈ {0,1}^p and α ∈ (0,1) by the pmf

p(x | θ, α) = ( (1 − α) / (1 − α^{p+1}) ) · α^{‖x−θ‖²} / C(p, ‖x−θ‖²)     (205)

(for C(p, m) the binomial coefficient "p choose m"). The parameter θ is a "central value" for the distribution and ‖x − θ‖² is simply the number of coordinates at which x and θ differ. The total probability assigned (uniformly) to those x which differ from θ in m coordinates decreases geometrically in m, and thus α functions as a "spread" parameter for the distribution. (Small values of α yield distributions highly concentrated at θ and large α's have corresponding distributions approximately uniform on {0,1}^p.)
Since an arbitrary distribution for binary vectors x ∈ {0,1}^p is defined by 2^p probabilities, it is obviously possible to approximate any distribution for binary vectors with a mixture of distributions with pmfs (205) to any desired degree of precision. What is really more interesting is the possibility of describing distributions for binary vectors in many applications as mixtures with relatively few components, the θ's representing modes or representative cases in the space {0,1}^p and the corresponding α's controlling how many changes of coordinates away from these modes are likely.
For local notational convenience, call a distribution on {0,1}^p of the form (205) the SBEF_p(θ, α) distribution (for Symmetric Binary Exponential Family). So a K-vector of (mixture) probabilities π = (π_1, π_2, ..., π_K) (of course with Σ_{k=1}^{K} π_k = 1) and K centers θ_k and spread parameters α_k together specify a mixture distribution

Σ_{k=1}^{K} π_k SBEF_p(θ_k, α_k)     (206)

that is a potentially useful basis of modeling and inference.
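Under the reading of (205) used above (shell totals geometric in the number of disagreements, uniform within a shell), simulation from SBEF_p(θ, α) is easy: draw the number of disagreements and then flip that many randomly chosen coordinates of θ. A short R sketch of that (my own illustration of the reconstructed form, not code from the Chakraborty-Vardeman work):

# simulate one draw from SBEF_p(theta, alpha) as read in display (205):
# P(m disagreements) proportional to alpha^m for m = 0, ..., p, with the
# disagreeing coordinates then chosen uniformly at random
r_sbef <- function(theta, alpha) {
  p <- length(theta)
  m <- sample(0:p, 1, prob = alpha^(0:p))
  x <- theta
  if (m > 0) {
    flip <- sample(p, m)
    x[flip] <- 1 - x[flip]
  }
  x
}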


As in the MVN case of the previous discussion, it proves helpful to invent and use latent "component identity" variables. That is, for a training set x_1, x_2, ..., x_N, invent N corresponding variables k_1, k_2, ..., k_N each taking values in {1, 2, ..., K} iid with marginal distribution specified by π. If conditioned on k_i each

x_i ∼ SBEF_p(θ_{k_i}, α_{k_i})

the training set has N observations that are iid according to the mixture distribution (206). Then for a prior distribution

π ∼ Dirichlet(a_1, ..., a_K)

(all a_k's the same works well in practice) and independent priors

θ_k ∼ iid U({0,1}^p)  and  α_k ∼ iid g(·)     (207)

where

g(α) ∝ sqrt( (1 − (p+1)² α^p + 2p(p+2) α^{p+1} − (p+1)² α^{p+2} + α^{2p+2}) / ( (1 − α^{p+1})² (1 − α)² α ) )

(this latter is a univariate Jeffreys prior for α in a model where θ is known to be 0), MCMC algorithms for sampling from the posterior distribution of the parameters of the mixture and the latent variables are easy to find. Iterates of the parameter vectors from such an algorithm produce a number of useful quantities. θ's that appear with highest frequency in the iterates are candidates for modes/representative cases of binary vectors. The average across iterates of the probability assigned by the mixture to a value x serves as a posterior mean for the probability of that value. And for any two indices i and i', the relative frequency among iterates with which k_i = k_{i'} serves as an approximate posterior probability that case i and case i' share a common origin and can be used as a basis for clustering cases, much as was suggested for Zhou's Bayesian SOM and Chakraborty's Bayes biclustering.
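The clustering summary described last is simple to compute from stored MCMC output; an R sketch (mine) assuming the sampled label vectors (k_1, ..., k_N) are stored as rows of a matrix:

# approximate posterior co-assignment probabilities P(k_i = k_i' | data)
# from an (iterations x N) matrix K_draws of sampled component labels
coassignment <- function(K_draws) {
  N <- ncol(K_draws)
  P <- matrix(0, N, N)
  for (t in 1:nrow(K_draws))
    P <- P + outer(K_draws[t, ], K_draws[t, ], "==")
  P / nrow(K_draws)   # large entries suggest cases that cluster together
}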

Part VII
Appendices
A Exercises
A.1 Section 1.2 Exercises
These are exercises intended to provide intuition that data in ℝ^p are necessarily "sparse." The realities are that ℝ^p is "huge" and for p at all large, "filling up" even a small part of it with data points is effectively impossible, and our intuition about distributions in ℝ^p is very poor.

1. (6HW-11) Let Q_p(t) and q_p(t) be respectively the χ²_p cdf and pdf. Consider the MVN_p(0, I) distribution and Z_1, Z_2, ..., Z_N iid with this distribution. With

M = min{ ‖Z_i‖ | i = 1, 2, ..., N }

write out a one-dimensional integral involving Q_p(t) and q_p(t) giving EM. Evaluate this mean for N = 100 and p = 1, 5, 10, and 20 either numerically or using simulation.

2. (6HW-13) For each of p = 1, 5, 10, and 20, generate at least 1000 realizations of pairs of points x and z as iid uniform over the p-dimensional unit ball (the set of x with ‖x‖ ≤ 1). Compute (for each p) the sample average distance between x and z. (For Z ∼ MVN_p(0, I) independent of U ∼ U(0,1), x = (U^{1/p}/‖Z‖) Z is uniformly distributed in the unit ball in ℝ^p.)
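A compact R implementation of the sampler described in the parenthetical hint (a sketch of mine, not required for the exercise):

# n draws uniform in the unit ball of R^p via the hint:
# x = (U^(1/p) / ||Z||) Z with Z ~ MVN_p(0, I) and U ~ U(0, 1)
runif_ball <- function(n, p) {
  Z <- matrix(rnorm(n * p), n, p)
  U <- runif(n)
  Z * (U^(1 / p) / sqrt(rowSums(Z^2)))   # scales each row of Z
}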

3. (5HW-14) For each of p = 10, 20, 50, 100, 500, and 1000, make n = 10,000 draws of distances between pairs of independent points uniform in the cube [0,1]^p. Use these to make 95% confidence limits for the ratio

(mean distance between two random points in the cube) / (maximum distance between two points in the cube)

4. (5HW-14) For each of p = 10, 20, and 50, make n = 10,000 random draws of N = 100 independent points uniform in the cube [0,1]^p. Find for each sample of 100 points the distance from the first point drawn to the 5th closest point of the other 99. Use these to make 95% confidence limits for the ratio

(mean diameter of a 5-nearest neighbor neighborhood if N = 100) / (maximum distance between two points in the cube)

5. (5HW-14) What fraction of random draws uniform from the unit cube [0,1]^p lie in the "middle part" of the cube, [ε, 1 − ε]^p, for ε a small positive number?

The next 3 problems are based on nice ideas taken from Giraud’s
book.

6. (6HW-15) For p = 2, 10, 100, and 1000, draw samples of size N = 100 from the uniform distribution on [0,1]^p. Then for every (x_i, x_j) pair with i < j in one of these samples, compute the Euclidean distance between the two points, ‖x_i − x_j‖. Make a histogram (one p at a time) of these (100 choose 2) distances.
What do these suggest about how well "local" prediction methods (that rely only on data points (x_i, y_i) with x_i "near" x to make predictions about y at x) can be expected to work?

7. (6HW-15) Consider finding a lower bound on the number of points x_i (for i = 1, 2, ..., N) required to "fill up" [0,1]^p in the sense that no point of [0,1]^p is Euclidean distance more than ε away from some x_i.
The p-dimensional volume of a ball of radius r in ℝ^p is

V_p(r) = ( π^{p/2} / Γ(p/2 + 1) ) r^p

and Giraud notes that it can be shown that as p → ∞

V_p(r) / [ (2πer²/p)^{p/2} (pπ)^{−1/2} ] → 1

Then, if N points can be found with corresponding ε-balls covering the unit cube in ℝ^p, the total volume of those balls must be at least 1. That is

N V_p(ε) ≥ 1

What then are approximate lower bounds on the number of points required to fill up [0,1]^p to within ε for p = 20, 50, and 200, and ε = 1, .1, and .01? (Giraud notes that the p = 200 and ε = 1 lower bound is larger than the estimated number of particles in the universe.)

8. (6HW-15) Giraud points out that for large p, most of MVN_p(0, I) probability is "in the tails." For q_p(x) the MVN_p(0, I) pdf and 0 < δ < 1 let

B_p(δ) = { x | q_p(x) ≥ δ q_p(0) } = { x | ‖x‖² ≤ 2 ln(1/δ) }

be the "central"/"large density" part of the multivariate standard normal distribution.
a) Using the Markov inequality, show that the probability assigned by the multivariate standard normal distribution to the region B_p(δ) is no more than 1/(δ 2^{p/2}).
b) What then is a lower bound on the radius of a ball at the origin (call it r(p)) required so that the multivariate standard normal distribution places probability .5 in that ball? What is an upper bound on the ratio q_p(x)/q_p(0) outside the ball with radius that lower bound? Plot these bounds as functions of p for p ∈ [1, 500].

A.2 Section 1.3 Exercises


1. (6HW-17)
a) Argue carefully that for inherently non-negative response y with loss

L(ŷ, y) = ( ln( (ŷ + 1)/(y + 1) ) )²

a theoretically optimal predictor is

f(x) = exp( E[ln(y + 1) | x] ) − 1

b) The Zillow Kaggle game for predicting (positive) house prices used the loss function

L(ŷ, y) = (ln ŷ − ln y)² = ( ln(ŷ/y) )²

Identify the function of x, call it f(x), that based on a joint distribution P for (x, y) optimizes

E L(g(x), y)

over choices of function g(x).

2. (6HW-13) Consider the loss function L(ŷ, y) = (1 − y ŷ)_+ for y taking values in {−1, 1} and prediction ŷ in ℝ. Suppose that P[y = 1] = p. Write out the expected (over the randomness in y) loss of prediction ŷ. Plot this as a function of ŷ for the cases where first p < .5 and then p > .5. (These are continuous functions that are linear on the intervals (−∞, −1), (−1, 1), and (1, ∞).) What is an optimal choice of ŷ (depending upon p)?

3. (5E1-18) Consider predictors of y ∈ ℝ for x ∈ [0,1] based on linear combinations of the small set of "features" (functions of x)

h_1(x) = I[0 ≤ x ≤ 1/3], h_2(x) = I[1/3 < x ≤ 2/3], and h_3(x) = I[2/3 < x ≤ 1]

and the very small training set

Case (i)   1    2    3    4    5    6
y_i        0    4    10   12   6    10
x_i        .1   .3   .4   .6   .7   .9

a) Without bothering to center y, consider using OLS to fit a predictor for y of the form f̂(x) = b_1 h_1(x) + b_2 h_2(x) + b_3 h_3(x) to the training set. Evaluate the LOOCV RMSPE for this kind of predictor.
b) Center the response (leaving the input as is) and fit a predictor (for centered response) of the form f̂(x) = b_1 h_1(x) + b_2 h_2(x) + b_3 h_3(x) via penalized least squares with penalty λ(b_1² + b_2² + b_3²) for λ > 0. (Give formulas for the 3 coefficients.)

4. (5E1-18) Use the same training set as in Problem 3 above and, without bothering to center y, find the 1-nn SEL predictor for y, say f̂_{1-nn}(x), and evaluate its LOOCV MSPE. (Specify values of the predictor for all x ∈ [0,1] except where there are "ties.")

5. (5HW-18) Consider the Ames House Price dataset and possible predictors of Price. In particular, consider the p = 4 inputs Size, Fireplace, Basementbath, and Land. There are, of course, 2^4 = 16 possible multiple linear regression predictors to be built from these features (including the one with no covariates employed). Use both LOOCV and repeated 8-fold cross-validation implemented through caret train() to compare these 16 predictors in terms of cross-validation root mean squared prediction errors.
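For orientation, calls of roughly the following form are what is intended; this is a hedged sketch in which the data frame name ames is a placeholder of mine and the model formula would be varied over the 16 candidate models.

library(caret)

# repeated 8-fold CV and LOOCV RMSPE for one of the 16 candidate OLS models
ctrl_cv  <- trainControl(method = "repeatedcv", number = 8, repeats = 10)
ctrl_loo <- trainControl(method = "LOOCV")

fit_cv  <- train(Price ~ Size + Fireplace, data = ames, method = "lm",
                 trControl = ctrl_cv)
fit_loo <- train(Price ~ Size + Fireplace, data = ames, method = "lm",
                 trControl = ctrl_loo)
fit_cv$results$RMSE    # cross-validation root mean squared prediction error
fit_loo$results$RMSE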

6. (5HW-18) Consider the famous "Glass Identification" dataset of German on the UCI Machine Learning Data Repository and k-nn classification between glass Types 1 and 2.
a) Use both LOOCV and repeated 10-fold cross-validation to find what you believe to be a best number of neighbors for this prediction task.
b) For your choice of k = number of neighbors in a), the variable t(x) = number of Type 2 cases in the k-nearest neighborhood of x can take values 0, 1, 2, ..., k. The nearest neighbor classifier classifies to Type 2 if t(x) > k/2. This is based on N = 146 training cases, of which 70 are of glass Type 1 and 76 are of glass Type 2. Suppose that you want to use π_1 = .7 and π_2 = .3 and 0-1 loss. How, if at all, would you modify the ordinary k-nn classifier?

7. (5HW-14) Consider 4 different continuous distributions on the 2-dimensional unit square (0,1)² with densities on that space

p((x_1, x_2) | 1) = 1, p((x_1, x_2) | 2) = x_1 + x_2, p((x_1, x_2) | 3) = 2x_1, and p((x_1, x_2) | 4) = x_1 − x_2 + 1

For a 0-1 loss K = 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier f has f(x) = k (for k = 1, 2, 3, 4), first if π = (π_1, π_2, π_3, π_4) is (.25, .25, .25, .25) and then if it is (.2, .2, .3, .3).

8. (5E1-14) Suppose that (unbeknownst to a statistical learner) x U(0; 1)


and E[yjx] = I [:45 < x < :55] (that is, the conditional mean of y given x is 1
when :45 < x < :55 and is 0 otherwise). A 3-nearest-neighbor predictor, f^N ,
is based on N data pairs, and f^N (:5) has conditional means given the values of
the inputs in the training set:

0 if no xi is in (:45; :55)
1=3 if one xi is in (:45; :55)
2=3 if two xi s are in (:45; :55)
1 if three or more xi s are in (:45; :55)

What is the value of the bias of the nearest neighbor predictor at :5? Does this
bias go to 0 as N gets big? Argue carefully one way or the other.

9. (6E1-19) Below is a representation of an N = 6 toy 2-class classification training set with p = 2. Find the 0-1 loss LOOCV error rate for the 3-nearest-neighbor classifier based on these training cases.

10. (6HW-11) Consider SEL prediction. Suppose that in a very simple problem with p = 1, the distribution P for the random pair (x, y) is specified by

x ∼ U(0,1) and y|x ∼ N( x², (1 + x) )

((1 + x) is the conditional variance of the output). Further, consider two possible sets of functions S = {g} for use in creating predictors of y, namely

1. S_1 = { g | g(x) = a + bx for real numbers a, b }, and

2. S_2 = { g | g(x) = Σ_{j=1}^{10} a_j I( (j−1)/10 < x ≤ j/10 ) for real numbers a_j }

Training data are N pairs (x_i, y_i) iid P. Suppose that the fitting of elements of these sets is done by

1. OLS (simple linear regression) in the case of S_1, and

2. according to

â_j = ȳ if no x_i ∈ ((j−1)/10, j/10], and otherwise
â_j = ( 1 / #{ x_i ∈ ((j−1)/10, j/10] } ) Σ_{i with x_i ∈ ((j−1)/10, j/10]} y_i

in the case of S_2,

to produce predictors f̂_1 and f̂_2.
a) Find (analytically) the functions g* for the two cases. Use them to find the two expected squared model biases E_x( E[y|x] − g*(x) )². How do these two compare?
b) For the second case, find an analytical form for E_T f̂_2 and then for the average squared fitting bias E_x( E_T f̂_2(x) − g*(x) )². (Hints: What is the conditional distribution of the y_i given that no x_i ∈ ((j−1)/10, j/10]? What is the conditional mean of y given that x ∈ ((j−1)/10, j/10]?)
c) For the first case, simulate at least 1000 training datasets of size N = 100 and do OLS on each one to get corresponding f̂_1's. Average those to get an approximation for E_T f̂_1. (If you can do this analytically, so much the better!) Use this approximation and analytical calculation to find the average squared fitting bias E_x( E_T f̂_1(x) − g*(x) )² for this case.
d) How do your answers for b) and c) compare for a training set of size N = 100?
e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances E_x Var_T f̂(x) for the two cases for training set size N = 100.
f) In sum, which of the two predictors here has the best value of Err for N = 100?

11. (6HW-11) Two files with respectively N = 100 and then N = 1000 pairs (x_i, y_i) generated according to P in Problem 10 above are provided with these notes. Use 10-fold cross-validation to see which of the two predictors in Problem 10 looks most likely to be effective. (The datasets will not be sorted, so you may treat successively numbered groups of 1/10th of the training cases as your K = 10 randomly created pieces of the training set.)

12. (5HW-14) Again consider SEL prediction. Suppose that (unknown to a statistician) a mechanism generates iid data pairs (x, y) according to the following model:

x ∼ U(−π, π)
y|x ∼ N( sin(x), .25(|x| + 1)² )

(The conditional variance is .25(|x| + 1)².)
a) What is an absolutely minimum value of Err possible regardless of what training set size, N, is available and what fitting method is employed?
b) What linear function of x (which g(x) = a + bx) has the smallest "average squared bias" as a predictor for y? What cubic function of x (which g(x) = a + bx + cx² + dx³) has the smallest average squared bias as a predictor for y? Is the set of cubic functions big enough to eliminate model bias in this problem?

13. (5HW-14) An N = 100 dataset generated according to the model of Problem 12 is provided with these notes. Use 10-fold cross-validation (use the 1st ten points as the first fold, the 2nd 10 points as the second, etc.) based on the dataset to choose among the following methods of prediction for this scenario:

polynomial regressions of orders 0, 1, 2, 3, 4, and 5,

regressions using sets of predictors {1, sin x, cos x} and {1, sin x, cos x, sin 2x, cos 2x}, and

a regression with the set of predictors {1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x}

(Use ordinary least squares fitting.) Which predictor looks best on an empirical basis? Knowing how the data were generated (an unrealistic luxury) which methods here are without model bias?

14. (5E1-14) Consider a joint pdf (for (x, y) ∈ (0,1) × (0,∞)) of the form

p(x, y) = (1/x²) exp(−y/x²) for 0 < x < 1 and 0 < y

(x ∼ U(0,1) and, conditional on x, the variable y is exponential with mean x².)
a) Find the linear function of x (say α + βx) that minimizes E( y − (α + βx) )². (The averaging is over the joint distribution of (x, y). Find the optimizing intercept and slope.)
b) Suppose that a training set consists of N data pairs (x_i, y_i) that are independent draws from the distribution specified above, and that least squares is used to fit a predictor f̂_N(x) = a_N + b_N x to the training data. Suppose that it's possible to argue that the least squares coefficients a_N and b_N converge (in a proper probabilistic sense) to your optimizers from a) as N → ∞. Then for large N, about what value of (SEL) training error do you expect to observe under this scenario?

15. (5E1-16) Unknown to statistical learners in a p = 1 SEL prediction problem, x ∼ U(0,6) and y|x ∼ N( x − 3, (x + 1)² ) (the conditional variance is (x + 1)²). A statistical learner uses a class of predictors S consisting of all functions of the form g_{a,b}(x) = a·I[x < 2] + b·I[x ≥ 2].
a) In this context, what are

the minimum expected loss possible,

the best element of S, and

the learner's modeling penalty?

b) Suppose that based on a training set of size N = N_1 + N_2, where N_1 is the count of x_i that are less than 2 and N_2 is the count of x_i that are at least 2, the fitting procedure used is to take54 â = ȳ_1 and b̂ = ȳ_2 (with the understanding that if N_1 = 0 then â = 0 and if N_2 = 0 then b̂ = 0). Write an explicit expression for the fitting penalty here. (Hint: What is the distribution of N_1? Given that an x_i is less than 2, what are the mean and variance of y? Given that an x_i is at least 2, what are the mean and variance of y?)
c) Suppose that a second statistical learner uses predictors h_{c,d}(x) = c·I[x < 3] + d·I[x ≥ 3]. A best such predictor is in fact h_{−3/2, 3/2}(x) = −(3/2)·I[x < 3] + (3/2)·I[x ≥ 3]. Find a linear combination of the best element of S you identified in a) and this best predictor available to the second learner that is better than either individual predictor.

54 In the obvious way, ȳ_1 is the sample mean output for inputs x_i < 2 and ȳ_2 is the sample mean output for inputs x_i ≥ 2.

16. (6HW-13) Using the datasets provided with these notes, carry out the steps of Problems 10 and 11 above supposing that the distribution P for the random pair (x, y) is specified by

x ∼ U(0,1) and y|x ∼ Exp(x²)

(the exponential mean is x²).

17. (6HW-15) Using the datasets provided with these notes, carry out the steps of Problems 10 and 11 above supposing that the distribution P for the random pair (x, y) is specified by

x ∼ U(0,1) and y|x ∼ N( (3x − 1.5)², (3x − 1.5)² + .2 )

(the Gaussian variance is (3x − 1.5)² + .2).

18. (6E1-15) Consider a SEL prediction problem where p = 1 and the class of functions used for prediction is (the set of constant functions) S = { h | h(x) = c ∀x, for some c ∈ ℝ }. Suppose that in fact

x ∼ U(0,1), E[y|x] = ax + b, and Var[y|x] = dx² for some d > 0

a) Under this model, what is the best element of S, say g*, for predicting y? Use this to find the average squared model bias in this problem.
b) Suppose that based on an iid sample of N points (x_i, y_i), fitting is done by least squares (and thus the predictor f̂(x) = ȳ is employed). What is the average squared fitting bias in this case?
c) What is the average prediction error, Err, when the predictor in b) is employed?

19. (6HW-17) Consider a toy 2-class classification model for p = 1, where x|y = 0 is N(0, 1), x|y = 1 is N(1, (.5)²) (the standard deviation is .5), and P[y = 0] = .5 = P[y = 1].
a) Compute and plot the function P[y = 1 | x].
b) Identify the optimal 0-1 loss classifier and the best possible expected loss/error rate in this classification problem. (This is a numerical problem.)
c) Consider the set of "linear" classifiers

S = { I[x < c] | c ∈ ℝ } ∪ { I[x ≥ c] | c ∈ ℝ }

(that make one cut in the real numbers at c and classify one way to the left of c and the other way to the right of c). Plot as functions of c the risks

E( I[y = 0] I[x < c] + I[y = 1] I[x ≥ c] )

for classifiers of the form I[x < c] and

E( I[y = 0] I[x ≥ c] + I[y = 1] I[x < c] )

for classifiers of the form I[x ≥ c]. What is the best element of S (say, g*) and then what is the "modeling penalty" associated with using the class of predictors/classifiers S (the difference between the optimal error rate and the error rate for g*)?
d) Suppose that for a training set of size N = 100 (generated at random from the distribution described in the preamble of this problem), one will choose a cut point ĉ half way between two consecutive sorted x_i values minimizing

min[ #{y_i = 0 | x_i < c} + #{y_i = 1 | x_i ≥ c}, #{y_i = 1 | x_i < c} + #{y_i = 0 | x_i ≥ c} ]

Then, if

#{y_i = 0 | x_i < ĉ} + #{y_i = 1 | x_i ≥ ĉ} ≤ #{y_i = 1 | x_i < ĉ} + #{y_i = 0 | x_i ≥ ĉ}

one will employ the classifier f̂(x) = I[x < ĉ] and otherwise the classifier f̂(x) = I[x ≥ ĉ]. Simulate 10,000 training samples and find corresponding classifiers f̂. For each f̂ compute a (conditional on the training sample) error rate (an average of two appropriate normal probabilities on half-infinite intervals bounded by ĉ) and average across the training samples. What is the "fitting penalty" for this procedure? Redo this exercise, using a training set of size N = 50. Is the fitting penalty larger than for N = 100?
penalty larger than for N = 100?

20. (6HW-17) Consider the model of Problem 19 above, but change to the "−1 and 1" coding of classes/values of y.
a) Plot the function g* minimizing E exp(−y g(x)) over all choices of real-valued g.
Suppose then that one wishes to approximate this minimizer from part a) with a function of the form β_0 + β_1(x − x̄) + β_2(x − x̄)² based on a training set. Your instructor will provide a training set of size N = 100 based on the model of this problem. Use it in what follows.
b) Use a numerical optimizing routine and identify values β̂_0, β̂_1, β̂_2 minimizing the empirical average loss

R(β_0, β_1, β_2) = (1/N) Σ_{i=1}^{N} exp( −y_i( β_0 + β_1(x_i − x̄) + β_2(x_i − x̄)² ) )

c) Now consider the penalized fitting problem where one chooses to optimize

R_λ(β_0, β_1, β_2) = (1/N) Σ_{i=1}^{N} exp( −y_i( β_0 + β_1(x_i − x̄) + β_2(x_i − x̄)² ) ) + λ β_2²

For several different values of λ > 0, plot on the same set of axes the optimizer from a), the function β_0 + β_1(x − x̄) + β_2(x − x̄)² optimizing R(β_0, β_1, β_2) from b), and the functions optimizing R_λ(β_0, β_1, β_2).

21. (5E2-14) At a particular input vector of interest in a SEL prediction problem, say x, the conditional mean of y|x is 3. Two different predictors, f̂_1(x) and f̂_2(x), have biases (across random selection of training sets of fixed size N) at this value of x that are respectively .1 and .5. The random vector of predictors at x (randomness coming from training set selection) has covariance matrix

Cov( (f̂_1(x), f̂_2(x)) ) = [ 1  .25 ; .25  1 ]

If one uses a linear combination of the two predictors

f̂_ensemble(x) = a f̂_1(x) + b f̂_2(x)

there are optimal values of the constants a and b in terms of minimizing the expected (across random selection of training sets) squared difference between f̂_ensemble(x) and 3 (the conditional mean of y|x). Write out and optimize an explicit function of a and b that (in theory) could be minimized in order to find these optimal constants.

22. (5E1-18) Consider a p = 1 SEL prediction problem where

E[y|x] = x(1 − x), Var[y|x] = x(1 − x), and x ∼ U(0,1)

a) Find the expected loss of a theoretically optimal predictor of y, f^opt(x).
b) Consider predictors of the form

f_c(x) = c_1 I[0 ≤ x < .4] + c_2 I[.4 ≤ x < .6] + c_3 I[.6 ≤ x ≤ 1]

for real constants c_1, c_2, and c_3. Find

E[y | 0 ≤ x < .4], E[y | .4 ≤ x < .6], and E[y | .6 ≤ x ≤ 1]

and argue that these give optimal values for the constants.
c) Give an explicit expression for the expected loss of the optimal predictor of the form f_c(x). Note that together with the first answer this could give the modeling penalty here.
d) Give an explicit expression for the fitting penalty if, based on a training set of size N, the value c_l is estimated by

ĉ_l = ȳ_l · I[at least one x_i is in the interval corresponding to c_l]

(where ȳ_l is the sample mean response for training cases with x_i in the interval corresponding to c_l).

23. (5HW-18) Consider a SEL prediction problem where p = 1, and the class of functions used for prediction is the set of linear functions

S = { h | h(x) = b_0 + b_1 x ∀x, for some b_0, b_1 ∈ ℝ }

Suppose that in fact

x ∼ U(0,1), E[y|x] = x + 2x², and Var[y|x] = .25x²

a) Under this model, what is the best element of S, say g*, for predicting y? Use this to find the modeling penalty/average squared model bias in this problem.
b) What is the smallest possible expected loss here (the mean squared prediction error of the theoretically best predictor, f(x) = x + 2x²)?

Now consider the situation where N = 50 and simple linear regression (OLS) is used to choose an element of S based on a training set. Simulate a large number of training sets (at least 1000 of them) of this size according to the model here using normal conditional distributions for y|x. For each simulated training set, find the simple linear regression slope and intercept and use these to estimate the mean vector and covariance matrix for the fitted regression coefficients (for this sample size and this model). Use the estimated mean and covariance as follows.
c) Estimate the linear function of x that is the difference between your answer to a) and the average linear function produced by SLR in this context. Find the expected square of this difference according to the U(0,1) distribution of x. (This is an estimate of the expected squared fitting bias here.)
d) Using your estimated covariance matrix, approximate the function of x that is the variance (across training sets) of the value on the least squares line at x. Find the mean of this function according to the U(0,1) distribution of x. (This is an estimate of the expected prediction variance.)
e) In light of c) and d), what is the (estimated by simulation) fitting penalty in this context? What then is an approximate value for Err?

24. (6HW-17) Consider the Ames house price dataset of Problem 5 above and the famous Wisconsin breast cancer dataset on the UCI Machine Learning Data Repository. The latter has 683 = 699 − 16 complete cases (16 cases are incomplete) with p = 9 numerical characteristics of biopsied tumors, 239 of which were malignant and 444 of which were benign. Use the train() function in the caret package in R and do the following.
a) Find a best k for k-nn SEL prediction of home selling price, first using repeated 8-fold cross-validation and then LOO cross-validation. Be sure to use standardized inputs (even for the 0-1 indicators) and to re-standardize for each fold. Plot the cross-validation root mean squared prediction error as a function of k. How does the training root mean squared prediction error for the best k compare to the corresponding cross-validation root mean squared prediction error?
b) Find a best k for k-nn classification between benign and malignant cases based on 0-1 loss, first using repeated 10-fold cross-validation and then LOO cross-validation. Be sure to use standardized inputs and to re-standardize for each fold. Plot the cross-validation classification error rate as a function of k. How does the training error rate for the best k compare to the corresponding cross-validation error rate?

25. (5E2-14) Below are class-conditional pmfs for a discrete predictor variable x in a K = 3 class 0-1 loss classification problem. Suppose that probabilities of y = k for k = 1, 2, 3 are π_1 = .4, π_2 = .3, and π_3 = .3. For each value of x give the corresponding value of the optimal (Bayes) classifier f^opt.

y \ x    1    2    3    4    5    6
1        .2   .1   .2   .1   .1   .3
2        .1   .1   .3   .3   .1   .1
3        .2   .1   .2   .2   .2   .1

26. (5E2-14) A training set of size N = 3000 produces counts of (x, y) pairs as in the table below. (Assume these represent a random sample of all cases.) For each value of x give the corresponding value of an approximately optimal 0-1 loss (Bayes) classifier f̂.

y \ x    1     2     3     4     5     6
1        95    155   145   205   105   150
2        305   105   195   140   195   155
3        150   190   160   155   150   245

27. (5E1-20) In a p = 1 SEL prediction problem, suppose that (unknown to a statistical learner) x ∼ U(0,1) and E[y|x] = x³.
a) Two sets of functions mapping (0,1) → ℝ, S_1 = {c}_{c∈ℝ} and S_2 = {dx}_{d∈ℝ}, might be searched for a suitable function to predict target y based on input x. Determine which set provides the smaller model bias.
b) Suppose that (again unknown to the statistical learner) Var[y|x] = x², from which it follows that Ey = 1/4 and Var y = 139/336. The statistical learner uses S_1 = {c}_{c∈ℝ} and the predictor f̂(x) = ȳ_N (the sample mean output from N training cases). What then is the value of the fitting penalty?

28. (5E1-20) Consider a p = 1 SEL prediction problem. Suppose that a predictor of the form

f̂_c(x) = c ( Σ_{i=1}^{N} x_i y_i / Σ_{i=1}^{N} x_i² ) x

is to be used. (This is a multiple of the least squares slope estimate in a no-intercept regression model based on the training set.) The multiplier c > 0 remains to be chosen. Suppose that training data are as follows.

x 3 1 0 0 4
y 2 1 0 1 2
a) Write out an explicit form for the leave-one-out cross-validation mean squared prediction error for f̂_c(x) in this toy example. (This is a function of the real variable c, say CV(c).)
b) The value of c minimizing CV(c) in a) turns out to be ĉ = .9784. Show this. Why is CV(.9784) not a good indicator of the effectiveness of prediction methodology that in general employs form f̂_c with c chosen by optimizing CV(c)? How would you produce a reliable predictor of the performance of f̂_ĉ in this problem? (Explain clearly and completely.)

29. (5E1-20) In a K = 3 class classification problem with p = 1, class-conditional pdfs for x on (0,1) are

p(x|1) = I[0 < x < 1], p(x|2) = 3x² I[0 < x < 1], and p(x|3) = 3(1 − x)² I[0 < x < 1]

a) With class probabilities π_1 = π_2 = π_3 = 1/3 and 0-1 loss, give an explicit form of a theoretically optimal classifier f(x) and evaluate the minimum possible overall error rate (the expected loss of your optimal classifier).
b) With class probabilities π_1 = .25, π_2 = .375, and π_3 = .375 and 0-1 loss, give an explicit form of a theoretically optimal classifier f(x) and evaluate the corresponding class-conditional error rates (conditional probabilities of a misclassification for y = 1, y = 2, and y = 3).
misclassi…cation for y = 1; y = 2; and y = 3).

30. Suppose that in a p = 1 context, one is to predict y under squared error loss on the basis of a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, and for

r_i = y_i / x_i and r̄ = (1/N) Σ_{i=1}^{N} r_i

under consideration are the two very simple predictors

f̂_1(x) = x and f̂_2(x) = r̄ x

Under the usual setup where the N pairs in T are iid according to P independent of (x, y) ∼ P, consider P defined by a marginal distribution x ∼ U(1/2, 3/2) and conditional distributions y|x ∼ N(μx, σ²).
a) Show that

Err_1 = E( y − f̂_1(x) )² = σ² + (13/12)(μ − 1)²

and that

Err_2 = E( y − f̂_2(x) )² = σ² + (13/(9N)) σ²

so that the first predictor is preferable to the second provided (μ − 1)²/4 < σ²/(3N), i.e. provided (μ − 1)²/σ² < 4/(3N) (a fact that is of no practical use to a statistical learner not in full possession of the model generating the training set!).

Consider LOOCV-guided choice between the two simple predictors for this problem. The LOOCV MSPE for f̂_1(x) is

CV_1 = (1/N) Σ_{i=1}^{N} (y_i − x_i)² = (1/N) Σ y_i² − (2/N) Σ y_i x_i + (1/N) Σ x_i²

Then for r̄_{(i)} = (1/(N−1)) Σ_{j≠i} r_j, the LOOCV MSPE for f̂_2(x) is

CV_2 = (1/N) Σ_{i=1}^{N} ( y_i − r̄_{(i)} x_i )² = (1/N) Σ y_i² − (2/N) Σ r̄_{(i)} y_i x_i + (1/N) Σ r̄_{(i)}² x_i²

So CV_1 < CV_2 when

2 Σ r̄_{(i)} y_i x_i − 2 Σ y_i x_i < Σ r̄_{(i)}² x_i² − Σ x_i²

that is, when

2 Σ ( r̄_{(i)} − 1 ) y_i x_i < Σ ( r̄_{(i)}² − 1 ) x_i²

Thus, the pick-the(-cross-validation-)winner predictor here is

f̃(x) = x · I[ 2 Σ ( r̄_{(i)} − 1 ) y_i x_i < Σ ( r̄_{(i)}² − 1 ) x_i² ] + r̄ x · I[ 2 Σ ( r̄_{(i)} − 1 ) y_i x_i ≥ Σ ( r̄_{(i)}² − 1 ) x_i² ]

The generalization/prediction error for this pick-the-winner predictor is

Err_ptw = E( y − f̃(x) )²

which is certainly NOT just min(Err_1, Err_2). Further, this prediction error Err_ptw is NOT naively approximated by the cross-validation error of the winner

min(CV_1, CV_2)

b) To demonstrate all this, generate 1000 simulated training sets of size N = 27 and an additional observation pair (x, y) for each of these, using σ = 2 for values of μ = 3/9, 5/9, 7/9, ..., 15/9. (This is 7 sets, one for each μ considered, of 1000 training sets, each of size N = 27.) For each training set, find f̃(x) and (y − f̃(x))² and average the squared differences across the 1000 sets in each group to produce a simulation-based estimate of Err_ptw for each value of μ. How do these averages compare to the values of min(Err_1, Err_2) for these cases? For each value of μ compare the distribution of the 1000 random values min(CV_1, CV_2) produced to the approximate value of Err_ptw. Does the random variable min(CV_1, CV_2) appear to be a good estimator of Err_ptw? Does it appear to be biased, and if so, in what direction?
c) Should one wish to make an honest empirical assessment of the likely performance of f̃(x), what can be done using LOOCV is this. For each "fold" consisting of case i, use the "remainder" consisting of the other N − 1 cases to compute a "remainder i version" of the pick-the-winner predictor, say f̃_{(i)}. That is, let r̄_{(i,j)} = (1/(N−2)) Σ_{l≠i, l≠j} r_l and define

f̃_{(i)}(x) = x · I[ 2 Σ_{j≠i} ( r̄_{(i,j)} − 1 ) y_j x_j < Σ_{j≠i} ( r̄_{(i,j)}² − 1 ) x_j² ] + r̄_{(i)} x · I[ 2 Σ_{j≠i} ( r̄_{(i,j)} − 1 ) y_j x_j ≥ Σ_{j≠i} ( r̄_{(i,j)}² − 1 ) x_j² ]

and use f̃_{(i)}(x_i) in predicting y_i. The appropriate LOOCV error is then CV_ptw = (1/N) Σ_{i=1}^{N} ( y_i − f̃_{(i)}(x_i) )².
For the case of μ in part b) with the worst match between Err_ptw and the distribution of the variable min(CV_1, CV_2), find the 1000 values of CV_ptw. Does the random variable CV_ptw seem to be a better estimator of Err_ptw than the naive min(CV_1, CV_2)? Explain.

31. Consider the case of random variables C(i) for i ∈ I (some index set) and let C stand for the random vector/function with coordinates/entries C(i). Define the random variable

i* = arg min_{i∈I} C(i)

(a minimizer of the entries of C). (We'll assume enough regularity here that there are no issues in defining this variable or any of the probabilities or expectations used here.)
Suppose that of interest is the (non-random) vector/function EC, its (non-random) optimizer

i_opt = arg min_{i∈I} EC(i)

and its minimum/optimum value EC(i_opt).
a) Why is it "obvious" that

E C(i*) ≤ EC(i_opt)?

b) Argue carefully that unless with probability 1 the non-random value i_opt is a minimizer of the random vector/function C,

E C(i*) < EC(i_opt)

c) Say what the line of thinking in this problem implies about cross-validation and a "pick-the-winner" prediction strategy. (Does it address the fact that almost always in predictive analytics contests, when final results based on prediction for new cases are revealed they are worse than what contestants expect for a test error?)

A.3 Section 1.4 Exercises


1. (5HW-16) Consider a 0-1 loss K = 2 classification problem with p = 1, π_0 = π_1 = 1/2, and pdfs

p(x|0) = I[−.5 < x < .5] and p(x|1) = 12x² I[−.5 < x < .5]

a) What is the optimal classification rule in this problem?
b) If one were to do "feature engineering" here, adding some function of x, say t(x), to make a vector of features (x, t(x)) for classification purposes (hoping to eventually employ a good "linear classifier"

f̂(x, t(x)) = I[a + bx + ct(x) > 0]

for appropriate constants a, b, and c), what (knowing the answer to a)) would be a good choice of t(x)? (Of course, one doesn't know the answer to a) when doing feature selection!)
c) What is the "minimum expected loss possible" part of Err in this problem?
d) Identify the best classification rule of the form g_c(x) = I[x > c]. (This is g* for S = {g_c}. This could be thought of as the 1-d version of a "best linear classification rule" here ... where linear classification is not so smart.) What is the "modeling penalty" part of Err in this situation?
e) Suggest a way that you might try to choose a classification rule g_c based on a very large training sample of size N. Notice that a large training set would allow you to estimate cumulative conditional probabilities P[x ≤ c | y] by relative frequencies

(number of training cases with x_i ≤ c and y_i = y) / (number of training cases with y_i = y)

261
2. (5E1-15) Consider two probability densities on the unit disk in <2 (i.e. on
(x1 ; x2 ) j x21 + x22 6 1 ),
q
1 3
p (x1 ; x2 j1) = and p (x1 ; x2 j2) = 1 (x21 + x22 )
2
and a 2-class 0-1 loss classi…cation problem with class probabilities 1 = 2 = :5.
a) Give a formula for a best-possible single feature T (x1 ; x2 ).
b) Give an explicit form for the theoretically optimal classi…er in this prob-
lem.

3. (5E1-18) Consider a K = 3 classi…cation model with p = 3 class-conditional


3
densities on [0; 1]

p (x1 ; x2 ; x3 j1) = 2x1 ; p (x1 ; x2 ; x3 j2) = 2x2 ; and p (x1 ; x2 ; x3 j3) = 2x3

a) Identify two real-valued features T1 (x) and T2 (x) that together provide
complete summarizations of all information about the class label y 2 f1; 2; 3g
provided by x = (x1 ; x2 ; x3 ).
b) For the case of 1 = 2 = 3 = 31 give the form of an optimal 0-1 loss
classi…er in terms of the values t1 and t2 of T1 (x) and T2 (x).
c) For the case of 1 = :6; 2 = :4; and 3 = 0 where L (^ y ; 1) = 10I [^
y 6= 1]
and otherwise L (^y ; y) = I [^
y 6= y], give the form of an optimal classi…er in terms
of the value of x = (x1 ; x2 ; x3 ).

4. (6E2-15) One can consider the possibility of "kernelizing" nearest-neighbor


prediction. ("Kernelization" amounts to mapping x 2 <p to K (x; ) in the
abstract function space, A, and using inner products in that space–and corre-
sponding distances–based on the kernel.) Using the Gaussian kernel K (x; z) =
2
exp kx zk , what is the abstract-space distance between K (x; ) and K (z; )?
Describe the set of training cases xi 2 <p with K (xi ; ) in the k-nearest neigh-
borhood of K (x; ) in the abstract space A.

5. (6E1-17) In Section 1.4.3 there is an assertion that for a …nite set B, say
B = fb1 ; b2 ; : : : ; bm g, for jAj the number of elements in A B, one kernel
function on subsets of B is

K (A1 ; A2 ) = 2jA1 \A2 j

(B could, for example, be a list of attributes that an item might or might not
possess.)
a) Prove that K is a kernel function using the "kernel mechanics" facts.
(Hint: You may …nd it useful to associate with each A B an m-dimensional
m
vector of 0s and 1s, call it xA 2 f0; 1g , with xAl = 1 exactly when bl 2 A.)
b) Let T (A) ( ) = K (A; ) = 2jA\ j map subsets of B to real-valued functions
of subsets of B. In the abstract space A (of real-valued functions of subsets of
B) what is the distance between T (A) and T (B), kT (A) T (B)kA ?

262
For N training "vectors" (Ai ; yi ) (Ai B and yi 2 <) consider the cor-
responding N points in A <, namely (T (Ai ) ; yi ) for i = 1; : : : ; N . De…ne
a k-neighborhood Nk (V ) of a point (function) V 2 A to be a set of k points
(functions) T (Ai ) with smallest kT (Ai ) V kA .
c) Carefully describe a SEL k-nn predictor of y, f (V ), mapping elements
V of A to real numbers y^ in <. Then describe as completely as possible the
corresponding predictor f (T (A)) mapping A B to y^ 2 <.
d) A more direct method of producing a kind of k-nn predictor of y is to
take account of the hint for part a) and for subsets A and C of B, to associate
m-vectors of 0s and 1s respectively xA and xC and de…ne a distance between
sets A and C as the Euclidean distance between xA and xC . This typically
produces a di¤erent predictor than the one in part c). Argue this point by
considering distances from xA and xC and from xA and xD in <m and from
T (A) and T (C) and from T (A) and T (D) in the space A for cases with jAj =
10; jCj = 4; jDj = 5; jA \ Cj = 2; and jA \ Dj = 3.

6. (6HW-13) For a > 0, consider the function K : <2 <2 ! < de…ned by
2
K (x;z) = exp kx zk

a) Use the facts about kernel functions in Section 1.4.3 to argue that K is a
2
kernel function. (Note that kx zk = hx; xi + hz; zi 2 hx; zi.)
b) Argue that there is a ' : <2 ! <1 so that with (in…nite-dimensional)
feature vector ' (x) the kernel function is a "regular <1 inner product"
1
X
K (x;z) = h'(x) ;'(z)i1 = 'l (x) 'l (z)
l=1

(You will want to consider the Taylor series expansion of the exponential func-
tion about 0 and coordinate functions of ' that are multiples of all possible
products of the form xp1 xq2 for non-negative integers p and q. It is not necessary
to …nd explicit forms for the multipliers, though that can probably be done.
You do need to argue carefully though, that such a representation is possible.)

7. (6E1-19) Consider a 3-class classi…cation problem with input vector x 2


5
[0; 1] . Suppose that class-conditional densities for x are of the forms
4 4
p (xj1) = (x1 x2 + x4 ) ; p (xj2) = (x3 x4 + x5 ) ; and p (xj3) = 4x4 x5
3 3
5
(all on [0; 1] ) and that class probabilities are 1 = 2 = 3 = 31 .
a) Give expressions for the smallest set of features possible for representing
x without loss of information in this model.
b) Evaluate the conditional probability that y = 1 given that x = 21 ; 12 ; 21 ; 12 ; 12 .
c) Give explicit prescriptions for conditions on x under which an optimal
classi…er (say, f (x)) has f (x) = 1, under which f (x) = 2, and under which
f (x) = 3.

263
8. (6E1-19) "Correlation functions" from time series and spatial modeling
(and analysis of "computer experiments") are a source of reproducing kernels
for use in machine learning. In a 1992 paper, Mitchell and Morris introduced
the useful correlation function
8 3
< 1 6d2 + 6 jdj if jdj < :5
3
(d) = 2 (1 jdj) if :5 jdj 1
:
0 if jdj > 1

(Interestingly, (d) is a natural cubic spline.55 ) Here we will use it to make the
reproducing kernel
K (x;z) (kx zk)
mapping <p <p ! <. For sake of concreteness, take p = 2.
a) For the mapping from <2 to the abstract function space A de…ned by the
kernel T (x) ( ) K (x; ), …nd numerical values for

kT (x)kA
1 1
T ((0; 0)) + 2T 2; 2 ; 3T ((1; 1)) A

kT ((0; 0)) T ((0; :6))kA

b) Consider the problem of (penalized SEL) prediction of y from x = (x1 ; x2 )


based on N training cases. Suppose that one will "correct" ridge regression
by addition of an appropriate linear combination of the functions K (xi ; ) to
produce a …nal predictor. That is, for centered ys and standardized xs consider
a predictor of form
N
X
f (x) = 1 x1 + 2 x2 + i K(xi ;x)
i=1

using for 1 > 0 and 2 > 0 a penalty

N 2
X
2 2
1 1 + 2 + 2 i K(xi ; )
i=1 A

(that penalizes both the "size" of the linear part of the predictor and the "size"
of the kernel-based correction to it). Develop (for …xed 1 and 2 and training
set and using notation K for the Gram matrix) a quadratic function of the co-
e¢ cients 1 ; 2 ; 1 ; 2 ; : : : ; N that you would optimize to produce a predictor.

9. (6E2-15) A 3-class classi…cation model has k = P [y = k] = 31 for k =


1; 2; 3, and densities p (xjk) for the conditional distributions of xjy = k, k =
1; 2; 3. For some pair of features T1 (x) and T2 (x) show that:
5 5 See Section 4.2 for the meaning of this language.

264
a) Optimal classi…cation for each of the 3 pairs of classes is linear classi…ca-
tion based on the features t1 and t2 . (De…ne the features and show the linear
classi…cation boundaries on axes like those below. Indicate the scales for the
features.)

b) The optimal 3-class classi…er can be realized as a "OVO" (one-versus-


one) combination of the three 2-class classi…ers. (Show the optimal classi…cation
boundaries in terms of features t1 and t2 and indicate which regions correspond
to which classi…cation decisions.)

10. (5HW-16) Return to the context of Problem 13 of Section A.2 and the
last/largest set of predictors. Center the y vector to produce (say)Y , remove
the column of 1s from the X matrix (giving a 100 9 matrix) and standardize
the columns of the resulting matrix, to produce (say) X .
a) If one somehow produces a coe¢ cient vector for the centered and
standardized version of the problem, so that

yb = 1 x1 + 2 x2 + + 9 x9

what is the corresponding predictor for y in terms of

1; x; x2 ; x3 ; x4 ; x5 ; sin x; cos x; sin 2x; cos 2x ?

b) Do the transformations and …t the equation in a) by OLS. How do the


…tted coe¢ cients and error sum of squares obtained here compare to what you
get simply doing OLS using the raw data (and a model including a constant
term)?

265
11. (5HW-18) Consider a toy 3-class classi…cation problem with conditional
2
distributions xjy that are N(0; 1) for y = 1, N 1; (:5) (the standard deviation
is :5) for y = 2; and N(2; 1) for y = 3 and class probabilities that are 1 = 2 =
3 = 1=3.
a) Plot the three functions
P [y = 1jx] ; P [y = 2jx] ; and P [y = 3jx]
b) The exposition identi…es an optimal pair of "features" for this 3-class
problem. Plot those two features, say t1 (x) and t2 (x) on the same set of axes.
c) Show that the optimal 3-class 0-1 loss classi…er for any set of class prob-
abilities 1 ; 2 ; and 3 can be written as a function of the features from b).
2
12. (5E1-20) 4. It is well-known that K (z; x) = (1 + xz) mapping <2 ! <
is a legitimate "kernel function."
a) Suppose that for the training data of Problem 28 in Section A.2, one
determines to …t a predictor for y of the form
5
X
f^ (x) = i K(x; xi )
i=1

by penalized least squares, using a ( ) multiple of the abstract (reproducing-


kernel-function-space) squared norm of f^ as the penalty. Write out in completely
explicit form the quantity to be minimized in order to do the …tting. (This is a
function of 1 ; 2 ; : : : ; 5 and . You don’t need to do scalar or matrix algebraic
simpli…cation, but your answer must evaluate to a number when values for the
s and are plugged in.)
b) T (x) ( ) = K (x; ) is non-linear map < ! A. Show that the span of
fT (x1 ) ; : : : ; T (x5 )g in A is 3-dimensional. (What kinds of functions of a single
real variable are mapped onto by T ?) In light of this fact and the nature of the
functions T (xi ) ( ) propose a di¤erent penalized least squares …tting problem
that has the same set of possible predictors as in a) but requires optimization
over only 3 coe¢ cients 1 ; 2 ; 3 (for a given penalty weight " "). (You do not
need to try to match the objective in a) exactly. You need only to provide a
sensible penalized version of …tting over the same set of functions.)

13. (6E2-13) Below are 3 (of hypothetically many) text "documents" in a


corpus using the alphabet A = fa,bg. Consider preparing a data matrix for
text processing for such documents. In particular, for each of the documents
below, prepare a row of a data matrix consisting of all 1-gram frequencies,
all 2-gram frequencies, and a feature quantifying the discounted (use = :5)
appearances of the interesting string "aaaa" in the documents. (In computing
this latter feature, count only strings with exactly 4 as in them. Don’t, for
example, count strings with 5 a’s by ignoring one of the interior a’s.)
Document 1: a a b a b b a a a b b b b a a a b a b a
Document 2: a a a b b b a b a a
Document 3: b b b b a b a b b a

266
A.4 Section 1.5 Exercises
1. (6E1-17) Consider the 2-class classi…cation model with the coding y 2
f 1; 1g and (for sake of concreteness) x 2 <1 . For g (x) a generic voting function
we’ll consider the classi…er

f (x) = sign (g (x))

Another (besides those mentioned in the exposition) "function loss" sometimes


discussed is
2
h (v) = (v 1)
a) Carefully derive the function g opt (x) optimizing Eh (yg (x)) over choices
of g.
b) To the extent possible, simplify a good upper bound on the 0-1 loss error
rate of a classi…er f (x) made from your g opt (x) from part a).
c) Suppose that in pursuit of a good classi…er, one wishes to optimize an
empirical version of Eh (yg (x)), based on a training set of size N , over the class
of functions of the form

g (xj 0; 1) =2 ( 0 + 1 x) 1

penalized by 2
1 for a > 0. ( is the standard normal cdf.) In as simple
a form as possible, give two equations to be solved simultaneously to do this
…tting.
d) Suppose that as a matter of fact the two class-conditional densities op-
erating are

p (xj 1) = I [0 < x < 1] and p (xj1) = 6x (1 x) I [0 < x < 1]

and that ultimately what is desired is a good ordering function O (x), one that
produces a small value of the "AUC" criterion. Do you expect the methodology
of part c) to produce a function g xj ^0 ; ^1 that would be a good choice of
O (x)? Explain carefully.

2. (6HW-17) Argue carefully that losses h1 ; h2 ; and h3 (negative Bernoulli


loglikelihood term, exponential, and hinge losses) have optimizers of

Eh (yg (x))

(functions g opt (x)) as indicated in the exposition.

3. (5HW-18) In a 2-class classi…cation problem using coding f 1; 1g for the


classes, the fake data below constitute a very small/toy training set.

y 1 1 1 1 1 1 1 1
x 1 2 3 4 5 6 7 8

267
Consider the production of a "voting function" of the form
8
X 2
gb (x) = bi exp c jx xi j
i=1

by choice of the 8 coe¢ cients bi (for some choice of c > 0) under the "function
loss" h2 (u) = exp ( u). (In the parlance of machine learning, the component
2
functions exp c jx xi j are data-dependent p = 1 "radial basis functions.")
In fact, consider "penalized" …tting.
a) One possible penalized …tting criterion is
8 8
1X X
exp ( yi gb (xi )) + b2i
8 i=1 i=1

for some > 0. For choices of c = :5 and c = 1 optimize this criterion for two
di¤erent values of > 0 and plot the four resulting voting functions on the same
set of axes. Choose (by trial and error) two values of that produce clearly
di¤erent optimizing functions. (optim in R or some other canned routine will
be adequate to do this 8-d optimization.)
2
b) The function K (x; z) = exp c jx zj is a "kernel function" in the
sense of Section 1.4.3. That implies that the 8 8 Gram matrix

K = (K(xi ; xj )) i=1;2;:::;8
j=1;2;:::;8

is non-negative de…nite. Thus, with b = (b1 ; b2 ; : : : ; b8 ) , b0 Kb > 0 and another


0

P
8
possible penalized …tting criterion replaces b2i in part a) with b0 Kb. For the
i=1
same values of c and you used in part a) redo the optimization using this
second penalization criterion and plot the resulting voting functions. Notice, by
the way, that the penalty in a) is a c ! 1 limit of this second penalty!
c) As indicated in Section 1.4.3, the mapping T (x) = K (x; ) from <1 to
functions <1 ! <1 picks out N = 8 functions that are essentially normal pdfs.
Linear combinations of these form a linear subspace of this function space.
Further, there is a valid inner product that can be de…ned on this subspace, for
which
hT (x) ; T (z)iA = K (x; z)
Using this inner product,

what is the inner product of two elements of this subspace, say gb (x) and
gb (x)?
what is the distance between T (x) and T (z),
1=2
kT (x) T (z)kA = hT (x) T (z) ; T (x) T (z)iA ?

268
how is the penalty in b) related to kgb kA (the norm of the linear combi-
nation of functions in the function space)?

4. (6E2-13) Consider a toy 2-class classi…cation problem with p = 1 and


discrete conditional distributions of x indicated in the following table.

x 1 2 3 4 5 6 7 8 9 10
p (xjy = 1) :04 :07 :06 :03 :24 0 :02 :09 :25 :2
p (xjy = 0) :1 :1 :1 :1 :1 :1 :1 :1 :1 :1

a) If P [y = 1] = 2=3 what is the optimal classi…er here and what is its error
rate (for 0-1 loss)?
b) If one cannot observe x completely, but only
8
>
> 2 if x is 1 or 2
>
>
< 4 if x is 3 or 4
x = 6 if x is 5 or 6
>
>
>
> 8 if x is 7 or 8
:
10 if x is 9 or 10

instead, what is the optimal classi…er and what is its error rate (again assuming
that P [y = 1] = 2=3 and using 0-1 loss)?

5. (6E1-19) Return to the situation of Problem 9 of Section A.2. For this toy
dataset the 2 classes are balanced, and a 3-nearest-neighbor neighborhood has
a fraction of "class 1" cases 0; 13 ; 23 ; or 1. Suppose that 3-nn results from this
training set will be used to produce a 0-1 loss classi…er for a scenario in which
(there is severe class imbalance and) the actual probabilities of classes are 0 =
:1 and 1 = :9. Find (and carefully argue that it is correct) the classi…cation
appropriate for an x for which the 3-nearest-neighbor neighborhood has fraction
1
3 of "class 1" cases.

6. (6E1-19) For voting function g (x) in a 2-class classi…cation problem (with


1-1 coding) and function losses h1 and h2 with I [v < 0] 6 h1 (v) 6 h2 (v),
presuming that P 1 (g (x) = 0jy = 1) = P1 (g (x) = 0jy = 1) = 0, the 0-1 loss
error rate of the classi…er f (x) =sign(g (x)), namely

Err = EI [yg (x) < 0]

has upper bounds

b1 (g) = Eh1 (yg (x)) and b2 (g) = Eh2 (yg (x))

a) Why do you know that b1 (g) 6 b2 (g)? Under what circumstances will
it be the case that b1 (g) < b2 (g)?
b) If 1) g minimizes b1 (g) over choices of g, 2) g minimizes b2 (g) over
choices of g, and 3) in fact your conditions in a) are met to imply that b1 (g ) <

269
b2 (g ), does it necessarily follow that g is a strictly better voting function
(produces a better error rate) than g for the original 0-1 loss classi…cation
problem? Explain why or why not.

7. (DMC-19) The 2019 Data Mining Cup sponsored by Prudsys AG featured


a classi…cation problem for fraud detection based on numerical characteristics
of self-checkout transactions at a retail location. The loss function employed,
L (^
y ; y) employed (actually, negative losses or "gains" were speci…ed by the
company), was for y^ and y belonging to ffraud, no fraudg

L (fraud; fraud) = 5; L (fraud; no fraud) = 25;


L (no fraud; fraud) = 5; and L (no fraud; no fraud) = 0

a) An optimal 2-class classi…er for this problem decides in favor of fraud if


P [y = fraudjx] > c. Evaluate c.
b) An optimal 2-class classi…er for this problem decides in favor of fraud
if L (x) (p (xjfraud) =p (xjno fraud)) > c where c depends upon fraud .
Evaluate c for fraud = :1; :01; and :001.

8. (5HW-18) For the toy scenario of Problem 11 of Section A.3, consider a 2-


class classi…cation model for y = 1 and y = 2. Suppose the object is to produce
a function O (x) minimizing (for independent x p (xj1) and x p (xj2))

P [O(x) <O(x )]

a) Plot an optimizing function.


b) Cases i = 1; 2; : : : ; 60 in a hypothetical test set have xi = 2 + (i=10)
and you must make an ordering of the test cases from "least to most likely"
to have corresponding yi = 2. Assign values 1 through 60 to the test set cases
(1 $ least likely to 60 $ most likely) that you would submit in a predictive
analytics contest where the "AUC criterion" is used to judge performance.

A.5 Section 1.6 Exercises


1. (5HW-14) Return to the context of Problem 7 Section A.2.
a) Find the marginal densities for all of the p ((x1 ; x2 ) jk). De…ne 4 new den-
sities p ((x1 ; x2 ) jk) on the unit square by the products of the 2 marginals for the
corresponding p ((x1 ; x2 ) jk). Consider a 0-1 loss K = 4 classi…cation problem
?approximating? the one in the original problem by using the p ((x1 ; x2 ) jk) in
place of the p ((x1 ; x2 ) jk) for the = (:25; :25; :25; :25) case. Make a 101 101
grid of points of the form (i=100; j=100) for integers 0 6 i; j 6 100 and for each
such point determine the value of the optimal classi…er for this new problem.
Using these values, make a plot (using a di¤erent plotting color and/or symbol
2
for each value of y^) showing the regions in (0; 1) where the optimal classi…er
classi…es to each class. Compare this plot to the one in Problem 7 of Section
A.2. (The classi…er here might be called a "naïve Bayes" classi…er.)

270
b) Find the p ((x1 ; x2 ) jk) conditional densities for x2 jx1 . Note that based
on these and the marginals in part a) you can simulate pairs from any of the 4
joint distributions by …rst using the inverse probability transform of a uniform
variable to simulate from the x1 marginal and then using the inverse probability
transform to simulate from the conditional of x2 jx1 . (It’s also easy to use a
2
rejection algorithm based on (x1 ; x2 ) pairs uniform on (0; 1) .)
c) Generate 2 datasets consisting of multiple independent pairs (x; y) where
y is uniform on f1; 2; 3; 4g and conditioned on y = k, the variable x has density
p ((x1 ; x2 ) jk). Make …rst a small training set with N = 400 pairs (to be used
below). Then make a larger test set of 10; 000 pairs. Use the test set to evaluate
the (conditional on the training set) error rates of the optimal rule from Problem
7 Section A.2 and then the "naïve" rule from part a).
d) Based on the N = 400 training set from c), for several di¤erent numbers
of neighbors (say 1; 3; 5; 10) make a plot like that required in part c) showing the
regions where the nearest neighbor classi…er classi…es to each of the 4 classes.
Then evaluate the (conditional on the small training sets) test error rates for
the nearest neighbor rules.
e) Based on the training set, one can make estimates of the 2-d densities as
1 X
2
p^ (xjk) = h xjxi ;
# [i with yi = k]
i with yi =k

for h j ; 2 the bivariate normal density with mean vector and covariance
matrix 2 I. (Try perhaps :1.) Using these estimates and the relative
frequencies of the possible values of y in the training set

# [i with yi = k]
^k =
N
an approximation of the optimal classi…er is
X
f^ (x) = arg max ^k p^ (xjk) = arg max h xjxi ; 2
k k
i with yi =k

Make a plot like that required in part a) showing the regions where this classi…es
to each of the 4 classes. Then evaluate the (conditional on the training set) test
error rate for this classi…er.

A.6 Section 2.1 Exercises


1. (5E1-16) Kernel methods in statistical learning are built on the fact that
for a legitimate kernel function K (x; z) there is an abstract linear space A and a
(non-linear) transform T (x) from <p to that space for which the inner product
of transformed elements of <p is

hT (x) ; T (z)iA = K (x; z)

271
2
Use the Gaussian kernel function K (x; z) = exp kx zk in what fol-
lows. (k k is the usual <p norm.)
a) For an input vector xi 2 <2 , what is the norm of T (xi ) in the abstract
space?
b) For input vectors xi 2 <2 and xl 2 <2 , how is the distance between
T (xi ) and T (xl ) in the abstract space related to the distance between xi and
xl in <p ?

2. (6E1-11) Consider the p-dimensional input space <p and kernel functions
mapping <p <p ! <.
a) Show that for : <p ! <, the function K (x;z) = (x) (z) is a valid
kernel. (You must show that for distinct x1 ; x2 ; : : : ; xN , the N N matrix
K = (K(xi ;xj )) is non-negative de…nite.)
b) Show that for two kernels K1 (x;z) and K2 (x;z) and two positive con-
stants c1 and c2 , the function c1 K1 (x;z) + c2 K2 (x;z) is a kernel.
c) By virtue of a) and b), the functions K1 (x; z) = 1 + xz and K2 (x; z) =
2
1 + 2xz are both kernels on [ 1; 1] . They produce inner product spaces of
functions. Show these are di¤erent.
2
3. (6E1-15) Consider the small space of functions on [ 1; 1] that are linear
combinations RR
of the 4 functions 1; x1 ; x2 ; and x1 x2 , with inner product de…ned
by hh; gi = h (x1 ; x2 ) g (x1 ; x2 ) dx1 dx2 . Find the element of this space
[ 1;1]2
2
closest to h (x1 ; x2 ) = x21 + x22 (in the L2 [ 1; 1] function space norm kgk
1=2
hg; gi ). (Note that the functions 1; x1 ; x2 ; and x1 x2 are orthogonal with this
inner product.)

A.7 Section 2.2 Exercises


1. (6HW-15) Consider the linear space of functions on [ ; ] of the form

h (t) = a + bt + c sin t + d cos t


R
Equip this space with the inner product hu; gi u (t) g (t) dt and norm kgk
1=2
hg; gi . Use the Gram-Schmidt process to orthogonalize the set of functions
f1; t; sin t; cos tg and produce an orthonormal basis for the space.

2. (6HW-11) Consider the linear space of functions on [0; 1] of the form

h (t) = a + bt + ct2 + dt3


R1
Equip this space with the inner product hu; gi 0
u (t) g (t) dt and norm kf k =
1=2
hg; gi . Use the Gram-Schmidt process to orthogonalize the set of functions
1; t; t2 ; t3 and produce an orthonormal basis for the space.

272
2
3. (6HW-13) Consider the linear space of functions on [0; 1] of the form

h (t; s) = a + bt + cs + dt2 + es2 + hts


RR
Equip this space with the inner product hu; gi u (t; s) g (t; s) dtds and
[0;1]2
1=2
norm kgk = hg; gi . Use the Gram-Schmidt process to orthogonalize the set
of functions 1; t; s; t2 ; s2 ; ts and produce an orthonormal basis for the space.

4. (6E1-11) Consider the space of functions on [ 2; 2] corresponding to the


2
kernel K (x; z) = 1 + xz exp (x + z) on [ 2; 2] . (All functions K (x; c) of x
for a c 2 [ 2; 2] belong to the image of [ 2; 2] under the non-linear transform
T (x) ( ) = K (x; ).)
a) Show that the functions g (x) = 1 and h (x) = x exp (x) both belong to
this image of the transform T .
b) Determine whether or not g and h are orthonormal. If they are not, …nd
an orthonormal basis for the span of fg; hg.

2 2
5. (6E2-13) Consider the function K ((x; y) ; (u; v)) mapping [ 1; 1] [ 1; 1]
to < de…ned by
2 2 2
K ((x; y) ; (u; v)) = (1 + xu + yv) + exp (x u) (y v)

on its domain.
a) Argue carefully that K is a legitimate "kernel" function.
b) Pick any two linearly independent elements of the space of functions that
2
are linear combinations of "slices" of the kernel, K ((x; y) ; ), for an (x; y) 2 [ 1; 1]
and …nd an orthonormal basis for the 2-dimensional linear sub-space they span.

6. (5E1-18) The function

K (x;z) = exp ( jx1 z1 j jx2 z2 j)

mapping <2 <2 ! < is a kernel function. Consider three real-valued functions
(of z 2 <2 ):

T ((1; 0)) (z) = K ((1; 0) ;z) = exp ( j1 z1 j jz2 j) ;


T ((0; 1)) (z) = K ((0; 1) ;z) = exp ( jz1 j j1 z2 j) ; and
T ((0; 0)) (z) = K ((0; 0) ;z) = exp ( jz1 j jz2 j)

Using the inner product for the linear space of functions mapping <2 ! <
de…ned for kernel slices by hT (x) ; T (w)iA = K (x;w), …nd the projection of
T ((0; 0)) onto the subspace of functions spanned by the two functions T ((1; 0))
and T ((0; 1)) (i.e. the set of all linear combinations c T ((1; 0)) + d T ((0; 1))
for constants c and d).

273
7. (5HW-18) Below is a small fake dataset with p = 2 and N = 8.

x1 x2 y
1 0 2:03
0 1 :56
1 0 2:21
0 1 1:46
2 2 5:78
1 1 :72
2 2 6:46
1 1 1:37

First center the y values and standardize both x1 and x2 . (We will abuse
notation and use x and z to stand for standardized versions of input vectors.)
2
Make use of the kernel function K (x;z) = exp kx zk and the mapping
2
T (x) = K (x; ) that associates with input vector x 2 < the function K (x; ) :
<2 ! < (an abstract "feature"). In the (very high-dimensional) space of
functions mapping <2 ! <, the N = 8 training set generates an 8-d subspace
of functions consisting of all linear combinations of the T (xi ). Two possible
inner products in that subspace are the "L2 " inner product
ZZ
hg; hiL2 = g (x) h (x) dx
<2

and the inner product de…ned for functions in the range of T ( ) by

hT (x) ; T (z)iA = K (x;z)

Apply the …rst 3 steps of the Gram-Schmidt process to the abstract features of
the training data (considered in the order given in the data table) to identify 3 or-
thonormal functions <2 ! < that are linear combinations of T (x1 ) ; T (x2 ) ; T (x3 ).
Do this …rst using the L2 inner product, and then using the kernel-based inner
product. Are the two sets of 3 functions the same?

A.8 Section 2.3 Exercises


1. (6HW-15) Consider the 5 4 data matrix
2 3
2 4 7 2
6 4 3 5 5 7
6 7
X=66 3 4 6 1
7
7
4 5 2 4 2 5
1 3 4 4

a) Use R and …nd the QR and singular value decompositions of X. What


are the two corresponding bases for C (X)?

274
b) Use the singular value decomposition of X to …nd the eigen (spectral)
decompositions of X 0 X and XX 0 (what are eigenvalues and eigenvectors?).
c) Find the best rank = 1 and rank = 2 approximations to X.

2. (6HW-11) Carry out the steps of Problem 1 above using the matrix
2 3
1 1 1
6 2 1 1 7
X=6 4 1 2 1 5
7

2 2 1

3. (6E2-15) Here is some simple R code and output for a small N = 5 and
p = 4 dataset.
>X
[,1] [,2] [,3] [,4]
[1,] 0.4 2 -0.5 0
[2,] -0.1 0 -0.3 1
[3,] 0.4 0 -0.1 0
[4,] 0.4 0 0.0 -1
[5,] 0.1 2 0.7 0
>
>svd(X)
$d
[1] 2.8551157 1.4762267 0.9397253 0.3549439
$u
[,1] [,2] [,3] [,4]
[1,] 0.70256076 0.06562895 0.6458656 -0.2618499
[2,] -0.01458943 0.69768837 0.1798028 0.2661822
[3,] 0.01628552 -0.05282808 0.2689008 0.8815301
[4,] 0.02268773 -0.71093125 0.2403923 0.1625961
[5,] 0.71092586 -0.02664090 -0.6484076 0.2388488
$v
[,1] [,2] [,3] [,4]
[1,] 0.12929953 -0.23823242 0.403567340 0.8738766
[2,] 0.99014314 0.05282123 -0.005410155 -0.1296041
[3,] 0.05222766 -0.17306746 -0.912659300 0.3665691
[4,] -0.01305627 0.95420275 -0.064475843 0.2918382

a) What is the best rank = 2 approximation to the 5 4 data matrix (in


terms of "Frobenius norm" of the di¤erence between X and the approximation)?
b) Interpret the fact that by far the largest (in absolute value) number in
the …rst column of the "v" matrix is .99014314.

275
A.9 Section 2.4 Exercises
1. (6HW-15) Center the columns of X from Problem 1 of Section A.8 to make
f
the centered data matrix X.
f What are the principal
a) Find the singular value decomposition of X.
component directions and principal components for the data matrix? What are
the "loadings" of the …rst principal component?
f
b) Find the best rank = 1 and rank = 2 approximations to X.
0
c) Find the eigen decomposition of the sample covariance matrix 15 Xf X.
f
Find best 1- and 2-component approximations to this covariance matrix.
f
f Repeat
d) Now standardize the columns of X to make the matrix X.
f
f
parts a), b), and c) using this matrix X.

2. (6HW-11) Carry out the steps of Problem 1 above using the matrix X
from Problem 2 of Section A.8.

3. (5HW-14) Consider the small (7 3 ) fake X matrix below.


2 3
10 10 :1
6 11 11 :1 7
6 7
6 9 9 0 7
6 7
X=6 6 11 9 2:1 7
7
6 9 11 2:1 7
6 7
4 12 8 4:0 5
8 12 4:0

(Note, by the way, that x3 x2 x1 .)


a) Find the QR and singular value decompositions of X. Use the latter
and give best rank = 1 and rank = 2 approximations to X.
b) Subtract column means from the columns of X to make a centered data
matrix. Find the singular value decomposition of this matrix. Is it approx-
imately the same as that in part a)? Give the 3 vectors of the principal
component scores. What are the principal components for case 1?

Henceforth consider only the centered data matrix of b).


c) What are the singular values? How do you interpret their relative sizes
in this context? What are the …rst two principal component directions? What
are the loadings of the …rst two principal component directions on x3 ? What is
the third principal component direction? Make scatterplots of 7 points (x1 ; x2 )
and then 7 points with …rst coordinate the 1st principal component score and
the second the 2nd principal component score. How do these compare? Do
you expect them to be similar in light of the sizes of the singular values?
d) Find the matrices Xv j v 0j for j = 1; 2; 3 and the best rank = 1 and
rank = 2 approximations to X. How are the latter related to the former?
e) Compute the (N divisor) 3 3 sample covariance matrix for the 7 cases.
Then …nd its singular value decomposition and its eigenvalue decomposition.

276
Are the eigenvectors of the sample covariance matrix related to the principal
component directions of the (centered) data matrix? If so, how? Are the eigen-
values/singular values of the sample covariance matrix related to the singular
values of the (centered) data matrix. If so, how?

4. (5HW-16) Consider the small (N = 11) fake p = 2 set of predictors in the


table below.
x1 11 12 13 14 13 15 17 16 17 18 19
x2 18 12 14 16 6 10 14 4 6 8 2

a) Plot raw and standardized versions of 11 predictor pairs (x1 ; x2 ) on the


same set of axes (using di¤erent plotting symbols for the two versions and a 1:1
aspect ratio for the plotting). (One can standardize variables in R using the
scale() function.)
b) Find sample means, sample standard deviations, and the sample correla-
tions for both versions of the predictor pairs.
c) Consider the small (11 2) fake X matrices corresponding to the raw
and standardized versions of the data. Interpret the …rst principal component
direction vectors for the two versions and say why (in geometric terms) they are
much di¤erent.

5. (5HW-14) The functions


2
K1 (x;z) = exp kx zk and
d
K2 (x;z) = (1 + hx;zi)

are legitimate kernel functions for choice of > 0 and positive integer d. Find
the …rst two kernel principal component vectors for X in Problem 3 above for
each of cases

K1 with two di¤erent values of (of your choosing), and


K2 for d = 1; 2.

If there is anything to interpret (and there may not be) give interpretations
of the pairs of principal component vectors for each of the 4 cases. (Be sure to
use the vectors for "centered versions" of the function space principal component
"direction vectors"/functions.)

6. (6HW-17) The function of (x; z) 2 <p <p de…ned by


d
K (x; z) = (1 + c hx; zi)

for c > 0 and positive integer d is well-known to be a kernel function.


a) Argue that indeed K is a kernel function (is non-negative de…nite) using
the facts from Bishop quoted in Section 1.4.3.

277
For d = 2 consider the c = 1 and c = 2 cases of this construction for p = 2.
b) Describe the sets of functions mapping <2 ! < that comprise the abstract
linear spaces associated with the reproducing kernels. What is the dimension
of these spaces?
c) Identify for each case a transform T : <2 ! <M so that

K (x; z) = hT (x) ; T (z)i

(an ordinary <M inner product of the transformed data vectors).


d) For x and z belonging to <2 …nd the distances in the two inner product
function spaces between T (x) ( ) = K (x; ) and T (z) ( ) = K (z; ). (Notice
that these are not the same. Metrics implied by the kernels change with the
kernels.)
e) Below is a small fake dataset. For the c = 1 case, consider these data in
the order listed and use as many of the data vectors as necessary to produce
a (data-dependent) orthonormal basis for the function space spanned by the
T (xi ) ( ). (Use the Gram-Schmidt process in the abstract space.)

x1 x2
1 0
0 1
1 0
0 1
2 2
1 1
2 2
1 1

f ) Note that the fake dataset of part e) is centered in <2 . Find ordinary prin-
cipal component direction vectors v 1 and v 2 and corresponding 8-dimensional
vectors of principal component scores for the dataset. Then …nd the …rst two
kernel principal component vectors corresponding to the c = 1 case of K.

7. (6E1-13) Consider a small fake dataset consisting of N = 6 data vectors in


<2 and use of a kernel function (mapping <2 <2 ! <) de…ned by K (x;z) =

278
2
exp 3 kx zk . The data and Gram matrices are
0 1 0 1
1:01 :99 1 1
B :99 1:01 C B 1 1 C
B C B C
B :01 :01 C B 0 0 C
X=B
B
C
C
B
B
C
C
B 0 0 C B 0 0 C
@ :01 :01 A @ 0 0 A
2:00 2:00 2 2
0 1 0 1
1 :998 :003 :003 :003 :000 1 1 0 0 0 0
B :998 1 :003 :003 :003 :000 C B 1 1 0 0 0 0 C
B C B C
B :003 :003 1 :999 :998 :000 C B 0 0 1 1 1 0 C
and K = B
B
C
C
B
B
C
C
B :003 :003 :999 1 :999 :000 C B 0 0 1 1 1 0 C
@ :003 :003 :998 :999 1 :000 A @ 0 0 1 1 1 0 A
:000 :000 :000 :000 :000 1 0 0 0 0 0 1

As it turns out (using the approximate form for K)


1 1 1
C=K JK KJ + J KJ
6 6 36
has (approximately) a SVD with two non-zero singular values (namely 2:43 and
1:23) and corresponding vectors of principal components
0 0
u1 = ( :51; :51; :39; :39; :39; :15) and u2 = ( :27; :27; :12; :12; :12; :9)

Say what both principal components analysis on the raw data and kernel prin-
cipal components indicate about these data.

8. (6E1-19) Let
2 3 2 3
15 5 1 1 1 1
6 15 5 1 7 6 1 1 1 7
6 7 6 7
6 5 15 1 7 6 1 1 1 7
6 7 6 7
6 5 15 1 7 1 6 1 1 1 7
X=6
6
7 U=
7 p 6
6
7
7
6 5 15 1 7 8 6 1 1 1 7
6 5 15 1 7 6 1 1 1 7
6 7 6 7
4 15 5 1 5 4 1 1 1 5
15 5 1 1 1 1
0 1 2 p1 p1
3
40 2 2
0
D = diag @ p20 A and V = 4 p1
2
p1
2
0 5
2 2 0 0 1

U ; D; and V are the elements of the SVD for X. Use this to answer the
following.
a) Find the best rank = 1 approximation to the matrix X.
b) Identify a (3 1) unit vector w such that the 8 row vectors in X lie
"nearly on" a plane in <3 perpendicular to w.

279
c) Give the eigen decomposition of the (8-divisor) sample covariance matrix
of a p = 3 dataset with cases given by the rows of X. (Give the 3 eigenvalues
and corresponding eigenvectors.)

9. (6E2-15) 6. Below is a small fake p = 2 dataset and a scatterplot for it.


Consider making graphical spectral features for the dataset, using the symmetric
set of index pairs N2 (based on 3-nearest-neighbor neighborhoods–a neighbor-
hood including the point itself) and weight function w (d) = exp d2 . Set up
an appropriate adjacency matrix and give the 8 node degrees.

10. (5HW-18) Return to the context of Problem 7 of Section A.7. Note that
P
8
the function MT = 81 T (xi ) is a linear combinations of (is in the subspace
i=1
of functions generated by) the T (xi ). It makes sense to "center" the abstract
features generated by the training set, replacing each T (xi ) with
S (xi ) = T (xi ) MT
a) Compute the matrix
C = hS (xi ) ; S (xj )iA i = 1; 2; : : : ; 8
j = 1; 2; : : : ; 8
that is the "centered Gram matrix" for kernel PCA in displays (48) and (49).
c) Do an eigen analysis for the matrix C. (For Euclidean features, this
matrix would be a multiple of a sample covariance matrix.) The eigenvectors
of this matrix give kernel principal component scores for the dataset. Consider
the …rst and second of these. To the extent possible, provide interpretations
for them.
d) Find the projection of the function S (:5; :5) onto the span of fT (xi )gi=1;:::;8
in A and compare contour plots for the function and its projection.

11. (5HW-16) A small example of Prof. Morris involves an N = 11 point


dataset in the table below.
y 1:003 :807 :669 :628 :554 :511 :531 :502 :610 :701 :942
x 0 :1 :2 :3 :4 :5 :6 :7 :8 :9 1

280
Center the y values and standardize x. (We will abuse notation and use x
and z to stand for standardized versions of input values.)
2
This question will make use of the kernel function K (x; z) = exp :5 (x z)
and the mapping T (x) ( ) = K (x; ) that associates with input value x 2 <
the function K (x; ) : < ! < (an abstract "feature"). In the (very high-
dimensional) space of functions mapping < ! <, the N = 11 training set
generates an 11-d subspace of functions consisting of all linear combinations
1
P
11
of the T (xi ). As in Problem 10 above set MT = 11 T (xi ) and de…ne
i=1
S (xi ) = T (xi ) MT .
a) Compute the matrix

C = hS (xi ) ; S (xj )iA i = 1; 2; : : : ; 11


j = 1; 2; : : : ; 11

that is the "centered Gram matrix" for kernel PCA in displays (48) and (49).
c) Do an eigen analysis for the matrix C. (For Euclidean features, this
matrix would be a multiple of a sample covariance matrix.) The eigenvectors
of this matrix give kernel principal component scores for the dataset. Consider
the …rst and second of these. To the extent possible, provide interpretations
for them.
d) Find the projection of the function S (:65) onto the span of fT (xi )gi=1;:::;11
in A and plot the function and its projection on the same set of axes.

A.10 Section 3.1 Exercises


1. (6HW-11) Consider "data augmentation" methods of penalized least squares
…tting. p
a) Augment a centered X matrix with p new rows given by I and Y
p p
by adding p new entries 0. Argue that OLS …tting with the augmented dataset
ridge
returns ^ as a …tted coe¢ cient vector.
enet
b) Show how the elastic net …tted coe¢ cient vector ^ 1 ; 2 could be found
using lasso software and an appropriate augmented dataset.

2. (6E1-11) Consider the p = 3 linear prediction problem with N = 5 and


training data
0 1 2 3
p1 0 p1
2 20 2
B 0 p1 p1 C 6 7
B 2 20 C 6 3 7
B C
X=B 0 p1
2
p1
20 C and Y = 6
6 1 7
7
B C 4 5
@ p1 0 p1 A 1
2 20
0 0 p4 3
20

In answering the following, use the notation that the jth column of X is xj .

281
ols
a) Find the …tted OLS coe¢ cient vector ^ .
b) For = 10 …nd the vector c 2 <3 minimizing
ols 0 ols
Y X diag(c) ^ Y X diag(c) ^ + 10 c

over choices of c with non-negative entries.


ridge
c) For > 0 …nd the …tted ridge coe¢ cient vector, ^ .
^ 0
d) For > 0 …nd a …tted coe¢ cient vector minimizing (Y Xb) (Y Xb)+
b22 + b23 as a function of b 2 <3 .
e) Carefully specify the entire Least Angle Regression path of either Y^ or
^ values.

3. (6E1-13) Consider the p = 1 prediction problem with N = 8 and training


data as below.
0 p 1 0 1
1 1 p 2 0 2 0 0 0 8
B 1 1 2 0 2 0 0 0 C B 4 C
B p C B C
B 1 1 2 0 0 2 0 0 C B 4 C
B p C B C
B C B 0 C
B 1 1 2 p0 0 2 0 0 C B C
X=B C and Y = B 2 C
B 1 1 0 p 2 0 0 2 0 C B C
B C B 3 C
B 1 1 0 0 C
B p2 0 0 2
C B C
@ 6 A
@ 1 1 0 2 A
p2 0 0 0
1 1 0 2 0 0 0 2 5

Use the notation that the jth column of X is xj .


ols
a) Find the …tted OLS coe¢ cient vector ^ for a model including only
x1 ; x2 ; x3 ; x4 as predictors.
lasso
b) Center Y to create Y and let xj = 2p 1
x for each j . Find ^
2 j
2 <7
optimizing
0 12
X8 8
X X 8
@yi bj xij A + 5 jbj j
i=1 j=2 i=2

over choices of b 2 <7 .


c) The LAR algorithm applied to Y and the set of predictors xj for j =
2; 3; : : : ; 8 begins at Yc = 0 and takes a piecewise linear path through <8 to
ols
Yc . Identify the …rst two points in <8 at which the direction of the path
changes, call them W 1 and W 2 . (Here you may well wish to use both the
connection between the LAR path and the lasso path and explicit formulas for
the lasso coe¢ cients.)

282
4. (5HW-14) Here is a small fake dataset with p = 4 and N = 8.

y x1 x2 x3 x4
3 1 1 1 1
5 1 1 1 1
13 1 1 1 1
9 1 1 1 1
3 1 1 1 1
11 1 1 1 1
1 1 1 1 1
5 1 1 1 1

Notice that the y is centered and


p the xs are orthogonal (and can easily be
made orthonormal by dividing by 8). Use the explicit formulas for …tted
coe¢ cients in the orthonormal features context to make plots (on a single set
of axes for each …tting method, 5 plots in total) of

1. ^1 ; ^2 ; ^3 ; and ^4 versus M for best subset (of size M ) regression,

2. ^1 ; ^2 ; ^3 ; and ^4 versus for ridge regression,


3. ^1 ; ^2 ; ^3 ; and ^4 versus for lasso,
4. ^1 ; ^2 ; ^3 ; and ^4 versus for = :5 in the elastic net penalty

0 1
N
X p
X p
X
2
(yi y^i ) + @(1 ) ^j + ^2 A
j
i=1 j=1 j=1

5. ^1 ; ^2 ; ^3 ; and ^4 versus for the non-negative garrote.

5. (6HW-11) (3.23 of HTF) Suppose that columns of X with rank p have


been standardized, as has Y . Suppose also that
1
jhxj ;Y ij = 8j = 1; : : : ; p
N
ols ols
Let ^ be the usual least squares coe¢ cient vector and Y^ be the usual
ols
projection of Y onto the column space of X. De…ne Y ( ) = X ^
^ for
2 [0; 1]. Find
1 D E
xj ;Y Y^ ( ) 8j = 1; : : : ; p
N
ols 0 ols
in terms of ; ; and Y Y^ Y Y^ . Show this is decreasing in .
What is the implication of this as regards the LAR algorithm?

283
6. (5HW-14) Return to the context of Problem 13 of Section A.2 and the
last/largest set of predictors. Center the y vector to produce (say) Y , remove
the column of 1s from the X matrix (giving a 100 9 matrix) and standardize
the columns of the resulting matrix, to produce (say) X .
a) Augment Y to Y by adding 9 values 0 at the end of the vector (to
produce a 109 1 vector) and for value = 4 augment X to X (a 109 p 9
matrix) by adding 9 rows at the bottom of the matrix in the form of I .
9 9
What quantity does OLS based on these augmented data seek to optimize?
What is the relationship of this to a ridge regression objective?
b) Use trial and error and matrix calculations based on the explicit form
ridge
of ^ given in Section 3.1.1 to identify a value ~ for which the error sum
of squares for ridge regression is about 1:5 times that of OLS in this problem.
Then make a series of at least 5 values from 0 to ~ to use as candidates for .
Choose one of these as an "optimal" ridge parameter opt here based on 10-
fold cross-validation (as was done in Problem 13 of Section A.2). Compute the
corresponding predictions y^iridge and plot both them and the OLS predictions
as functions of x (connect successive (x; y^) points with line segments). How do
the "optimal" ridge predictions based on the 9 predictors compare to the OLS
predictions based on the same 9 predictors?

7. (6E1-13) Consider prediction of a 0/1 (binary) response using a model that


says that for two (standardized) predictors z1 and z2

exp ( + 1 z1i + 2 z2i )


P [yi = 1j (z1i ; z2i )] =
1 + exp ( + 1 z1i + 2 z2i )

(Training data are N vectors (z1i ; z2i ; yi ).) For this problem, one might de…ne
a (log-likelihood-based) training error as
N
X N
X
err (a; b1 ; b2 ) = ln (1 + exp (a + b1 z1i + b2 z2i )) yi (a + b1 z1i + b2 z2i )
i=1 i=1

How would you regularize …tting of this model "in ridge regression style" (pe-
nalizing only b1 and b2 and not a)? Derive 3 equations that you would need to
solve simultaneously to carry out regularized …tting.

8. (6E2-13) Suppose that for a pair of positive constants 1 6= 2 the predictors


f^1 and f^2 are corresponding ridge regression predictors (their coe¢ cient vectors
solve the unconstrained versions of the ridge minimization problem). Is then
the predictor
1 1
f^ = f^1 + f^2
2 2
in general a ridge regression predictor? (Make a careful/convincing argument
one way or the other.)

284
9. (6HW-15) Show the equivalence of the two forms of the optimization used
to produce the …tted ridge regression parameter. (That is, show that there is a
ridge ridge ridge ridge
t ( ) such that ^ =^ and a (t) such that ^
t( ) =^ t .) (t)

10. (5E1-16) (Ridge regression produces a "grouping e¤ect" for highly corre-
lated predictors) Suppose that in a p-variable SEL prediction problem, input
variables x1 ; x2 ; x3 have very large absolute correlations. Upon standardization
(and arbitrary change of signs of the standardized variables so that all correla-
tions are positive) the variables are essentially the same, and every combination
3
X
wj x00j for w1 ; w2 ; w3 with w1 + w2 + w3 = 1
j=1

is essentially the same. So every set of coe¢ cients 1 ; 2 ; 3 with a given sum
P
3
00
B = 1 + 2 + 3 has nearly the same j xj . Argue then that any minimizer
j=1
!!
PN Pp P
p
of yi 0+ jx
00
+ j
2
has ^ridge
j
^ridge
1
^ridge .
2 3
i=1 j=1 j=1

11. (5HW-18) For the situation of Problem 7 of Section A.7 (with centered
response and standardized inputs x1 and x2 ) do the following concerning linear
predictors
f^ (x1 ; x2 ) = b1 x1 + b2 x2
a) Plot on the same set of axes the two values b1 and b2 as functions of
(or ln if that is easier to compute or interpret) for ridge regression predictors.
b) Plot on the same set of axes the two values b1 and b2 as functions of
(or ln if that is easier to compute or interpret) for lasso regression predictors.

A.11 Section 3.2 Exercises


1. (6E1-11) As it turns out
2 1 1 1 3 2 3
p p p p1 0 p1
2 2 20 2 20
6 0 p1 p1 + p1 7 6 0 p1 p1 72 32
1 0 0
3
6 2 2 20 7 6 2 20 7 1 0 0
6 p1 p1 p1 7 6 p1 p1 74 p1 p1
6 0 2 2
+ 20 7=6 0 2 20 7 0 1
2 0 54 0 2 2
5
6 7 6 7 1 p1 p1
4 p1 p1 p1 5 4 p1 0 p1 5 0 0 0
2 2 20 2 20 2 2 2
0 0 p4 0 0 p4
20 20

Consider a p = 3 linear prediction problem where the matrix of training inputs,


X, is the matrix on the left above and Y 0 = (4; 2; 2; 0; 2).
a) Find the single principal component (M = 1) …tted coe¢ cient vector
^ p cr .
b) Find the single component (M = 1) partial least squares vector of pre-
pls
dictions, Y^ .

285
2. (6HW-11) Beginning in its Section 5.6, Izenman’s book uses an example
where PET yarn density is to be predicted from its NIR spectrum. This is a
problem where N = 21 data vectors xj of length p = 268 are used to predict
the corresponding outputs yi . Izenman points out that the yarn data are to
be found in the pls package in R. (The package actually has N = 28 cases.
Use all of them in the following.) Get those data and make sure that all inputs
are standardized and the output is centered. (Use the N divisor for the sample
variance.)
a) Using the pls package, …nd the 1; 2; 3; and 4-component PCR and PLS
^ vectors.
b) Find the singular values for the matrix X and use them to plot the
function df( ) for ridge regression. Identify values of corresponding to e¤ective
degrees of freedom 1; 2; 3; and 4. Find corresponding ridge ^ vectors.
c) Plot on the same set of axes ^j versus j for the PCR, PLS and ridge vectors
for number of components/degrees of freedom 1. (Plot them as "functions,"
connecting consecutive plotted j; ^j points with line segments.) Then do the
same for 2; 3; and 4 components/degrees of freedom.
d) It is (barely) possible to …nd that the best (in terms of R2 ) subsets of M =
1; 2; 3; and 4 predictors for OLS are respectively, fx40 g,fx212 ; x246 g,fx25 ; x160 ; x215 g,
and fx160 ; x169 ; x231 ; x243 g. Find their corresponding coe¢ cient vectors. Use
the lars package in R and …nd the lasso coe¢ cient vectors ^ with exactly
P ^lasso
268
M = 1; 2; 3; and 4 non-zero entries with the largest possible j (for the
j=1
counts of non-zero entries).
e) If necessary, re-order/sort the cases by their values of yi (from smallest to
largest) to get a new indexing. Then plot on the same set of axes yi versus i and
y^i versus i for ridge, PCR, PLS, best subset, and lasso regressions for number
of components/degrees of freedom/number of nonzero coe¢ cients equal to 1.
(Plot them as "functions," connecting consecutive plotted (i; yi ) or (i; y^i ) points
with line segments.) Then do the same for 2; 3; and 4 components/degrees of
freedom/counts of non-zero coe¢ cients.
f ) Use the glmnet package in R to do ridge regression and lasso regression
here. Find the value of for which your lasso coe¢ cient vector in d) for M = 2
optimizes the quantity
N
X 268
X
2 ^j
(yi y^i ) +
i=1 j=1

(by matching the error sums of squares). Then, by using the trick of Problem
1 Section A.10 employ the package to …nd coe¢ cient vectors ^ optimizing
0 1
XN 268
X 268
X
2
(yi y^i ) + @(1 ) ^j + ^2 A
j
i=1 j=1 j=1

for = 0; :1; :2; : : : ; 1:0. What e¤ective degrees of freedom are associated with
the = 1 version of this? How many of the coe¢ cients j are non-zero for each

286
of the values of ? Compare error sum of squares for the raw elastic net pre-
dictors to that for the linear predictors using (modi…ed elastic net) coe¢ cients

(1 + ) ^enet
;

3. (5HW-18) For the situation of Problem 11 of Section A.10 …nd 1-component


PCA and PLS predictors.

4. (5E1-14) In a SEL prediction problem with N = 22, p = 5 standardized


predictor variables produce input matrix X for centered response vector Y .
22 5 22 1
The singular values of X are

5:970; 5:579; 4:583; 4:132; and :397

and some matrix products are

Y 0 XX 0 Y = 10:27; Y 0 XX 0 XX 0 Y = 312:2; and


Y 0 U = ( :145; :53; :026; :209; :112)

for U from the singular value decomposition of X.


a) What are the e¤ective degrees of freedom associated with ridge regression
in this context for ridge parameter = 2?
p cr
b) Write the M = 1 component PCR prediction vector Y^ as a function
of the …rst column of U , say u1 .
pls
c) Write the M = 1 component PLS prediction vector Y^ as a function of
the vector XX 0 Y , say w.

A.12 Section 4.1 Exercises


1. (6E1-17) Below are N = 8 training cases (xi ; yi ) for x 2 [0; 1] and a
corresponding "design matrix" holding values of the …rst 8 Haar basis functions
(in the order '; ; 1;0 ; 1;1 ; 2;0 ; 2;1 ; 2;2 ; 2;3 ) for the xi . Consider prediction
based on the values of the 8 Haar basis functions.
2 3 2 3 2 p 3
1=16 2 1 1 p 2 0 2 0 0 0
6 3=16 7 6 1 7 6 1 1 2 0 2 0 0 0 7
6 7 6 7 6 p 7
6 5=16 7 6 3 7 6 1 1 2 0 0 2 0 0 7
6 7 6 7 6 p 7
6 7=16 7 6 2 7 6 7
6 1 1 2 0 0 2 0 0 7
x =6 7 6 7
6 9=16 7 8y 1 = 6 4 7 8X1 = 6
p 7
8 1 6 7 6 7 6 1 1 0 p2 0 0 2 0 7
6 11=16 7 6 3 7 6 7
6 7 6 7 6 1 1 0 2 0 0 2 0 7
4 13=16 5 4 5 5 6 p 7
4 1 1 0 2 0 0 0 2 5
p
15=16 4 1 1 0 2 0 0 0 2

a) Find the OLS prediction vector y^ols here. (This is trivial. Note that the
8 columns of X are orthogonal.)

287
b) Find the 1-component PLS prediction vector y^pls here.
c) After normalizing the predictors (so that the <8 norm of each column
of the normalized X is 1) …nd the lasso prediction vector y^lasso for the penalty
parameter = 10. (Center the vector of responses, remove the …rst column of
the X and work with an 8 7 vector of inputs.)
d) Using the normalized version of the predictors referred to in part c) …nd
a vector of coe¢ cients b that minimizes
0
(y Xb) (y Xb) + b0 diag (0; 0; 0; 4; 4; 4; 4) b

2. (5HW-14) Return to the context of Problem 13 of Section A.2. Make up


a matrix of inputs based on x consisting of the values of Haar basis functions
up through order m = 3. (You will need to take the functions de…ned on [0; 1]
and re-scale their arguments to [ ; ]. For a function g : [0; 1] ! < this is the
function g : [ ; ] ! < de…ned by g (x) = g 2x + :5 .) This will produce a
100 16 matrix X h .
ols
a) Find ^ and plot the corresponding y^s as a function of x with the data
also plotted in scatterplot form.
b) Center y and standardize the columns of Xh . Find the lasso coe¢ cient
vectors ^ with exactly M = 2; 4; and 8 non-zero entries with the largest possible
P
16
^lasso (for the counts of non-zero entries). Plot the corresponding y^s as a
j
j=1
function of x on the same set of axes, with the data also plotted in scatterplot
form.

3. (6HW-15) For an N = 100 dataset made up for Problem 17 of Section


A.2 make up a matrix of inputs based on x consisting of the values of Haar basis
functions up through order m = 3. This will produce a 100 16 matrix X h .
ols
a) Find ^ and plot the corresponding y^s as a function of x with the data
also plotted in scatterplot form.
b) Center y and standardize the columns of X h . Find the lasso coe¢ cient
vectors ^ with exactly M = 2; 4; and 8 non-zero entries with the largest possible
P
16
^lasso (for the counts of non-zero entries). Plot the corresponding y^s as a
j
j=1
function of x on the same set of axes, with the data also plotted in scatterplot
form.
4. (5E1-20) Here consider (square integrable) functions on the unit interval
(0; 1). Four such functions are
g1 (x) = I [0 < x < 1] ; g2 (x) = I [0 < x < :5] ; g3 (x) = I [0 < x < :25] ; g4 (x) = I [:5 < x < :75]
(Ignore the values x = :25; :5; and :75. They have 0 probability and are a
R1
nuisance.) Using the "L2 " inner product de…ned by hf; gi 0
f (x) g (x) dx
use the Gram-Schmidt process to make four orthonormal functions from these,
say h1 (x) ; h2 (x) ; h3 (x) ; h4 (x) and say how they are related to the …rst 4 Haar
basis functions.

288
A.13 Section 4.2 Exercises
1. (6HW-11) Find a set of basis functions for the natural (linear outside the
interval ( 1 ; K )) quadratic regression splines with knots at 1 < 2 < < K.

2. (6HW-11) (B-Splines) For a < 1 < 2 < < K < b consider the B-
spline bases of order m, fBi;m (x)g de…ned recursively as follows. For j < 1
de…ne j = a, and for j > K let j = b. De…ne
Bi;1 (x) = I [ i 6x< i+1 ]

(in case i = i+1 take Bi;1 (x) 0) and then


x i i+m x
Bi;m (x) = Bi;(m 1) (x) + Bi+1;(m 1) (x)
i+m 1 i i+m i+1

(where we understand that if Bi;l (x) 0 its term drops out of the expression
above). For a = 0:1 and b = 1:1 and i = (i 1) =10 for i = 1; 2; : : : ; 11,
plot the non-zero Bi;3 (x). Consider all linear combinations of these functions.
Argue that any such linear combination is piecewise quadratic with …rst deriva-
tives at every i . If it is possible to do so, identify one or more
P linear constraints
on the coe¢ cients (call them ci ) that will make qc (x) = ci B3;i (x) linear to
i
the left of 1 (but otherwise minimally constrain the form of qc (x)).

3. (5E1-14) Suppose one desires to …t a function to N data pairs (xi ; yi )


that is linear outside the interval [0; 1], is quadratic in each of the intervals
[0; :5] and [:5; 1] and has a …rst derivative for all x (has no sharp corners).
Specify 4 functions h1 (x) ; h2 (x) ; h3 (x) ; and h4 (x) and one linear constraint
on coe¢ cients 0 ; 1 ; 2 ; 3 ; and 4 so that the function
y= 0 + 1 h1 (x) + 2 h2 (x) + 3 h3 (x) + 4 h4 (x)
is of the desired form.

4. (6E1-11) Consider a toy p = 1 SEL prediction problem with training data


below.
x 1:0 :75 :50 :25 0 :25 :50 :75 1:0
y 0 2 3 5 4 4 2 2 1
Set up an X matrix for ordinary multiple linear regression that could be used
to …t a linear regression spline with knots at 1 = :5; 2 = 0; and 3 = :5. For
your set-up, what linear combination of …tted regression parameters produces
the prediction at x = 0?

5. (5HW-14) For the dataset of Problem 13 of Section A.2 make up a 100 7


matrix X h of inputs based on x consisting of the values of basis functions for
natural cubic splines with knots j
h1 (x) = 1; h2 (x) = x; and for j = 1; 2; : : : ; K 2

289
3 K j 3 K 1 j 3
hj+2 (x) = (x j )+ (x K 1 )+ + (x K )+
K K 1 K K 1

for the K = 7 knot values

1 = 3:0; 2 = 2:0; 3 = 1:0; 4 = 0:0; 5 = 1:0; 6 = 2:0; 7 = 3:0


ols
Find ^ and plot the corresponding natural cubic regression spline, with the
data also plotted in scatterplot form.

6. (6HW-15) For the dataset of Problem 17 of Section A.2 make up a 100 7


matrix X h of inputs based on x consisting of the values of basis functions for
natural cubic splines with knots j of the general form given in Problem 5 above
for the K = 7 knot values

1 = 0; 2 = :1; 3 = :3; 4 = :5; 5 = :7; 6 = :9; 7 = 1:0


ols
Find ^ and plot the corresponding natural cubic regression spline, with the
data also plotted in scatterplot form.

7. (6HW-17) You instructor will provide a dataset giving the maximum num-
bers of home runs hit by a "big league" professional baseball player in the US for
each of 145 consecutive seasons. Consider these as values y1 ; y2 ; : : : ; y145 and
take xi = i. Consider the basis functions for natural cubic splines with knots
j of the general form in Problem 5 above. Using knots j = 2 + (j 1) 14 for
j = 1; 2; : : : ; 11 …t a natural cubic regression spline to the home run data. Plot
the …tted function on the same axes as the data points.

A.14 Section 4.3 Exercises


1. (6HW-13) Consider the space of continuous functions on [0; 1] [0; 1] that
are linear (i.e. are of the form y = a + bx1 + cx2 ) on each of the squares

S1 = [0; :5] [0; :5] ; S2 = [0; :5] [:5; 1] ; S3 = [:5; 1] [0; :5] ; and S4 = [:5; 1] [:5; 1]

a) Find a set of basis functions for the space described above.


b) Your instructor will send you a dataset generated from a model with

E [yjx1 ; x2 ] = 2x1 x2

Find the best …tting linear combination of the basis functions according to least
squares.
c) Describe a set of basis functions for all continuous functions on [0; 1] [0; 1]
that for

0= 0 < 1 < 2 < < K 1 < K = 1 and 0 = 0 < 1 < < M 1 < M =1

are linear on each rectangle Skm = [ k 1 ; k ] [ m 1 ; m ]. How many such basis


functions are needed to represent these functions?

290
A.15 Section 5.1 Exercises
1. (6HW-11) Suppose that a < x1 < x2 < < xN < b and s (x) is a
natural cubic spline with knots at the xi interpolating the points (xi ; yi ) (i.e.
s (xi ) = yi ).
a) Let z (x) be any twice continuously di¤erentiable function on [a; b] also
interpolating the points (xi ; yi ). Show that
Z b Z b
2 2
(s00 (x)) dx 6 (z 00 (x)) dx
a a

(Hint: Consider d (x) = z (x) s (x), write


Z b Z b Z b Z b
2 2 2
(d00 (x)) dx = (z 00 (x)) dx (s00 (x)) dx 2 s00 (x) d00 (x) dx
a a a a

000
and use integration by parts and the fact that s (x) is piecewise constant.)
P
N
2 Rb 2
b) Use a) and prove that the minimizer of (yi h (xi )) + a (h00 (x)) dx
i=1
over the set of twice continuously di¤erentiable functions on [a; b] is a natural
cubic spline with knots at the xi .

2. (5HW-16) For p = 1 suppose that N observations (xi ; yi ) have distinct xi ,


and for simplicity of notation, suppose that x1 < x2 < < xN . Consider the
basis functions for natural cubic splines with K knots j given in Section 4.2:

h1 (x) = 1; h2 (x) = x; and for j = 1; 2; : : : ; K 2

3 K j 3 K 1 j 3
hj+2 (x) = (x j )+ (x K 1 )+ + (x K )+
K K 1 K K 1

Take K = N and j = xj for j = 1; 2; : : : ; N . Obviously, h1 and h2 have


second derivative functions that are everywhere 0 and the products of these
second derivatives with themselves or 2nd derivatives of other basis functions
must have 0 integral from a to b.
Then for j = 1; 2; 3; : : : ; N 2

h00j+2 (x) = 6 (x xj ) I [xj 6 x 6 xN 1]


6 x 6 xN ]
xN xj
+ 6 (x xj ) xN xN 1
(x xN 1) I [xN 1

xN ) I [xN 6 x 6 b]
xN xj xN 1 xj
+ 6 (x xj ) 1
(x xN xN xN 1) + xN xN 1 (x
= 6 (x xj ) I [xj 6 x 6 xN 1 ]
6 x 6 xN ]
xj xN 1 xN xj
+6 x xN xN 1
+ xN 1 xN xN 1
xj I [xN 1

xj ) I [xj 6 x 6 xN 6 x 6 xN ]
xj xN 1
= 6 (x 1] + 6 (x xN ) xN xN 1
I [xN 1

291
Thus for j = 1; 2; 3; : : : ; N 2
R b 00 2
a
hj+2 (x) dx
2
3 3 xN 1 xj
= 12 (xN 1 xj ) + (xN xN 1) xN xN 1
3 2
= 12 (xN 1 xj ) + (xN xN 1 ) (xN 1 xj )
2
= 12 (xN 1 xj ) (xN xj )

and for positive integers 1 6 j < k 6 N 2


Rb
a
h00j+2 (x) h00k+2 (x) dx
Rx R xN 2 (xj xN 1 )(xk xN 1)
= 36 xkN 1 (x xj ) (x xk ) dx + xN 1
(x xN ) (xN xN 1 )2
dx
(xN 1 xk )3 (xN 1 xk )2 (xj xN 1 )(xk xN 1 ) (xN 1 xN ) 3
= 36 3 + (xk xj ) 2 36 (xN xN 1 )2 3
2
= 6 (xN 1 xk ) (2 (xN 1 xk ) + 3 (xk xj )) 12 (xj xN 1 ) (xk xN 1 ) (xN 1 xN )
2
= 6 (xN 1 xk ) (2xN 1 + xk 3xj ) + 12 (xN 1 xk ) (xN 1 xj ) (xN xN 1 )

Do the smoothing spline computations for the dataset of Problem 11 Section


A.9 "from scratch" using the above representations of the entries of the matrix
. That is,
a) Compute the 11 11 matrix .
b) For = 1; 10 1 ; 10 2 ; 10 3 ; 10 4 ; 10 5 ; and 0 compute the smoother
matrices S and the e¤ective degrees of freedom.
c) Find the penalty matrix K and its eigen decomposition. Plot as functions
of xi (or just i assuming that you have ordered the values of x) the entries of
the eigenvectors of this matrix (connect successive points with line segments so
that you can see how these change in character as the corresponding eigenvalue
of K increases— the corresponding eigenvalue of S decreases). Which <11
components of the observed Y are most suppressed in the smoothing operation?
Can you describe them in qualitative terms?

A.16 Section 5.2 Exercises


1. (6HW-11) A p = 2 dataset provided with these notes consists of N = 441
2
training vectors (x1i ; x2i ; yi ) for the distinct pairs (x1i ; x2i ) in the set f 1:0; :9; : : : ; :9; 1:0g
where the yi were generated as
sin (10 (x1i + x2i ))
yi = + i
10 (x1i + x2i )
2
(with the convention that sin (0) =0 = 1) for iid N 0; (:02) variables i .
a) Why should you expect MARS to be ine¤ective in producing a predictor
in this context? (You may want to experiment with the earth package in R
trying out MARS.)
b) Fit a thin plate spline to these data using the Tps function in the fields
package. Contour plot your results.

292
2. (6HW-13) A p = 2 dataset provided with these notes consists of N = 81
2
training vectors (x1i ; x2i ; yi ) for pairs (x1i ; x2i ) in the set f 2:0; 1:5; : : : ; 1:5; 2:0g
where the yi were generated as

yi = x21i + x22i = 1 + x21i + x22i + i

2
for iid N 0; (:1) variables i . Use it in the following.
a) Why should you expect MARS to be ine¤ective in producing a predictor
in this context? (You may want to experiment with the earth package in R
trying out MARS.)
b) Fit a thin plate spline to these data using the Tps function in the fields
package.

A.17 Section 5.3 Exercises


1. (6E1-13) Return to the scenario of Problem 3 of Section A.10.
p enalty
a) Find Y^ 2 <8 optimizing
8
X
0 2 2 2 2
(Y v) (Y v) + hv;x2 i + 2 hv; x3 i + hv; x4 i +4 v; xj
j=5

over choices of v 2 <8 .


b) Find an 8 8 smoother matrix S corresponding to the penalty in a) (a
p enalty
matrix so that for any Y 2 <8 a Y^ optimizing the form in part a) is SY )
and plot values in the 4th row of this matrix against x :500.

A.18 Section 6.1 Exercises


1. (6HW-11) Suppose that with p = 1,

sin (12 (x + :2))


yjx N ;1
x + :2

and N = 101 training data pairs are available with xi = (i 1) =100 for i =
1; 2; : : : ; 101. A dataset like this is provided with these notes. Use it in the
following.
a) Fit all of the following using …rst 5 and then 9 e¤ective degrees of freedom

a cubic smoothing spline,

a locally weighted linear regression smoother based on a normal density


kernel, and
a locally weighted linear regression smoother based on a tri-cube kernel.

293
Plot for 5 e¤ective degrees of freedom all of yi and the 3 sets of smoothed
values against xi . Connect the consecutive (xi ; y^i ) for each …t with line segments
so that they plot as "functions." Then redo the plotting for 9 e¤ective degrees
of freedom.
b) For all of the …ts in a) plot as a function of i the coe¢ cients ci applied to
P
101
the observed yi in order to produce f^ (x) = ci yi for x = :05; :1; :2; :3. (Make
i=1
a di¤erent plot of three curves for 5 degrees of freedom and each of the values
x (four in all). Then redo the plotting for 9 degrees of freedom.)

2. (6HW-13) Suppose that with p = 1,

1:5 2
yjx N sin + exp( 2x); (:5)
x + :1

(the conditional standard deviation is :5) and N = 101 training data pairs are
available with xi = (i 1) =100 for i = 1; 2; : : : ; 101. A dataset like this is
provided with these notes. Use it in place of the dataset described in Problem
1 above and redo all of that problem.

3. (6E1-11) Suppose that P is such that x has pdf

3 1 1 1
p (x) = I 0<x< + I <x<1 on [0; 1]
2 2 2 2

and the conditional distribution of yjx is N(x; 1). Suppose training data (xi ; yi )
for i = 1; : : : ; N are iid P and that with the standard normal pdf, one uses
the Nadaraya-Watson estimator for E[yjx = :5] = :5,

P
N
yi (:5 xi )
i=1
f^ (:5) =
PN
(:5 xi )
i=1

Use the law of large numbers and the continuity of the ratio function and write
out the (in probability) limit for f^ (:5) in terms of a ratio of two de…nite integrals
and then argue that the limit is not :5.

4. (6E1-11) Consider a toy problem where one is to employ locally weighted


straight line regression smoothing based on the Epanechnikov quadratic kernel
in a p = 1 context with training data given in Problem 4 of Section A.11. Using
a bandwidth of = :5, give a small (augmented) dataset for which ordinary
simple linear regression (OLS) will produce the smoothed prediction at x = 0
(that is, f^:5 (0)) for the original training data.

5. (6E1-13) Return to the scenario of Problem 1 of Section A.17. If one accepts


the statistical conventional wisdom that (generalized) "spline" smoothing is

294
nearly equivalent to kernel smoothing, in light of your plot in b) of that problem
identify a kernel that might provide smoothed values similar to those for the
penalty used there. (Name a kernel and choose a bandwidth.)

6. (6HW-15) In a p = 1 smoothing context like that of Problem 2 of Section


A.15, where N = 11 training data pairs (xi ; yi ) have x1 = 0; x2 = :1; x3 =
:2; : : : ; x11 = 1:0, consider locally weighted linear regression based on a Gaussian
kernel.
a) Compute and plot e¤ective degrees of freedom as a function of the band-
width, . (It may be most e¤ective to make the plot with on a log scale or
some such.) Do simple numerical searches to identify values of corresponding
to e¤ective degrees of freedom 2:5; 3; 4;and 5.
b) Compare the smoothing matrix "S " for Problem 2 of Section A.15 for
4 e¤ective degrees of freedom to the matrix "L " in the present context also
producing 4 e¤ective degrees of freedom. What is the 11 11 matrix di¤erence?
Plot, as a function of column index, the values in the 1st, 3rd, and 5th rows of
the two matrices, connecting with line segments successive values from a given
row. (Connect consecutive plotted points for a given row of a given matrix and
use di¤erent plotting symbols, colors, and/or line weights and types so that you
can make qualitative comparisons of the nature of these.)

7. (6E1-17) Consider a 1-d N-W smoothing problem on [0; 2] for values of


xj = :1 (j 1) for j = 1; 2; : : : ; 21. Suppose that one uses weights
8
< :5 if i = j
w (ji jj) = :25 if ji jj = 1
:
0 otherwise

to make smoothed values


X21
y^j = w (ji jj) yi
i=1

except for the "edge" cases where we’ll take y^1 = :5y1 + :5y2 and y^21 = :5y20 +
:5y21 .
a) For S the smoother matrix to be applied to a vector of observations Y =
(y1 ; y2 ; : : : ; y21 ) to get smoothed values, what are e¤ective degrees of freedom?
b) What are (except for the "edge" cases, now with indices j = 1; 2; 20; and
21) the weights, say w2 (ji jj), used to make "doubly smoothed" values via
two successive applications of the original smoothing. That is, forY^ = SSY ?
What (approximately, you don’t need to get exactly the right terms for the edge
cases) are e¤ective degrees of freedom for SS?
c) Consider local linear regression in this same context, where the original
weights are used and thus (except for edge cases) the slope and intercept used
to make y^j are determined by minimizing
2 2 2
:25 (yj 1 ( 0 + 1 xj 1 )) +:5 (yj ( 0 + 1 xj )) +:25 (yj+1 ( 0 + 1 xj+1 ))

295
(or equivalently 4 times this quantity). Ultimately (again except for edge
cases) what weights go into a smoother matrix for an "equivalent N-W ker-
nel smoother" in this case? (It may be helpful to recall that OLS for SLR
PN P
N
2
produces b1 = (yi y) (xi x) = (xi x) and b0 = y b1 x.)
i=1 i=1
d) Use R to compute the nth power of S for a reasonably large n. Why is
this form really no surprise?

8. (6E1-15) Here is (a rounded version of) a smoother matrix S , for a N-W


smoother with Gaussian kernel for data with x0 = (0; 0:1; 0:2; : : : ; 0:8; 0:9; 1:0).
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 0.47 0.35 0.14 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[2,] 0.26 0.35 0.26 0.11 0.02 0.00 0.00 0.00 0.00 0.00 0.00
[3,] 0.10 0.23 0.31 0.23 0.10 0.02 0.00 0.00 0.00 0.00 0.00
[4,] 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00 0.00 0.00
[5,] 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00 0.00
[6,] 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00
[7,] 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00
[8,] 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02
[9,] 0.00 0.00 0.00 0.00 0.00 0.02 0.10 0.23 0.31 0.23 0.10
[10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.11 0.26 0.35 0.26
[11,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.14 0.35 0.47

a) Approximately what bandwidth ( ) and e¤ective degrees of freedom are


associated with this matrix?
b) A rounded version of the matrix product S S is below. Thinking
of this product as itself a smoother matrix, what might you think of as "an
equivalent kernel"? (Give values of weights w (ji jj) for i; j indices 1 to 11 so
P11
that y^j i=1 w (ji jj) yi .)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 0.33 0.32 0.21 0.10 0.03 0.01 0.00 0.00 0.00 0.00 0.00
[2,] 0.24 0.28 0.24 0.14 0.07 0.02 0.01 0.00 0.00 0.00 0.00
[3,] 0.14 0.21 0.24 0.20 0.12 0.06 0.02 0.01 0.00 0.00 0.00
[4,] 0.06 0.13 0.19 0.22 0.19 0.12 0.06 0.02 0.01 0.00 0.00
[5,] 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02 0.01 0.00
[6,] 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02 0.01
[7,] 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02
[8,] 0.00 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.13 0.06
[9,] 0.00 0.00 0.00 0.01 0.02 0.06 0.12 0.20 0.24 0.21 0.14
[10,] 0.00 0.00 0.00 0.00 0.01 0.02 0.07 0.14 0.24 0.28 0.24
[11,] 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.10 0.21 0.32 0.33

c) Here is a bit of R code and more output for this problem.


>round(eigen(S)$values,3)
[1] 1.000 0.921 0.730 0.509 0.317 0.176 0.087 0.038 0.015 0.005 0.001
>round(eigen(S)$vectors[,1],3)

296
[1] -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302
-0.302

While S is not symmetric, it is non-singular and has 11 real eigenvalues


1 = d1 > d2 > > d11 > 0 with corresponding linearly independent unit eigen-
vectors u1 ; u2 ; : : : ; u11 such that S uj = dj uj . So with U = (u1 ;u2 ; : : : ;u11 )
and D = diag (d1 ; d2 ; : : : ; d11 ) we have S U = U D and S = U DU 1 . The
output above provides the eigenvalues and u1 .
The nth power of S , S n , has a limit. What is it? Argue that your answer is
correct. What are the corresponding limits of S n Y and of the e¤ective degrees
of freedom of S n ?

9. (6HW-17) Consider again the home run dataset of Problem 7 Section A.13.
Fit with …rst approximately 5 and then 9 e¤ective degrees of freedom

a cubic smoothing spline (using smooth.spline()), and


a locally weighted linear regression smoother based on a tri-cube kernel
(using loess(...,span= ,degree=1)) to the home run data.

Plot for approximately 5 e¤ective degrees of freedom all of yi and the 2 sets
of smoothed values against xi . Connect the consecutive (xi ; y^i ) for each …t with
line segments so that they plot as "functions." Then redo the plotting for 9
e¤ective degrees of freedom.

10. (6HW-19) Consider again the fake data of Problem 2 of Section A.15.
Carry out the steps of Problem 9 above on this dataset.

11. (5E1-14) Below is a particular smoother matrix, S, for p = 1 data at


values x = 0; :1; :2; :3; : : : ; :9; 1:0 (The labeling convention used below is x1 =
0; x2 = :1; x3 = :2; : : : ; x11 = 1:0.)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] .721 .265 .013 .000 .000 .000 .000 .000 .000 .000 .000
[2,] .210 .570 .210 .010 .000 .000 .000 .000 .000 .000 .000
[3,] .010 .208 .564 .208 .010 .000 .000 .000 .000 .000 .000
[4,] .000 .010 .208 .564 .208 .010 .000 .000 .000 .000 .000
[5,] .000 .000 .010 .208 .564 .208 .010 .000 .000 .000 .000
[6,] .000 .000 .000 .010 .208 .564 .208 .010 .000 .000 .000
[7,] .000 .000 .000 .000 .010 .208 .564 .208 .010 .000 .000
[8,] .000 .000 .000 .000 .000 .010 .208 .564 .208 .010 .000
[9,] .000 .000 .000 .000 .000 .000 .010 .208 .564 .208 .010
[10,] .000 .000 .000 .000 .000 .000 .000 .010 .210 .570 .210
[11,] .000 .000 .000 .000 .000 .000 .000 .000 .013 .265 .721

a) What e¤ective degrees of freedom are associated with this smoother?


b) Approximately what bandwidth is associated with this smoother?

297
c) For training data as below, what is f^ (:4)?

y 1 3 2 4 2 6 7 9 7 8 6
x 0 :1 :2 :3 :4 :5 :6 :7 :8 :9 1

A.19 Section 6.2 Exercises


1. (6HW-13) Apply 2-d locally weighted regression smoothing on the dataset
of Problem 1 of Section A.16 using the loess() function in R. "Surface plot/
perspective plot" this for 2 di¤erent choices of smoothing parameters along with
both the raw data and the mean function. (If nothing else, JMP will do this under
its "Graph" menu.)

2. (6HW-13) Apply 2-d locally weighted regression smoothing on the dataset


of Problem 2 of Section A.16 using the loess() function in R. "Surface plot/
perspective plot" this for 2 di¤erent choices of smoothing parameters along with
both the raw data and the mean function.

3. (5E1-16) Consider the small (N = 5) training set for a p = 2 SEL prediction


problem given in the table below and represented in the corresponding plot.

x1 x2 y
1 0 4
0 1 2
0 0 0
0 1 8
1 0 6

a) Find the OLS predictor of y of the form y^ = f^ (x) = b0 +b1 x1 +b2 x2 . Show
"by hand"qcalculations. Note
q that predictors x1 and x2 can be standardized
q
5 5 1
to x01 = 0
2 x1 and x2 =
00
2 x2 and made orthonormal as x1 = 2 x1 and
q
x002 = 12 x2 .
b) Consider the penalized least squares problem of minimizing (for ortho-
normal predictors x001 and x002 ) the quantity
5
X
00 00 2
(yi ( 0 + 1 x1 + 2 x2 )) + (j 1j +j 2 j)
i=1

Plot on the same set of axes minimizers ^1lasso and ^2lasso as functions of .
pls
c) Evaluate the …rst PLS component z 1 in this problem and …nd ^ 2
2
< (for centered y values and standardized predictors so that the matrix of
pls
predictors x0 is 5 2) so that Y^ pls = X ^ for a 1-component PLS predictor.
Show "by hand" calculations.

298
d) Since standardization requires multiplying x1 and x2 by the same con-
stant, the 3-nn predictor here is the same whether computed on the raw (x1 ; x2 )
values or after standardization. What is it? (It takes on only a few di¤erent
values. Give those values and specify the regions in which they pertain in terms
of the original variables.)
e) (Again, since standardization requires multiplying x1 and x2 by the same
constant) 2-d kernel smoothing methods applied on original and standardized
scales are equivalent. So consider locally weighted bivariate regression done
on the original scale using the Epanechnikov quadratic kernel and bandwidth
= 1. Write out (in completely explicit terms) the sum to be optimized by
choice of constants 0 ; 1 ; 2 in order to produce a prediction of the form y^ =
1 1
0 + 1 x1 + 2 x2 for the input vector 2 ; 2 . What is the value of this prediction?

A.20 Section 7.1 Exercises


1. (6HW-11) Consider again the situation of Problem 1 Section A.16. If
you were going to use a structured kernel and 1-d smoothing to produce a
predictor here, what form for the matrix A would work best? What would be
a completely ine¤ective choice of a matrix A? Use the good choice of A and
produce a corresponding set of predictions.

A.21 Section 8.1 Exercises


1. (6HW-11) Use JMP to do neural net …tting (with logistic sigmoidal function,
( )) for the dataset in Problem 1 of Section A.18.
a) Find a neural net with an error sum of squares about like those for
the 9 degrees of freedom …ts in Section A.18. Provide appropriate JMP re-
ports/summaries. You’ll be allowed to vary the number of hidden nodes for a
single-hidden-layer architecture and to vary a weight for a penalty made from a
sum of squares of coe¢ cients. Each run of the routine makes several random
starts of an optimization algorithm. Extract the coe¢ cients from the JMP run
and use them to plot the …tted function of x that you settle on. How does this
compare to the plotted …ts produced in Problem 1 of Section A.18?
b) Try to reproduce what you got from JMP in a) using the R package
neuralnet (or any other you …nd that to work better).

2. (6HW-13) Carry out the steps of Problem 1 above on the data of Problem
2 of Section A.18.

3. (5HW-14) Return to the dataset of Problem 2 of Section A.16. Use the


neural network routines in JMP to …t the data to get an error sum of squares
like you got in Problem 2 of Section A.16. How complicated does the network
architecture have to be in order to do a good job …tting these data? Contour
or surface plot your …ts.

4. (5HW-14) Use all of MARS, thin plate splines, local kernel-weighted lin-
ear regression, and neural nets to …t predictors to both the noiseless and the

299
noisy "hat data." For those methods for which it’s easy to make contour or
surface plots, do so. Which methods seem most e¤ective on this particular
dataset/function?

5. (6E1-11) Consider a p = 2 prediction problem with continuous univariate


output y. Two possible methods of prediction are under consideration, namely

1. a neural net with single hidden layer and M = 2 hidden nodes (and single
output node) using (u) = 1= (1 + exp (u)) and g (v) = v, and
2. a projection pursuit regression predictor with M = 2 summands gm (w0m x)
(based on cubic smoothing splines).

a) Argue carefully that in general, possibility 2 provides more ‡exibility in


…tting than possibility 1.
b) Note that unit vectors in <2 can be parameterized by a single real variable
2 ( ; ]. How would you go about choosing a version of possibility 2
that might be expected to provide only "about as much ‡exibility in …tting"
as possibility 1? (This will have to amount to some speculation, but make a
sensible suggestion based on "parameter counts.")

6. (6E1-13) Consider approximations to "simple functions" (linear combina-


tions of step functions) using single layer feed-forward neural network forms.
First say how you might produce an approximation of a function on <1 that
is an indicator function of any interval, I = (a; b) (…nite or in…nite), say
I [a < x < b]. Then argue that it’s possible to approximate any function of
PM
the form g (x) = cl I [al < x < bl ] on <1 using a neural network form.
l=1

7. (6HW-15) Again use the dataset of Problem 17 of Section A.2.


a) Fit with approximately 5 and then 9 e¤ective degrees of freedom
i) a cubic smoothing spline (using smooth.spline()) , and
ii) a locally weighted linear regression smoother based on a tri-cube
kernel (using loess(...,span=,degree=1)).
Plot for approximately 5 e¤ective degrees of freedom all of yi and the 2 sets
of smoothed values against xi . Connect the consecutive (xi ; y^i ) for each …t with
line segments so that they plot as "functions." Then redo the plotting for 9
e¤ective degrees of freedom.
b) Produce a single hidden layer neural net …t with an error sum of squares
about like those for the 9 degrees of freedom …ts using nnet(). You may need
to vary the number of hidden nodes for a single-hidden-layer architecture and
vary the weight for a penalty made from a sum of squares of coe¢ cients in order
to achieve this. For the function that you ultimately …t, extract the coe¢ cients
and plot the …tted mean function. How does it compare to the plots made in
a)?
c) Each run of nnet() begins from a di¤erent random start and can produce
a di¤erent …tted function. Make 5 runs using the architecture and penalty

300
parameter (the "decay" parameter) you settle on for part b) and save the 100
predicted values for the 10 runs into 10 vectors. Make a scatterplot matrix of
pairs of these sets of predicted values. How big are the correlations between the
di¤erent runs?
d) Use the avNNet() function from the caret package to average 20 neural
nets with your parameters from part b).

8. (6E2-15) Consider a p = 3 predictor 2-class neural net classi…er, with a


single hidden layer having only 2 nodes.
a) Provide the network diagram for this situation and a corresponding like-
lihood term that might be associated with a training vector (x1i ; x2i ; x3i ; yi )
where y has the 1 versus 1 coding.
b) Suppose that the inputs have been standardized, and completely specify
a lasso-motivated jointly continuous prior distribution for the model parame-
ters that might be expected to promote posterior sparsity/near-sparsity for the
model parameters.

9. (5E1-16) Below is a toy diagram for a very simple single hidden layer
"neural network" mean function of x 2 < (i.e. p = 1). Suppose that out-
puts/responses y are essentially 3 if x < 17 and essentially 8 if 17 < x < 20, and
essentially 3 if x > 20. Identify numerical values of neural network parameters
01 ; 11 ; 02 ; 12 ; 0 ; 1 ; 2 for which the corresponding predictor is a good ap-
proximation of the output mean function. (Here, (u) = 1= (1 + exp ( u)) and
g (z) = z.)

10. (5HW-16) Carry out the steps of Problem 7 of this section using the
dataset of Problem 13 of Section A.2.

11. (5E1-14) A two-hidden-layer (with 2 nodes per hidden layer) single-input-


single-output feed-forward neural network with "activation function" (u) =
tanh (u) for a p = 1 prediction problem is …t to a particular N = 100 training
set. In notation like that used on Figure 25 this …tting results in
2 2 2 2
^ 01 = :005; ^ x1 = :082; ^ 02 = :023; ^ x2 = :036
1 1 1 1 1 1
^ 01 =:0007; ^ 11
= :0004; ^ 21
= :0023; ^ 02 = :0037; ^ 12 = :0155; ^ 22 = :0356
^0 = 1413; ^1 = 50513; ^2 = 850321

301
Plot the SEL predictor of y implied by this set of …tted coe¢ cients, f^ (x).

A.22 Section 8.2 Exercises


1. (6HW-11) Consider radial basis functions built from kernels. In particular,
consider the choice D (t) = (t), the standard normal pdf.
a) For p = 1, plot on the same set of axes the 11 functions

jx jj j 1
K (x; j) =D for j = j = 1; 2; : : : ; 11
10

…rst for = :1 and then (in a separate plot) for = :01. Then make plots on
the a single set of axes the 11 normalized functions

K (x; j )
N j (x) =
P
11
K (x; l )
l=1

…rst for = :1, then in a separate plot for = :01.


b) For p = 2, consider the 121 basis functions

kx ij k i 1 j 1
K (x; ij ) =D for ij = ; i = 1; : : : ; 11 and j = 1; : : : ; 11
10 10

Make contour plots for K:1 (x; 6;6 ) and K:01 (x; 6;6 ). Then de…ne

K (x; ij )
N ij (x) =
P
11 P11
K (x; lm )
m=1 l=1

Make contour plots for N:1;6;6 (x) and N:01;6;6 (x).

2. (6HW-13) Consider again the data of Problem 1 of Section A.18. Fit


(training-set-dependent) radial basis function networks based on the standard
normal pdf ,
101
X kx xi k
f (x) = 0 + iK (x; xi ) for K (x; xi ) =
i=1

to these data for two di¤erent values of . Then de…ne normalized versions of
the radial basis functions as
K (x; xi )
N i (x) =
P
101
K (x; xm )
m=1

and redo the …tting using the normalized versions of the basis functions.

302
3. (5HW-14) Fit radial basis function networks based on the standard normal
pdf ,
81
X kx xi k
f (x) = 0 + iK (x; xi ) for K (x; xi ) =
i=1

to the data of Problem 2 Section A.16 for two di¤erent values of . Then de…ne
normalized versions of the radial basis functions as
K (x; xi )
N i (x) =
P
81
K (x; xm )
m=1

and redo the …tting using the normalized versions of the basis functions.

4. (6HW-15) Fit radial basis function networks based on the standard normal
pdf ,
51
X m 1 jx zj
f (x) = 0 + mK x; for K (x; z) =
m=1
50

to the dataset of Problem 17 of Section A.2 for two di¤erent …xed values of .
De…ne normalized versions of the radial basis functions as
K x; i501
N i (x) =
P
51
K x; m50 1
m=1

and redo the …tting using the normalized versions of the basis functions.

A.23 Section 9.1 Exercises


1. (5E1-18) Use the training set in Problem 3 of Section A.2 without bothering
to center y, carefully build a binary regression tree with 6 …nal nodes (employing
5 splits, each at one of the values :2; :35; :5; :65; and :8). For each split, give the
associated SSE provided by the split. Make a tree diagram for representing
your development. If SSE is penalized by = 6 times the number of tree nodes,
which of the trees met in your construction is most attractive?

2. (6E2-11) Below is a small p = 2 classi…cation training set (for K = 2 classes)


displayed in graphical and tabular forms (circles are class 1 and squares are

303
class 1).

a) Using empirical misclassi…cation rate as your splitting criterion and stan-


dard forward selection, …nd a reasonably simple binary tree classi…er that has
training error rate 0. Provide the tree diagram and sketch the corresponding
rectangles on a plot like the one above.
b) For every sub-tree, T , of your full binary tree above, …nd the size (number
of …nal nodes) of the sub-tree, jT j, and the empirical error rate of its associated
classi…er.
c) Using the values from b), …nd for every > 0 a sub-tree of your full tree
minimizing
C = jT j + err

3. (6E1-13) Consider the p = 1 prediction problem with N = 6 and training


data as below.
y 1:6 :4 3:5 1:5 5 6
x 1 2 3 4 5 6
Forward selection of binary trees for SEL prediction produces the sequence
of trees represented below. If one determines to prune back from the …nal
tree in optimal fashion, there is a nested sequence of subtrees that are the only
possible optimizers of C (T ) = jT j + SSE(T ) for positive . Identify that

304
nested sequence of sub-trees of Tree 5 below.

4. (5HW-14) Your instructor will provide an N = 200 training set generated


from the "Friedman1" benchmark model (using the mlbench package in R. For
purposes of assessing test error, you will also be given a size 5000 test set
generated from this model.
a) Fit a single regression tree to the dataset. Prune the tree to get the
best sub-trees of all sizes from 1 …nal node to the maximum number of nodes
produced by the tree routine. Compute and plot cost-complexities C (T ) as a
function of .
b) Evaluate a "test error" (based on the size 5000 test set) for each sub-tree
identi…ed in a). What size sub-tree looks best based on these values?
c) Do 5-fold cross-validation on single regression trees to pick an appropriate
tree complexity (to pick one of the sub-trees from a)). How does that choice
compare to what you got in b) based on test error for the large test set?

5. (5HW-14) Return to the context of Problem 7 Section A.2 and Problem 1


Section A.5. Fit a classi…cation tree to the dataset using 5-fold cross-validation
to choose tree size based on cost-complexity tuning. Make a plot like that
required in the earlier problems showing the regions where the tree classi…es to
each of the 4 classes. Evaluate the (conditional on the training set) test error
rate for this tree.

6. (5E1-16) Below is a representation of a binary regression tree. Find a


subtree of this tree that minimizes the cost C (T ) = jT j + SSE(T ) for = :01.
(There are 7 subtrees to consider.) Identify the …nal nodes for the optimal

305
subtree.

7. (5E2-14) Consider a context in which for a continuos input x 2 [0; 1] the


conditional mean function E[yjx] is strictly increasing. Argue carefully that any
binary tree predictor must be positively biased at x = 0 and negatively biased
at x = 1 in this context.

A.24 Section 10.1 Exercises


1. (5E1-18) Use the training set in Problem 3 of Section A.2 and without
bothering to center y, consider bagging a SEL predictor for y of the form

f^ (x) = b1 I [x < :5] + b2 I [x > :5]

…t by OLS. Below, B = 10 bootstrap samples are represented in terms of case


indices and the corresponding values of b1 and b2 are provided. Find an OOB
MSPE for a bagged predictor f^bag
10
.

Bootstrap Sample b1 b2
2; 3; 5; 5; 5; 6 7:000 7:000
2; 2; 3; 4; 5; 6 6:000 9:333
1; 1; 1; 1; 2; 6 :8000 10:000
1; 1; 3; 4; 5; 6 3:333 9:333
1; 1; 2; 3; 5; 6 3:500 8:000
1; 1; 2; 4; 4; 5 1:333 10:000
2; 2; 2; 3; 5; 5 4:000 6:000
2; 3; 3; 4; 4; 5 8:000 10:000
1; 2; 2; 2; 3; 4 3:200 12:000
2; 3; 4; 5; 6; 6 7:000 8:666

306
2. (6E1-19) All cases in a particular N = 100 training set are distinct/di¤erent.
Suppose that one is going to make "weighted bootstrap samples" of size 100,
using not equal weights of .01 on each case in the training set, but rather
P
100
weights/probabilities w1 ; w2 ; : : : ; w100 (where each wi > 0 and wi = 1).
i=1
a) What is the probability that a case with wi = :02 is included in a
particular weighted-bootstrap sample of size 100?
b) Suppose that for b = 1; 2; : : : ; B the corresponding weighted bootstrap
sample is T b and the sample mean of responses in this sample is y b . Further,
P
B
let ybag = B1 y b . Find an expression for lim ybag and argue carefully that
b=1 B!1
your expression is correct.

A.25 Section 10.2 Exercises


1. (6E1-13) Consider a p = 1 prediction problem for x 2 [0; 1] and random
forest predictor f^B based on a training set of size N = 101 with xi = (i 1) =100
for i = 1; : : : ; 101 and nm in = 5 (so no split is made in creating a single tree pre-
dictor f^ b that would produce a leaf representing fewer than 5 training points).
a) Use simulation to approximate the expected value of the arithmetic
mean of the 5 largest of 101 values drawn at random with replacement from
f:00; :01; : : : ; 1:00g. Call this value .
b) Consider the bias of prediction at x = 1:00, namely

E f^B (1:00) 1:00

under a model where Eyi = xi . Use your value from a) to argue carefully
that this bias is clearly negative.

2. (5HW-14) Return to the context of Problem 4 of Section A.23.


a) Use the randomForest package in R to make a bagged version of a re-
gression tree (based on, say, B = 500). What is its OOB error? How does that
compare to a test error based on the size 5000 test set?
b) Make a random forest predictor using randomForest (use again B = 500).
What is its OOB error? How does that compare to a test error based on the
size 5000 test set?

3. (6E1-17) If, in a classi…cation problem, all N inputs xi 2 <p are distinct,


a default random forest (one with nm in = 1) will typically have err = 0 (a 0
training error rate for 0-1 loss) unless a "small" maximum tree depth is set.
a) Why is this? Explain.
b) Does this mean that the OOB error rate will be 0? Explain.
c) Does this mean that the OOB error rate is unreliable as a representing
likely random forest performance? Explain.

4. (6E2-15) Below is a small p = 2 classi…cation training set (for 2 classes)


displayed in graphical and tabular forms (circles are class 1 and squares are

307
class 1). A bootstrap sample is made from this dataset and is indicated in the
table and by counts next to plotted points for those points represented in the
sample other than once. This sample is used to create a tree in a random forest
with 4 end nodes (accomplished by 3 binary splits). A random choice is made
for which of the 2 variables to split on at each opportunity and turns out to
produce the sequence "x1 then x1 then x2 ."

a) Identify the resulting tree by rectangles on the plot and provide the value
of y^ for each rectangle.
b) Which out-of-bag points are misclassi…ed by this particular tree?

5. (5E1-16) A variant of the random forest algorithm begins by making a


random p-dimensional rotation of the predictors of a bootstrap sample before
building the tree for that bootstrap sample, f^ b . (You may, for example, think
of this in terms of the p = 2 case for inputs x, and rotating the 2-d coordinate
axes around the origin before doing splitting based on the 2 rotated axes.) What
about this innovation is attractive and what about it is unattractive?

A.26 Section 11.1 Exercises


1. (6HW-11) Below is a very small sample of …ctitious p = 1 training data.

x 1 2 3 4 5
y 1 4 3 5 6

Consider a toy Bayesian model averaging problem where what is of interest


is a prediction for y at x = 3. Suppose that under Model 1, the (xi ; yi ) are iid
where x is Discrete Uniform on f1; 2; 3; 4; 5g and yjx is Binomial(10; p (x)) for
x a
p (x) = b (for the standard normal cdf and b > 0). In this model, the
3 a
quantity of interest is 10 b .
On the other hand, suppose that under Model 2, the (xi ; yi ) are iid where
x is Discrete Uniform on f1; 2; 3; 4; 5g and yjx is Binomial(10; p (x)) for p (x) =
1
1 (c+1)x (for some c > 0). In this model, the quantity of interest is 10
1
1 3(c+1) .

308
For prior distributions, suppose that for Model 1 a are b are a priori in-
dependent with a U(0; 6) and b 1 Exp(1), while for Model 2, c Exp(1).
Further suppose that prior probabilities on the models are (1) = (2) = :5.
Compute (almost surely you’ll have to do this numerically) posterior means of
the quantities of interest in the two Bayes models, posterior probabilities for the
two models, and the overall predictor of y at x = 3.

2. (6E2-11) Consider a Bayesian model averaging problem where x takes


values in f0; 1g and y takes values in f0; 1g. The quantity of interest is

P [y = 1jx = 1] =P [y = 0jx = 1]

and there are M = 2 models under consideration. We’ll suppose that joint
probabilities for (x; y) are as given in the tables below for the two models for
some p 2 (0; 1) and r 2 (0; 1)

Model 1 Model 2
ynx 0 1 ynx 0 1
1 :25 :25 1 (1 r) =2 r=2
0 (1 p) =2 p=2 0 :25 :25

so that under Model 1, the quantity of interest is :5=p and under Model 2, it is
r=:5. Suppose that under both models, training data (xi ; yi ) for i = 1; : : : ; N are
iid. For priors, suppose that in Model 1 a priori p Beta(2; 2) and suppose that
in Model 2 a priori r Beta(2; 2). Further, suppose that the prior probabilities
of the two models are (1) = (2) = :5.
Find the posterior probabilities of the 2 models, (1jT ) and (2jT ) and the
Bayes model average squared error loss predictor of P [y = 1jx = 1] =P [y = 0jx = 1].
(You may think of the training data as summarized in the 4 counts N(x;y) =number
of training vectors with value (x; y).)

3. (6E1-13) Consider a simple Bayes model averaging prediction problem with


iid training data (xi ; yi ) where xi 2 f0; 1g and we assume that yi = (xi ) + "i
for "i N (0; 1). Two models are contemplated. Model 1 says that (0) =
2
(1) = and a priori N 0; (10) . Model 2 says that (0) and (1)
2 2
are a priori independent with both (0) N 0; (10) and (1) N 0; (10) .
Assume that a priori the two models are equally likely. Training pairs (xi ; yi )
are (0; 5) ; (0; 7) ; (0; 6) ; (1; 12). Find an appropriate predicted value of y if x = 1.
You will …nd likely it helpful to recall that if conditioned on , observations
z1 ; : : : ; zn are iid N( ; 1) and is itself N 0; 2 , then conditioned on z1 ; : : : ; zn ,
n 1 1
is N n+ 12
z; n + 2

4. (6E1-15) Below are tables specifying two discrete joint distributions for
(x; y) that we’ll call Model 1 and Model 2. Suppose that N = 2 training cases

309
(drawn iid from one of the models) are (x1 ; y1 ) = (2; 2) and (x2 ; y2 ) = (3; 3).

Model 1 Model 2
y x 1 2 3 y x 1 2 3
3 0 :125 :125 3 0 0 :1
2 0 :125 :125 2 :1 :2 :1
1 :125 :125 0 1 :1 :2 :1
0 :125 :125 0 0 :1 0 0

Suppose further that prior probabilities for the two models are 1 = :3 and
2 = :7.
a) Find the posterior probabilities of Models 1 and 2.
b) Find the "Bayes model averaging" SEL predictor of y based on x for
these training data. (Give values f^ (1) ; f^ (2) ; and ^f (3).)

A.27 Section 11.2 Exercises


1. (6E2-11) The machine learning/data mining folklore is full of statements
like "combining uncorrelated classi…ers through majority voting produces a com-
mittee classi…er better than every individual in the committee." This is simply
not necessarily true. Consider the Vardeman and Morris scenario outlined in
the table below as regards the joint distribution of classi…ers f1 ; f2 ; and f3 and
a target (class variable) y taking values in f0; 1g.

a) Find the expected 0-1 loss for the individual classi…ers and for the "ma-
jority vote" classi…er. Note that the classi…ers are independent according to
this joint distribution.

310
b) Treat the vector of values of f1 ; f2 ; and f3 as "available data" and …nd
the conditional distributions of the vector given y = 0 and y = 1. What is in
fact the best function of these classi…ers in terms of expected expected 0-1 loss?
(Look again at Sections 1.4 and 1.5.) How does its error rate compare to the
error rates from a)?

A.28 Section 11.4 Exercises


1. (5E1-18) Again use the training set in Problem 3 of Section A.2 without
bothering to center y, consider using boosting to create a SEL predictor for it.
As your set of "basis functions for successive corrections" adopt the 10 indicator
functions
l1 (x) = I [x < :2] ; l2 (x) = I [x < :35] ; l3 (x) = I [x < :5] ; l4 (x) = I [x < :65] ;
l5 (x) = I [x < :8] ; u1 (x) = I [x > :2] ; u2 (x) = I [x > :35] ; u3 (x) = I [x > :5] ;
u4 (x) = I [x > :65] ; u5 (x) = I [x > :8]

Take f^0 (x) = y and using a "learning rate" of :5, …nd f^1 (x), the …rst boosted
iterate. (This will be f^0 (x) plus a multiple of one of the indicator functions.)

2. (6HW-11) (Izenman Problem 14.4.) Consider 2-class classifaction problem


with input space <2 and N = 10 observations in the table below.

y 1 1 1 1 1 2 2 2 2 2
x1 1 3:5 4:5 6 1:5 8 3 4:5 8 2:5
x2 4 6:5 7:5 6 1:5 6:5 4:5 4 1:5 0

Plot the (x1 ; x2 ) pairs on a scatterplot using di¤erent symbols or colors to


distinguish the two classes (1 and 2). Carry through the AdaBoost.M1 algorithm
on these points "by hand" for M = 4 iterations, showing the weights at each step
of the process. Determine the voting function and …nal classi…er and calculate
its training error rate.

3. (6E2-11) Find the M = 3 AdaBoost.M1 classi…er for the data of Problem


2 of Section A.23.

4. (6HW-13) Consider the famous Swiss Bank Note dataset. Use caret
train() to choose (via LOOCV) both AdaBoost.M1 and random forest 0-1
loss classi…ers based on these data. For a …ne grid of points indicate on a 2-d
plot which points get classi…ed to classes 1 and 1 so that you can make visual
comparisons.

5. (6HW-13) This problem concerns the "Seeds" dataset at the UCI Machine
Learning Repository. Standardize all p = 7 input variables before beginning
analysis.
a) Consider …rst the problem of classi…cation where only varieties 1 and 3
are considered (temporarily code variety 1 as 1 and variety 3 as +1) and use

311
only predictors x1 and x6 Use caret train() to choose (via LOOCV) both
AdaBoost.M1 and random forest 0-1 loss classi…ers based on these data. For
a …ne grid of points in [ 3; 3] [ 3; 3], indicate on a 2-d plot which points get
classi…ed to classes 1 and 1 so that you can make visual comparisons.
b) The paper "ada: An R Package for Stochastic Boosting" by Culp, John-
son, and Michailidis that appeared in the Journal of Statistical Software dis-
cusses using a one-versus-all strategy to move AdaBoost to a multi-class problem
known as the "AdaBoost.MH" algorithm. Continue the use of only predictors
x1 and x6 and …nd both an appropriate random forest classi…er and an Ad-
aBoost.MH classi…er for the 3-class problem with p = 2, and once more show
how the classi…ers break the 2-d input space up into regions of constant classi-
…cation.
c) How much better can you do at the classi…cation task using a random
forest classi…er based on all p = 7 input variables than you are able to do in
part b)? (Use LOOCV error rate to make your comparison.)

6. (6E2-13) Below is a toy K = 2 class training set for N = 4. Carry out ("by
hand") enough steps of the AdaBoost.M1 algorithm (…nd a number of iterations
M large enough) to produce a voting function with 0 training error rate. Plot
this function and indicate on the x axis which regions call for classi…cation to
the y = 1 class.
y 1 1 1 1
x 1 2 3 5

7. (5HW-14) Return to the context of Problem 4 of Section A.23.


a) Use the gbm package in R to …t several boosted regression trees to the
training set (use at least 2 di¤erent values of tree depth with at least 2 di¤erent
values of learning rate). What are values of training error and then test error
based on the size 5000 test set for these?
b) How do predictors in Problem 4 of Section A.23, Problem 2 of Section
A.25, and here compare in terms of test error? Evaluate y^ for each of the …rst
5 cases in your test set for all predictors and list all of the inputs and each of
the predictions in a small table.
c) Call your predictor from Problem 2 of Section A.25 f^1 and pick one of
your predictors from a) to call f^2 . Use your test set and approximate

E y f^1 (x) ; E y f^2 (x) ; Var y f^1 (x) ; Var y f^2 (x) ;

and Corr y f^1 (x) ; y f^2 (x)

(these expectations are across the joint distribution of (x; y) for the …xed training
set (and randomization for the random forest). Identify an approximately
optimizing
2
E y f^1 (x) + (1 ) f^2 (x)

312
Is the optimizer 0 or 1 (i.e. is the best linear combination of the two predictors
one of them alone)?

8. (6E1-19) Below is a toy p = 2 training set with N = 6. (The 6 values of


y are plotted near x = (x1 ; x2 ) locations corresponding to their input vectors.)
Consider SEL boosting using "2-split SEL regression trees" (trees with 3 …nal
nodes) as base predictors. (Two splits are made to produce each e^m (x).)

a) Beginning with f^0 (x) 7 and the …rst split of iteration 1 (for making
e^1 (x)) as indicated on the left …gure, draw in the 2nd split. Using it and a
= :5 learning rate, place the N = 6 values yi f^1 (xi ) onto the right …gure.
On that, mark the 2 cuts for creating e^2 (x).
b) Then, again using a = :5 learning rate and now your e^2 (x) implied by
the 2 cuts on the right …gure above, below show the regions on which f^2 (x) is
constant and indicate the values of f^2 (x) in those regions.

9. (6E1-17) Suppose that in a toy 2-class classi…cation model with p = 1 using


the y 2 f 1; 1g coding one has N = 5 training cases in the small table below.

y 1 1 1 1 1
x 1:5 :5 :5 1:5 2:5

In a gradient boosting exercise with the hinge loss


5
X
(1 yi g (xi ))+
i=1

313
and base functions I [x < c] and I [x > c] 8c, suppose that one has a current
function version gm (x) = 3x. Derive the function gm+1 (x).

10. (6E1-15) Consider the p = 2 prediction problem based on N = 9 training


points as below.
0 1 0 1
8 1 1
B 3 C B 1 0 C
B C B C
B 3 C B 1 1 C
B C B C
B 5 C B 0 1 C
1 B C B C
Y =p B 1 C and X = (x1 ;x2 ) = p1 B 0 0 C
B
6B 5 CC 6B C
B 0 1 C
B C B C
B 1 C B 1 1 C
B C B C
@ 3 A @ 1 0 A
5 1 1

a) Find the SEL lasso coe¢ cient vector ^ optimizing SSE+8 ^lasso + ^lasso
1 2
lasso
and give the corresponding Y^ .
b) "Boost" your lasso SEL predictor from a) using ridge regression with
= 1 and a learning rate of = :1. Give the resulting vector of predictions
b o ost1
Y^ .
c) Why is it clear that the predictor in b) is a linear predictor? What is ^
b o ost1
such that Y^ = X ^?
d) Now "boost" your SEL lasso predictor from a) using a best "stump"
regression tree predictor (one that makes only a single split) and a learning rate
b o ost2
of = :1. Give the resulting vector of predictions Y^ .

11. (6E2-15) The AdaBoostM.1 classi…cation algorithm is essentially an ap-


plication of general gradient boosting to exponential loss and basic function
updates that are simple "binary stumps." This problem concerns applying the
algorithm with hinge loss, L (^y ; y) = [1 y y^]+ (for the 1 and 1 coding for y
and y^ 2 <), and linear functions of predictor x 2 <p , say 0 + x0 , as basic
function updates. (Of course, since linear combinations of linear functions are
linear, this can only produce a best linear voting function.)
a) What starting function f0 (x) would be used?
b) With the (m 1) iterate fm 1 (x) in hand, each y~im is in f 1; 1g. Using
appropriate indicator functions, give an explicit formula for y~im in terms of yi
and y^im 1 = fm 1 (xi ).
c) Describe in words how you would use standard statistical software to
produce 0m and m so that all 0m + x0i m approximate the values y~im .
P
N
d) Why does optimization of [1 yi (fm 1 (xi ) + ( 0m +x0i m ))]+ over
i=1
choices of involve comparison of this quantity for at most N values of ? Give
a formula for values of that you might have to check.

314
e) After M iterations you won’t have an fM (x) taking only values 1 and
1 at every xi . How do you use fM (x) to do classi…cation?

12. (5HW-16) Use R and make a simple set of boosted predictions of home
price for the dataset of Problem 5 Section A.2 by …rst …tting a "default" random
forest (using randomForest), then correcting a fraction = :1 of the residu-
als predicted using a 7-nn predictor, then correcting a fraction = :1 of the
residuals predicted using a 1 component PLS predictor. Then permute the
orders in which you make these corrections and compare SSE for the 6 di¤erent
possibilities.

13. (5E2-14) Below are hypothetical counts from a small training set in a
2-class classi…cation problem with a single input, x 2 < (and we’ll treat x as
integer-valued). Although it is easy to determine what an approximately opti-
mal (0-1 loss) classi…er is here, instead consider use of the AdaBoost.M1 algo-
rithm to produce a classi…er. (Use "stumps"/two-node trees that split between
integer values as basis functions.) Find an M = 3 term version of the Ad-
aBoost.M1 voting function. (Give f^1 ; 1 ; f^2 ; 2 ; f^3 ; and 3 . The f^m s are of the
P3
form sign(x #) or sign(# x) and the …nal voting function is m=1 m ; f^m .)

x=1 x=2 x=3


y=1 3 5 2
y= 1 5 4 6

14. (5E1-20) In a toy p = 1 SEL prediction problem, the table in Problem 28 of


the Section A.2 provides N = 5 training cases. An initial predictor f^0 (x) 0 is
boosted using simple linear regression and a learning rate of = 1=3 to produce
the predictor f^1 (x) = :5x. This problem is about making 2 more "boosting"
steps to produce f^3 (x).
a) Make a 1-nn "boosting" correction to f^1 (x) with learning rate = :5 to
produce f^2 (x) = f^1 (x) + :5^
e2 (x). (Give a formula/expression for e^2 (x), a step
function constant on 5 consecutive intervals.)
b) Find values for f^2 (x) at x = 2; 1; 0 1; 2. Then consider a "regression
tree with a single split" "boosting" correction to this predictor. Choose from
values f 1:5; :5; :5; 1:5g for the location of your split (justifying your choice)
and then give a formula for e^3 (x) (a step function taking 2 values).

A.29 Section 12.1 Exercises


1. (6HW-11) Figure 4.4 of HTF gives a 2-dimensional plot of the "vowel
training data" (available on the book’s website at
http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html or from
the UCI data repository. The ordered pairs of …rst 2 canonical variates are
plotted to give a "best" reduced rank LDA picture of the data like that below
(lacking the decision boundaries).

315
Use the material of Section 12.1 to reproduce Figure 4.4 of HTF (color-coded
by group, with group means clearly indicated). Keep in mind that you may need
to multiply one or both of your coordinates by 1 to get the exact picture.

2. (6E2-11) Suppose that in a p = 2 linear discriminant analysis problem, four


1 0 4
transformed means k = 2 (
k ) are 1 = ; 2 = ; 3 =
0 4
3:5 :5
; and 4 = . These have sample covariance matrix
1:5 2:5
! !
3:125 1:625 p1 p1 p1 p1
= 2 2 diag (4:75; 1:5) 2 2
1:625 3:125 p1 p1 p1 p1
2 2 2 2

Suppose that one wants to do reduced rank (rank = 1) linear discrimination


based on a single real variable
1
w = (u1 ; u2 ) 2 (x )

Identify an appropriate vector (u1 ; u2 ) and with your choice of vector, give the
function f (w) mapping < ! f1; 2; 3; 4g that de…nes this 4-class classi…er for the
case of 1 = 2 = 3 = 4 .

3. (6E2-13) In a 6-class, p = 3 linear discriminant problem with equal class


probabilities ( 1 = 2 = 3 = 4 = 5 = 6 ), unit eigenvectors correspond-
ing to the largest 2 eigenvalues of the sample covariance matrix of the sphered
(according to the common within-class covariance matrix) class means are re-
spectively
0
1 1 0
v 1 = p ; 0; p and v 2 = (0; 1; 0)
2 2
Suppose that inner product pairs (h k ;v 1 i ; h k ;v 2 i) for the sphered class means
are as below and that reduced rank (rank = 2) linear classi…cation is of interest.
0
How should a sphered p = 3 observation x = (3; 4; 5) be classi…ed?

Class 1 2 3 4 5 6
Inner Product Pair (5; 0) ( 5; 0) (0; 3) (0; 3) (0; 0) (0; 0)

4. (6HW-17) Use the "Glass Identi…cation" dataset referred to in Problem 6


of Section A.2. Do the following with it, not using the "problematic-looking"
inputs "Ba" and "Fe".
a) Use the lda() function in the MASS package and do LDA based on all
p = 7 inputs and …nd LOOCV 0-1 loss error rates for each type of glass and
overall.
b) Using the function stepclass() in the R package klaR (or otherwise)
use cross-validation to select a number of variables to use in linear discriminant

316
analysis for classi…cation among the 6 glass types. Then choose this number of
input variables by forward selection with the whole dataset. What are they?
c) Find the …rst 2 canonical coordinates for all 215 cases in the dataset. Plot
N = 215 ordered pairs of these using di¤erent plotting symbols for the K = 6
glass types. Overlay on this plot classi…cation regions based on LDA with these
…rst 2 canonical coordinates. Make a plot analogous to the plot in Figure 4.11
of HTF. (You may simply di¤erently color points on a …ne grid according to
which glass such a point would be classi…ed to.)

A.30 Section 12.2 Exercises


1. (6HW-11) Consider again the context of Problem 1 of Section A.29.
a) Use the R function lda (in the MASS package) to obtain the group means
and coe¢ cients of linear discriminants for the vowel training data. Save the lda
object by a command such as LDA=lda(insert formula, data=vowel).
b) Reproduce a version of the left …gure below. You will need to plot
the …rst two canonical coordinates as in Problem 1 of Section A.29. Decision
boundaries for this …gure are determined by classifying to the nearest group
mean. Do the classi…cation for a …ne grid of points covering the entire area of
the plot. You may plot the points of the grid with color coding according to
their classi…cation instead of drawing in the black lines.

c) Make a version of the right …gure above with decision boundaries now
determined by using logistic regression as applied to the …rst two canonical vari-
ates. You will need to create a data frame with columns y; canonical variate
1; and canonical variate 2. Use the vglm function (in the VGAM package) with
family=multinomial() to do the logistic regression. Save the object created
by a command such as LR=vglm(insert formula, family=multinomial(),
data=data set). A set of observations can now be classi…ed to groups by us-
ing the command predict(LR, newdata, type=“response”), where newdata

317
contains the observations to be classi…ed. The outcome of the predict function
will be a matrix of probabilities. Each row contains the probabilities that a cor-
responding observation belongs to each of the groups (and thus sums to 1). We
classify to the group with maximum probability. As in b), do the classi…cation
for a …ne grid of points covering the entire area of the plot. You may again plot
the points of the grid, color-coded according to their classi…cation, instead of
drawing in the black lines.
d) So that you can plot results, …rst use the 2 canonical variates employed
thus far and use rpart in R to …nd a classi…cation tree with training error
rate comparable to the reduced rank LDA classi…er pictured on the left above.
Make a plot showing the partition of the region into pieces associated with the
11 di¤erent classes. (The intention here is that you show rectangular regions
indicating which classes are assigned to each rectangle, in a plot that might be
compared to the plots above and from Problem 1 of Section A.29.)
e) The Culp, Johnson, and Michailidis paper referred to in Problem 5 of
Section A.28 discusses using a one-versus-all strategy that moves AdaBoost to
a multi-class problem known as the "AdaBoost.MH" algorithm. Continue the
use of the …rst two canonical coordinates of the vowel training data and …nd
both an appropriate random forest classi…er and an AdaBoost.MH classi…er for
the 11-class problem with p = 2, and once more show how the classi…ers break
the 2-d space up into regions to be compared to other plots here.
f ) Beginning with the original vowel dataset (rather than with the …rst 2
canonical variates) and use rpart in R to …nd a classi…cation tree with training
error rate comparable to the classi…er in d). How much (if any) simpler/smaller
is the tree here than in d)?

2. (6HW-13) Consider again the Swiss Bank Note dataset of Problem 4 of


Section A.28. Use caret train() to choose (via LOOCV using glmnet) a
logistic regression-based 0-1 loss classi…er based on these data. Compare its
training set and cross-validation error rates to what you found in Section A.28.

3. (5E2-14) Overall, only a very small fraction of people presented with a


certain merchandising o¤er will respond to it. A set of 5 qualitative predictors
(potential personal traits) is thought to be related to response. Values for these
5 predictors are obtained from a group of 96 people who responded to the o¤er
and from a group of 604 who did not. Treating the input xj as taking the value
1 if a subject has trait j and 0 otherwise, a model for

p (x) = probability of responding to the o¤er given characteristics x

of the form
p (x)
log = 0 + 1 x1 + 2 x2 + 3 x3 + 4 x4 + 5 x5
1 p (x)
was …t (via maximum likelihood) to the 700 training cases yielding results
^0 = 3:42; ^1 = 0:41; ^2 = 1:76; ^3 = 0:03; ^4 = 0:13; ^5 = 2:09

a) Treating the 700 subjects (that were used to fit the logistic regression
model) as a random sample of people of interest (which it is surely not), give a
linear function g(x) such that f̂(x) = I[g(x) > 0] is an approximately optimal
(0-1 loss) classifier (y = 1 indicating response to the offer).
b) Continuing with the logistic regression model, properly adjust your answer
to a) to provide an approximately optimal (0-1 loss) classifier for a case
where a fraction π_1 = 1/1000 of all potential customers would respond to the
offer.

A.31 Section 13.1 Exercises


1. (6E1-11) Below is a small classification training set (for K = 2 classes)
displayed in graphical and tabular forms (circles are class −1 and squares are
class 1). Using geometry (not trying to solve an optimization problem analytically)
find the maximum margin classifier for this problem. You may find it
helpful to know that if u, v, and w are points in ℜ^2 and u_1 ≠ v_1, then the distance
from the point w to the line through u and v is

$$\frac{\left|\,w_1(v_2 - u_2) - w_2(v_1 - u_1) + v_1 u_2 - u_1 v_2\,\right|}{\sqrt{(v_1 - u_1)^2 + (v_2 - u_2)^2}}$$

List the set of support vectors and evaluate the margin for your classi…er.

A.32 Section 13.2 Exercises


1. (6HW-11) Consider again the Wisconsin breast cancer dataset of Problem
24 of Section A.2. Compare an appropriate support vector classi…er (SVM
with "linear kernel") based on the original input variables to a classi…er based
on logistic regression using the same variables. (Use caret train() to identify
"best" versions of these classi…ers in terms of LOOCV misclassi…cation rates.)

2. (5E2-14) Below is a cartoon representing the results of 3 di¤erent runs of
support vector classi…cation software on a set of training data representing K =
3 different classes in a problem with input space ℜ^2. Each pair of classes was
used to produce a linear classi…cation boundary for classi…cation between those
two. (Labeled arrows tell which sides of the lines correspond to classi…cation
to which classes.) 7 di¤erent regions are identi…ed by Roman numerals on the
cartoon. Indicate values of an OVO (one-versus-one) classi…er f^OVO for this
situation. (For each region, identify decisions 1,2, or 3, or "?" if there is no
clear choice for a given region.)

A.33 Section 13.3 Exercises


1. (6HW-11) This problem concerns the famous p = 2 "Ripley dataset"
(synth.tr) commonly used as a classification example.
a) Using several different values of λ and constants c, find a function g ∈ A
and β_0 ∈ ℜ minimizing
$$\sum_{i=1}^{N}\left(1 - y_i\left(\beta_0 + g(x_i)\right)\right)_+ + \lambda\,\|g\|_{\mathcal{A}}^2$$
for the (Gaussian) kernel K(x, z) = exp(−c‖x − z‖^2). Make contour plots for
those functions g, and, in particular, show the g(x) = 0 contour that separates
[−1.5, 1.0] × [−0.2, 1.3] into the regions where a corresponding SVM classifies to
classes −1 and 1.
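The criterion in a) is the usual SVM optimization, so one computational route is e1071::svm with a radial kernel, treating its (gamma, cost) parameters as stand-ins for (c, λ). A sketch, assuming synth.tr has been read into R with its usual columns xs, ys, and 0/1 class label yc:

library(e1071)

train <- data.frame(x1 = synth.tr$xs, x2 = synth.tr$ys,
                    y  = factor(ifelse(synth.tr$yc == 1, 1, -1)))

fit <- svm(y ~ x1 + x2, data = train, kernel = "radial",
           gamma = 1, cost = 1, scale = FALSE)

# decision values beta_0 + g(x) on a fine grid, then the g(x) = 0 contour
gx <- seq(-1.5, 1.0, length = 200)
gy <- seq(-0.2, 1.3, length = 200)
grid <- expand.grid(x1 = gx, x2 = gy)
dec <- attr(predict(fit, grid, decision.values = TRUE), "decision.values")

contour(gx, gy, matrix(dec, 200, 200), levels = 0, drawlabels = FALSE)
points(train$x1, train$x2, col = as.integer(train$y) + 1, pch = 16)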
b) Have a look at the Culp, Johnson, and Michailidis paper referred to in
Problem 5 of Section A.28. It provides perspective and help with both the
ada and randomForest packages. Find both AdaBoost.M1 and random forest
classifiers appropriate for the Ripley example. For a fine grid of points in
[−1.5, 1.0] × [−0.2, 1.3], indicate on a 2-d plot which points get classified to classes
−1 and 1 so that you can make visual comparisons to the SVM classifiers referred
to in a).

2. (6E2-11) In what speci…c way(s) does the use of kernels and SVM method-
ology typically lead to identi…cation of a small number of important features
(basis functions) that are e¤ective in 2-class classi…cation problems?

3. (6HW-13) Consider again the Swiss Bank Note dataset of Problem 4 of
Section A.28. Use caret train() to choose (via LOOCV) an SVM based on a
Gaussian kernel and 0-1 loss for these data. Compare its training set
and cross-validation error rates to what you found in Problem 4 of Section
A.28 and Problem 2 of Section A.30.

4. (6HW-13) Repeat part a) of Problem 1 above on the Seeds data of Problem


5 of Section A.28. (Do the plotting on [−3, 3] × [−3, 3].)

5. (6E2-13) Consider again the toy classi…cation scenario of Problem 6 in


Section A.28.
a) Is there a linear classifier based directly/only on x with err = 0? Explain.
b) Is there a support vector machine classifier based on the kernel K(x, z) =
(1 + xz)^2 with err = 0? Explain.
c) Is there a support vector machine classifier based on the kernel K(x, z) =
exp(−2(x − z)^2) that has err = 0? Explain.

6. (5HW-14) Return to the context of Problem 7 Section A.2, Problem 1


Section A.5, and Problem 5 in Section A.23 and the N = 400 training set and
large test set.
a) Apply linear discriminant analysis to the training set. Identify the regions
in (0, 1)^2 corresponding to the values of ŷ = 1, 2, 3, 4. Evaluate the (conditional
on the training set) test error rate for LDA based on this training set.
b) Use logistic regression (e.g. as implemented in glm() or glmnet()) on
the training data to …nd 6 classi…ers with linear boundaries for choice between
all pairs of classes. Then consider an OVO classi…er that classi…es x to the class
with the largest sum (of 3) estimated probabilities coming from these logistic
regressions. Make a plot showing the regions in (0, 1)^2 where this classifier has
f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the (conditional on the
training set) error rate of this classifier.
c) It seems from the glmnet() documentation that using
family="multinomial" one can …t multivariate versions of logistic regression
models. Try this using the training set. Consider the classi…er that classi…es
x to the class with the largest estimated probability. Make a plot showing
the regions in (0, 1)^2 where this classifier has f̂(x) = 1, 2, 3, and 4. Use the
large test set to evaluate the (conditional on the training set) error rate of this
classifier.
d) Pages 360-361 of K&J indicate that upon converting output y taking
values in {1, 2, 3, 4} to 4 binary indicator variables, one can use nnet with the
4 binary outputs (and the option linout = FALSE) to …t a single hidden layer
neural network to the training data with predicted output values between 0 and
1 for each output variable. Try several di¤erent numbers of hidden nodes and
"decay" values to get …tted neural nets. From each of these, de…ne a classi…er
that classi…es x to the class with the largest predicted response. Use the large
test set to evaluate (conditional) error rates of these classi…ers and pick the one

321
with the smallest. Make a plot showing the regions in (0, 1)^2 where your best
neural net classifier has f̂(x) = 1, 2, 3, and 4.
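A sketch of the indicator-output neural network fit described in d), assuming a training data frame train with numeric inputs x1, x2 and class label y taking values 1, ..., 4 (the object names are hypothetical):

library(nnet)

Y <- class.ind(train$y)                     # 4 binary indicator columns
X <- as.matrix(train[, c("x1", "x2")])

# single hidden layer, logistic (not linear) output units
fit <- nnet(X, Y, size = 10, decay = 0.01, linout = FALSE,
            maxit = 1000, trace = FALSE)

pred <- predict(fit, X)                     # N x 4 matrix of fitted outputs in (0,1)
yhat <- max.col(pred)                       # classify to the largest predicted output
mean(yhat != train$y)                       # training error rate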
e) Use svm() in package e1071 to …t SVMs to the y = 1 and y = 2 training
data for the

"linear" kernel,
"polynomial" kernel (with default order 3),
"radial basis" kernel (with default gamma, half that gamma value, and
twice that gamma value)

Use the plot() function to investigate the nature of the 5 classi…ers. Put
the training data pairs on the plot using di¤erent symbols or colors for classes
1 and 2, and also identify the support vectors.
f ) Find SVMs (using the kernels indicated in e)) for the K = 4 class prob-
lem. Again, use the plot() function to investigate the nature of the 5 classi…ers.
Use the large test set to evaluate the (conditional) error rates for these 5 clas-
si…ers.
g) Use either the ada package or the adabag package and …t an AdaBoost.M1
classifier to the y = 1 and y = 2 training data. Make a plot showing the regions
in (0, 1)^2 where this classifier has f̂(x) = 1 and 2. Use the large test set
to evaluate the conditional error rate of this classi…er. How does this error
rate compare to the best possible one for comparing classes 1 and 2 with equal
weights on the two? (You should be able to get the latter analytically.)
h) It appears from the Culp, Johnson, and Michailidis paper referred to in
Problem 5 of Section A.28 that ada implements a OVA version of a K-class
AdaBoost classifier in R. Use this and find the corresponding classifier. Make
a plot showing the regions in (0, 1)^2 where this classifier has f̂(x) = 1, 2, 3, and
4. Use the large test set to evaluate the conditional error rate of this classifier.

7. (5E2-14) Consider a 2-class 0-1 loss classification problem with {−1, 1}
coding of y. For input x ∈ ℜ^2 and a parameter λ > 0, based on a training set
of size N consider the classifier
$$\hat{f}(x)=\begin{cases} \;\;\,1 & \text{if } \displaystyle\sum_{i \text{ with } y_i=1}\exp\left(-\lambda\|x-x_i\|^2\right) \;>\; \sum_{i \text{ with } y_i=-1}\exp\left(-\lambda\|x-x_i\|^2\right) \\ -1 & \text{if } \displaystyle\sum_{i \text{ with } y_i=1}\exp\left(-\lambda\|x-x_i\|^2\right) \;<\; \sum_{i \text{ with } y_i=-1}\exp\left(-\lambda\|x-x_i\|^2\right) \end{cases}$$

a) On what basis might one expect that for large N this classi…er is approx-
imately optimal?
b) For what "voting function" g(x) is f̂(x) = sign(g(x))? Is this g(x) a
linear combination of radial basis functions?
c) Why will f^ (x) typically not be of the form of a support vector machine
based on a Gaussian kernel?

A.34 Section 14 Exercises
1. (6E2-11) The reduced rank classi…er of Problem 2 of Section A.29 can be
thought of as a "prototype classifier." Give 4 prototypes (real numbers) that
can be thought of as defining the classifier.

2. (6HW-13) Return to the context of Problem 5 of Section A.28 and the


Seeds dataset. Use the K-means method in the R stats package (or some other
equivalent method) to …nd K (7-dimensional) prototype vectors for representing
each of the 3 wheat varieties for each of K = 5; 7; 10. Then compare training
error rates for classi…ers that classify to the variety with the nearest prototype
for these values of K.
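One possible R sketch, assuming a standardized 210 x 7 input matrix X and a vector variety giving the wheat variety of each case (the object names are assumptions):

prototype_classify <- function(X, variety, K) {
  variety <- as.factor(variety)
  # K-means prototypes fit separately within each variety
  protos <- lapply(levels(variety), function(v)
    kmeans(X[variety == v, ], centers = K, nstart = 25)$centers)
  labels <- rep(levels(variety), each = K)
  P <- do.call(rbind, protos)

  # classify every case to the variety of its nearest prototype
  d2 <- as.matrix(dist(rbind(P, X)))[-(1:nrow(P)), 1:nrow(P)]
  yhat <- labels[apply(d2, 1, which.min)]
  mean(yhat != as.character(variety))     # training error rate
}

sapply(c(5, 7, 10), function(K) prototype_classify(X, variety, K))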

3. (6HW-17) Consider again the Wisconsin breast cancer dataset of Problem


24 of Section A.2. In what follows use standardized versions of the p = 9 inputs.
a) For both malignant cases and (separately) benign cases, use K-means
clustering and by considering error sums of squares across the inputs, identify
small values of K beyond which more clusters are "not essential" in represent-
ing the data cases. Make parallel coordinates plots (if you aren’t familiar with
these, see e.g. https://datascience.blog.wzb.eu/2016/09/27/parallel-
coordinate-plots-for-discrete-and-categorical-data-in-r-a-comparison/)
for the K_malignant and K_benign mean vectors produced. (Use the same vertical
scales on the two plots so that you can compare them.)
b) Using the K_malignant + K_benign mean vectors identified in part a) as proto-
types, classify the cases in the dataset according to whether the closest proto-
type represents a malignant or a benign case. What are the training error rates
(malignant, benign, and overall)?
c) Follow the LVQ algorithm as outlined in the exposition for 1000 itera-
tions beginning from the K_malignant + K_benign mean vectors identified in part a)
as prototypes. Use a series of learning rates ε_m = 0.1(0.999)^{m−1}. Then classify
the cases in the dataset according to whether the closest prototype represents
a malignant or a benign case. What are the training error rates (malignant,
benign, and overall)?
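A sketch of an LVQ1-style update loop of the kind described in c); the exposition's exact variant may differ in details. It assumes a standardized input matrix X, class labels y, and starting prototype rows P with classes proto_class from part a):

lvq1 <- function(X, y, P, proto_class, iters = 1000) {
  N <- nrow(X)
  for (m in 1:iters) {
    eps <- 0.1 * 0.999^(m - 1)                 # learning rate epsilon_m
    i <- sample(N, 1)                          # pick a training case at random
    d2 <- rowSums((P - matrix(X[i, ], nrow(P), ncol(P), byrow = TRUE))^2)
    j <- which.min(d2)                         # nearest prototype
    sgn <- if (proto_class[j] == y[i]) 1 else -1
    P[j, ] <- P[j, ] + sgn * eps * (X[i, ] - P[j, ])   # move toward/away from x_i
  }
  P
}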

A.35 Section 15.2 Exercises


1. (6HW-11) Let A be the set of absolutely continuous functions on [0, 1] with
square integrable first derivatives (that exist except possibly on a set of measure
0). Equip A with the inner product
$$\langle h, g\rangle_{\mathcal{A}} = h(0)\,g(0) + \int_0^1 h'(x)\,g'(x)\,dx$$

a) Show that
R (x; z) = 1 + min (x; z)
is a reproducing kernel for this Hilbert space of functions.

b) Using Heckman’s development, describe as completely as possible
$$\arg\min_{h \in \mathcal{A}}\left(\sum_{i=1}^{N}\left(y_i - h(x_i)\right)^2 + \lambda\int_0^1\left(h'(x)\right)^2 dx\right)$$

c) Using Heckman’s development, describe as completely as possible


$$\arg\min_{h \in \mathcal{A}}\left(\sum_{i=1}^{N}\left(y_i - \int_0^{x_i} h(t)\,dt\right)^2 + \lambda\int_0^1\left(h'(x)\right)^2 dx\right)$$

2. (6HW-15) In the context of Problem 1 above, consider the toy dataset


below.
y   1.1   1.5   2.4   2.2   1.7   1.3   0.3   0.1   0.1   0.5   0.1
x   0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

a) For two different values of λ > 0 find the optimizing function h ∈ A for
the criterion in part b) of Problem 1.
b) For two different values of λ > 0 find the optimizing function h ∈ A for
the criterion in part c) of Problem 1.

A.36 Section 15.3 Exercises


1. (6E2-15) Consider the Gaussian kernel K(x, z) = exp(−(x − z)^2) for x
and z in [−2, 4] and a corresponding RKHS, A. Based on the very small (x, y)
training set

y    4    4    3    3    2
x   −1    0    1    2    3

we wish to fit a function of the form f̂(x) = β_0 + β_1 x + h(x) for h ∈ A under
the fitting criterion
$$\sum_{i=1}^{5}\left(y_i - \hat{f}(x_i)\right)^2 + 2\,\|h\|_{\mathcal{A}}^2$$

You may use the fact that the least squares line through these data pairs is
ŷ = 3.7 − 0.5x. Find the optimizing f̂(x).

2. (6HW-17) Return to the baseball home run dataset of Problem 7 of


Section A.13 (treating the year index as "x"). Consider the two kernel func-
tions K_1(x, z) = exp(−0.5(x − z)^2) and K_2(x, z) = exp(−(x − z)^2) and the
corresponding RKHSs (say A_1 and A_2). For λ = 1 and λ = 10 find coefficients
β_0, β_1, and β_2 and function h ∈ A minimizing
$$\sum_{i=1}^{N}\left(y_i - \left(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + h(x_i)\right)\right)^2 + \lambda\,\|h\|_{\mathcal{A}}^2$$

(There are 4 different optimizations intended here for the two kernels and two
values of λ.) Plot the 4 resulting functions
$$\beta_0 + \beta_1 x + \beta_2 x^2 + h(x)$$

on a single set of axes, together with the 145 original (x; y) data points. (If
this is computationally infeasible, you may reduce the size of the problem by
considering only the "last N " years in the dataset, where N doesn’t break your
computer.)
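One way to carry out these fits: by the representer theorem the optimal h has the form h(t) = Σ_i c_i K(x_i, t), and (c, β) then solve a linear system. A sketch, assuming numeric vectors x and y holding the year index and home run values (the function name is hypothetical):

fit_kernel_quad <- function(x, y, kern, lambda) {
  N  <- length(x)
  K  <- outer(x, x, kern)                     # N x N Gram matrix
  Tm <- cbind(1, x, x^2)                      # unpenalized quadratic part
  # stationarity conditions: (K + lambda I) c + Tm beta = y  and  t(Tm) c = 0
  A   <- rbind(cbind(K + lambda * diag(N), Tm),
               cbind(t(Tm), matrix(0, 3, 3)))
  sol <- solve(A, c(y, rep(0, 3)))
  c_hat <- sol[1:N]; beta <- sol[N + 1:3]
  function(t) beta[1] + beta[2] * t + beta[3] * t^2 +
    sapply(t, function(s) sum(c_hat * kern(x, s)))
}

K1  <- function(u, v) exp(-0.5 * (u - v)^2)
f11 <- fit_kernel_quad(x, y, K1, lambda = 1)    # one of the four fits
curve(f11, from = min(x), to = max(x))
points(x, y)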

A.37 Section 15.4 Exercises


1. (6HW-11) Center the outputs for the dataset of Problem 1 of Section
A.18. Then derive sets of predictions ŷ_i based on μ(x) ≡ 0 Gaussian process
priors for f (x). Plot several of those as functions on the same set of axes (along
with centered original data pairs) as follows:
a) Make one plot for cases with σ^2 = 1, ρ(d) = exp(−c d^2), τ^2 = 1, 4, and
c = 1, 4.
b) Make one plot for cases with σ^2 = 1, ρ(d) = exp(−c|d|), τ^2 = 1, 4, and
c = 1, 4.
c) Make one plot for cases with σ^2 = 0.25, but where otherwise the parameters
of a) are used.
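A computational sketch of the zero-prior-mean Gaussian process predictions at the training inputs. Because the symbols above are garbled in this copy, the roles assumed below (tau2 multiplying the correlation function as the process variance, sig2 the error variance) are assumptions:

gp_predict <- function(x, y, tau2, sig2, rho) {
  R <- outer(x, x, function(u, v) rho(u - v))   # correlation matrix of f(x_i)
  C <- tau2 * R                                 # prior covariance of f(x_i)
  as.vector(C %*% solve(C + sig2 * diag(length(x)), y))   # posterior mean
}

yc <- y - mean(y)                               # centered outputs
o  <- order(x)
plot(x, yc, pch = 16)
for (cc in c(1, 4)) for (t2 in c(1, 4)) {       # part a) combinations
  pr <- gp_predict(x, yc, tau2 = t2, sig2 = 1, rho = function(d) exp(-cc * d^2))
  lines(x[o], pr[o])
}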

2. (6HW-11) Consider again the situation of Problem 1 Section A.16. Center


the outputs and then derive a set of predictions ŷ_i based on a μ(x) ≡ 0 prior
for f(x). Use σ^2 = (0.02)^2, ρ(x − z) = exp(−2‖x − z‖^2), and τ^2 = 0.25. How do
these compare to the ones you made in Section A.16?

3. (6HW-13) Consider again the situation of Problem 2 Section A.16. Center


the outputs and then derive a set of predictions ŷ_i based on a μ(x) ≡ 0 prior
for f(x). (Use ρ(x − z) = exp(−c‖x − z‖^2) and what seem to you to be
appropriate values of c, σ^2, and τ^2.) How do your predictions compare to the
ones you made in Section A.16?

4. (6HW-15) Consider again the situation of Problem 17 Section A.2. Center


the outputs and then derive a set of predictions ŷ_i based on a μ(x) ≡ 0 prior
for f (x). Plot several of those as functions on the same set of axes (along with
centered original data pairs) as follows:
a) Make one plot for cases with what appears to you to be a sensible choice
of σ^2, for ρ(d) = exp(−c d^2), τ^2 = σ^2, 4σ^2, and c = 1, 4.
b) Make one plot for cases with ρ(d) = exp(−c|d|) and the choices of
parameters you made in a).
c) Make one plot for cases with σ^2 one fourth of your choice in a), but where
otherwise the parameters of a) are used.

A.38 Section 17.1 Exercises
1. (6HW-13) There is a small fake dataset below. It purports to be a record
of 20 transactions in a drugstore where toothpaste, toothbrushes, and shaving
cream are sold. Assume that there are 80 other transaction records that include
no purchases of any toothpaste, toothbrush, or shaving cream.

a) Find I_{0.02} (the collection of item sets with support at least 0.02).
b) Find all association rules derivable from rectangles in I_{0.02} with confidence
at least 0.5.
c) Find the association rule derivable from a rectangle in I_{0.02} with the largest
lift.

2. (5E2-14) In a toy transaction database there are 5 transactions with items


from the set of letters A through G. These are:
Transaction Number Items Included
1 A,B,D,G
2 B,C,E,G
3 A,C,D,F
4 C,D,E,G
5 A,B,C,G

a) Find all item sets of support at least 0.4.


b) For the 3-item set with the largest support, what are the confidence,
expected confidence, and lift of the associated conjunctive rules?
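These counts are small enough to do by hand, but the arules package (not part of the problem statement) can be used to check answers. A sketch built from the five listed transactions:

library(arules)

trans <- as(list(c("A","B","D","G"),
                 c("B","C","E","G"),
                 c("A","C","D","F"),
                 c("C","D","E","G"),
                 c("A","B","C","G")), "transactions")

# item sets with support at least 0.4 (part a)
sets <- eclat(trans, parameter = list(supp = 0.4))
inspect(sort(sets, by = "support"))

# rules with their support, confidence, and lift (for part b)
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.3))
inspect(sort(rules, by = "lift"))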

A.39 Section 17.2 Exercises


1. (6HW-13) Work again with the Seeds data of Problem 5 Section A.28. Be-
gin by again standardizing all p = 7 measured variables. JMP will do clustering

for you. (Look under the Analyze->Multivariate menu.) In particular, it will
do both hierarchical and K-means clustering, and even self-organizing mapping
as an option for the latter. Consider
i) several di¤erent K-means clusterings (say with K = 9; 12; 15; 21),
ii) several di¤erent hierarchical clusterings based on 7-d Euclidean distance
(say, again with K = 9; 12; 15; 21 …nal clusters), and
iii) SOMs for several different grids (say 3×3, 3×5, 4×4, and 5×5).
Make some comparisons of how these methods break up the 210 data cases
into groups. You can save the clusters into the JMP worksheet and use the
GraphBuilder to quickly make plots. If you "jitter" the cluster numbers and
use "variety" for both size and color of plotted points, you can quickly get a
sense as to how the groups of data points match up method-to-method and
number-of-clusters-to-number-of-clusters (and how the clusters are or are not
related to seed variety). Also make some comparisons of the sums of squared
Euclidean distances to cluster centers.

2. (6HW-13) A p = 2 dataset (that has N = 200 cases of (x1 ; x2 ) pairs) with


"obvious graphical structure" is provided with these notes. Plot these 200 pairs
and see that there are somehow 4 di¤erent kinds of "structure" in the dataset.
Apply the "graphical spectral features" idea (use w(d) = exp(−d^2/c)) and see
if you can "find" the 4 structures in the dataset (by appropriate choice of c and
using hierarchical clustering of 200 vectors of 4 or fewer dimensions).
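A sketch of one common version of the construction (similarity weights w, an unnormalized graph Laplacian, and the eigenvectors for its smallest eigenvalues as features); the exposition's exact recipe may differ in details. Here X is the 200 x 2 data matrix and the value of c is something to experiment with:

spectral_features <- function(X, c, k = 4) {
  D2 <- as.matrix(dist(X))^2
  W  <- exp(-D2 / c); diag(W) <- 0          # weights w(d) = exp(-d^2/c)
  Dg <- diag(rowSums(W))
  L  <- Dg - W                              # (unnormalized) graph Laplacian
  ev <- eigen(L, symmetric = TRUE)
  ev$vectors[, ncol(W) - (1:k) + 1]         # eigenvectors for the k smallest eigenvalues
}

Z      <- spectral_features(X, c = 0.05, k = 4)
hc     <- hclust(dist(Z), method = "complete")
groups <- cutree(hc, k = 4)
plot(X, col = groups, pch = 16)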

3. (6E2-13) Give a p = 1 dataset of size N = 4 that shows that the result


of ordinary K-means clustering can depend upon the starting cluster centers.
(List the 4 data values, consider the 2-cluster problem, and give two di¤erent
pairs of starting centers that produce di¤erent …nal clusterings. Your starting
centers do not need to be data points.)

4. (6E2-13) Below is a toy proximity matrix for N = 6 items. Show the


steps of agglomerative hierarchical clustering (from 5 to only 2 clusters) using
both single and complete linkage. (At every stage, list the clusters as subsets
of {1, 2, 3, 4, 5, 6}. In case of "ties" at any step, pick any of the equivalent
possibilities.)
$$\begin{pmatrix} 0 & 1 & 1 & 1.41 & 1.41 & 1.74 \\ 1 & 0 & 1.40 & 1.01 & 1.73 & 1.41 \\ 1 & 1.40 & 0 & 1.72 & 1.01 & 1.41 \\ 1.41 & 1.01 & 1.72 & 0 & 1.40 & 1 \\ 1.41 & 1.73 & 1.01 & 1.40 & 0 & 1 \\ 1.74 & 1.41 & 1.41 & 1 & 1 & 0 \end{pmatrix}$$
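The hand agglomeration steps can be checked in R by feeding this proximity matrix to hclust, for example:

D <- matrix(c(0,    1,    1,    1.41, 1.41, 1.74,
              1,    0,    1.40, 1.01, 1.73, 1.41,
              1,    1.40, 0,    1.72, 1.01, 1.41,
              1.41, 1.01, 1.72, 0,    1.40, 1,
              1.41, 1.73, 1.01, 1.40, 0,    1,
              1.74, 1.41, 1.41, 1,    1,    0), 6, 6, byrow = TRUE)

for (meth in c("single", "complete")) {
  hc <- hclust(as.dist(D), method = meth)
  print(hc$merge)                 # agglomeration order
  print(cutree(hc, k = 5:2))      # memberships for 5, 4, 3, and 2 clusters
}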

5. (5HW-14) Apply model-based clustering to the "USArrests" data in basic


R using the mclust package and interpret your results.
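A minimal sketch (standardizing the four variables first is one reasonable choice, not something the problem dictates):

library(mclust)
mc <- Mclust(scale(USArrests))
summary(mc)                          # chosen covariance model and number of clusters
plot(mc, what = "classification")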

6. (6HW-17) Consider again the "Glass Identi…cation" dataset of Problem 6 of


Section A.2. Use mclust (for Gaussian model-based clustering) and hclust (for

327
hierarchical clustering) using average linkage to cluster the 215 glass samples on
the basis of the 7 (standardized) inputs used there. For the case of 6 clusters
from each method, make a table giving counts of cases in a given cluster from
mclust and a given cluster from hclust. Then compute the "Rand index" for
comparing clusterings (look it up on Wikipedia).
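A sketch of the comparison, assuming a standardized input matrix X of the 7 measurements; the Rand index is computed directly from its pairwise-agreement definition:

library(mclust)
cl_mc <- Mclust(X, G = 6)$classification
cl_hc <- cutree(hclust(dist(X), method = "average"), k = 6)

table(mclust = cl_mc, hclust = cl_hc)     # cross-tabulation of the two clusterings

rand_index <- function(a, b) {
  # fraction of pairs of cases on which the two clusterings agree
  n <- length(a)
  same_a <- outer(a, a, "==")[lower.tri(diag(n))]
  same_b <- outer(b, b, "==")[lower.tri(diag(n))]
  mean(same_a == same_b)
}
rand_index(cl_mc, cl_hc)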

7. (5E2-14) Below is a representation of a toy 9-point dataset with p =


1. Use agglomerative hierarchical clustering …rst with single linkage and then
with complete linkage to …nd K = 3 clusters in these values. List for each
agglomeration step all groups of more than one value: (You don’t need to
list every value in the dataset.)

8. (5HW-20) Consider the problem of clustering points x1 ; x2 ; x3 ; : : : ; xr


belonging to ℜ^p after transforming them to an abstract function space on ℜ^p
using the mapping T(x)(·) = K(x, ·) = exp(−‖x − ·‖^2), where the function
space inner product for points mapped from ℜ^p is ⟨T(x), T(z)⟩_A = K(x, z).
Suppose that squared function-space distance is the dissimilarity measure used.
a) Describe agglomerative hierarchical clustering in enough detail that it
could be implemented from any formulas and instructions that you supply.
b) Describe K-means clustering in the function space in enough detail that
it could be implemented from any formulas and instructions that you supply.
(Notice that the concept of arithmetic average makes sense in any linear space,
including the abstract feature space.)

A.40 Section 17.3 Exercises


1. (6HW-13) Use appropriate R packages/functions and do multi-dimensional
scaling on the 210 cases of the Seeds dataset used in Problem 5 Section A.28,
mapping from ℜ^7 to ℜ^2 using Euclidean distances. Plot the 210 vectors z_i ∈ ℜ^2
using different plotting symbols for the 3 different varieties.
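Classical MDS via cmdscale is one option here (MASS::isoMDS and MASS::sammon are alternatives). A sketch, assuming a standardized 210 x 7 matrix X and a factor variety:

D <- dist(X)                       # Euclidean distances in R^7
Z <- cmdscale(D, k = 2)            # 210 x 2 configuration
plot(Z, pch = c(1, 2, 3)[as.integer(variety)], col = as.integer(variety),
     xlab = "coordinate 1", ylab = "coordinate 2")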

2. (6E2-13) Below is a toy proximity matrix for N = 4 items. If one should


want to map items to <1 in a way that makes distances between corresponding
points in <1 approximately equal to the dissimilarities in the matrix, there is
no loss of generality in assuming that the …rst item is mapped to z1 = 0. Say
why there is then no loss of generality to assume that the second item is
mapped to a positive value, i.e. z2 > 0 and provide a suitable function of z2 ; z3 ;
and z4 that you would try to optimize in order to accomplish this task.
$$\begin{pmatrix} 0 & 1 & 1 & \sqrt{2} \\ 1 & 0 & \sqrt{2} & 1 \\ 1 & \sqrt{2} & 0 & 1 \\ \sqrt{2} & 1 & 1 & 0 \end{pmatrix}$$

3. (6HW-17) The 10 countries in the world with the largest populations are
China, India, United States, Indonesia, Brazil, Pakistan, Nigeria, Bangladesh,
Russia, and Mexico. You can …nd the (great circle) distances between their
capital cities using this online calculator:
http://www.chemical-ecology.net/java/capitals.htm Use multi-dimensional
scaling to make a 2-d representation of these cities intended to more or less pre-
serve great circle distances. (The pattern at
http:/www.personality-project.org.html might prove helpful to you.)

A.41 Section 18.2.1 Exercises


1. (6E2-13) Below is a network diagram for a simple restricted Boltzmann
machine (with hidden nodes 1 and 2, and visible nodes 3 and 4). Assume
the corresponding probability model for x = (x_1, x_2, x_3, x_4) has parameters
θ_{01}, θ_{02}, θ_{03}, θ_{04}, θ_{13}, θ_{14}, θ_{23}, and θ_{24} and that somehow the network has been
"trained" producing θ̂_{01} = θ̂_{02} = 1, θ̂_{03} = θ̂_{04} = 1, θ̂_{13} = θ̂_{14} = 1, and θ̂_{23} =
θ̂_{24} = 1.

a) Find (for the …tted model) the ratio P [x = (1; 0; 1; 0)] =P [x = (0; 0; 0; 0)].
b) Find (for the …tted model) the conditional distribution of (x1 ; x2 ) given
that (x3 ; x4 ) = (0; 0). (You will need to produce 4 conditional probabilities.)

A.42 "General/Comprehensive" Exercises


1. (5HW-20) Consider the White Wines Dataset56 from the UCI Machine
Learning Data Repository
http://archive.ics.uci.edu/ml/datasets/Wine+Quality:
Consider SEL prediction of what can be learned about wine "quality" from
the 11 input variables. There are roughly 5000 cases in this dataset, and it is
about at the (size) limit of what is conveniently handled using R and an ordinary
laptop. (Other faster software like Python or MatLab and/or implementation
on a server or cluster may be required for bigger datasets with many machine
learning applications.)
56 The White Wines Dataset is not absolutely ideal as an example in that the response

variable can take only integer values 1 through 10 and is probably not really an interval-
level variable in the …rst place (being more ordinal in nature). For purposes of exercise we
will ignore these matters, and treat the quality rating as a measured numerical response and
consider prediction under SEL.

a) Find sets of best (according to LOOCV) predictions for the quality ratings
for

k-nn prediction
elastic net prediction
PCR prediction
PLS prediction
MARS prediction (implemented in earth)
regression tree prediction
random forest prediction
boosted trees prediction
Cubist prediction

Say what parameters you settled on for each method.


b) Make a scatterplot matrix for all 9 sets of predictions in a) plus the y
values and OLS predictions. Compute a correlation matrix for these 11 sets of
values and display this rounded to 2 decimal places.
c) Consider the problem of combining the 9 "basic" prediction methodolo-
gies employed in a) via stacking/generalized stacking/meta-prediction/super-
learning. There is nothing that says that the "good" sets of parameters you
developed for "individual" use of the prediction methods are in any sense "good"
choices if ultimately one is going to use the methods as elements of an "ensem-
ble." But for purposes of exercise here, we are not going to "redo" them, but
will take them as chosen. (We will here consider combining these through the
use of …rst OLS MLR and then through the use of a random forest made with
"default" parameters.)
Randomly break the White Wines dataset into 10 folds of sizes as nearly
equal as possible. For each fold and its remainder …t a predictor using the
remainder as a training set via each of the methods in a) (and the parameters
of the methods previously identi…ed) and use it to make predictions for cases in
the fold. Then

1. Use the remainder as a training set and the values of the 9 predictors (on
the remainder) as "features" in a MLR model (including intercept). Use
OLS to …t this to the outputs for the cases in the remainder.
2. Use the remainder as a training set and the values of the 9 predictors (on
the remainder) as "features" and …t a default random forest to the outputs
for the cases in remainder.
3. Apply the coe¢ cients from 1. to the 9 predictions to make an ensemble
prediction for each case in the fold.

4. Apply the random forest …t in 2. to the 9 predictions to make an ensemble
prediction for each case in the fold.
5. For both the predictions in 3. and 4. add the squared di¤erences between
outputs and predicted outputs across the fold.

6. Total the results of 5. across the 10 folds, divide by N , and take a square
root to get a "RMSPE" for the basic methods and parameters combined
through OLS and through a random forest.

Do the values "RMSPE" in 6. improve on what you have for the best of the
CVRMSPEs for the individual methods?
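A sketch of the fold loop in steps 1.-6. above. The data frame wine (with response quality) and the named list fits of nine fitting functions (each taking a training data frame and returning an object with a predict method, using the tuning parameters chosen in a)) are hypothetical stand-ins:

set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(wine)))
sse_ols <- sse_rf <- 0

for (k in 1:10) {
  rem  <- wine[fold != k, ]
  hold <- wine[fold == k, ]

  # refit the 9 tuned methods on the remainder; predictions on remainder and fold
  models <- lapply(fits, function(f) f(rem))
  Zrem   <- sapply(models, predict, newdata = rem)
  Zhold  <- sapply(models, predict, newdata = hold)
  colnames(Zrem) <- colnames(Zhold) <- names(fits)

  # steps 1. and 3.: OLS meta-model fit on the remainder, applied to the fold
  meta_ols <- lm(y ~ ., data = data.frame(y = rem$quality, Zrem))
  pred_ols <- predict(meta_ols, newdata = data.frame(Zhold))

  # steps 2. and 4.: default random forest meta-model
  meta_rf <- randomForest::randomForest(Zrem, rem$quality)
  pred_rf <- predict(meta_rf, Zhold)

  # step 5.: accumulate squared prediction errors over the fold
  sse_ols <- sse_ols + sum((hold$quality - pred_ols)^2)
  sse_rf  <- sse_rf  + sum((hold$quality - pred_rf)^2)
}

# step 6.: "RMSPE" for the two ensembles
sqrt(c(OLS = sse_ols, RF = sse_rf) / nrow(wine))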
d) Discuss how you would get an "honest" CV assessment of likely perfor-
mance of the strategy of …rst …tting predictors using methods in a) obtaining
parameters from caret train() and then combining them via OLS MLR or
default random forest. Explain why the "RMSPE" values from c) are probably
too optimistic to serve the purpose here.

2. (5HW-20) Consider again the Glass Identi…cation dataset and 2-class


classi…cation problem of Problem 6 of Section A.2.
a) Use LOOCV to identify a good number of neighbors to use for k-nn
classi…cation (based on 0-1 loss) for the 2-class classi…cation problem.
b) Use LOOCV to identify a good classi…cation tree for 0-1 loss in the 2-class
classi…cation problem.
c) Use the OOB error and optimize a classi…cation random forest over choice
of both m and nm in for 0-1 loss in the 2-class classi…cation problem.
d) Use LOOCV to identify a good (single layer feed-forward) neural network
for classi…cation (optimize over both number of hidden nodes and weight decay)
based on 0-1 loss for the 2-class classi…cation problem.
e) Use LOOCV to identify a good elastic net penalized logistic regression
for 0-1 loss 2-class classi…cation.
f ) Use LOOCV to identify a good support vector classi…er (based on 0-1
loss) for the 2-class classi…cation problem. (That is, …nd a good SVM with
"linear kernel.")
g) Use as much LOOCV grid-searching as you can a¤ord (time-wise) to
identify a good support vector machine with "Gaussian kernel" (based on 0-1
loss) for the 2-class classi…cation problem.
h) Use as much LOOCV grid-searching as you can a¤ord (time-wise) to
identify a good number of iterations for an AdaBoost.M1 classi…er based on 0-1
loss for the 2-class classi…cation problem.
i) Use as much LOOCV grid-searching as you can a¤ord (time-wise) to
identify a good tree-boosting classi…er using XGBoost for 0-1 loss 2-class classi-
…cation.
j) Compare the classi…ers in parts a) through i) on the basis of

training error rates (for 0-1 loss)

the AUC criterion^57 (see the pROC sketch following this list), and
cross-validation (and OOB) 0-1 loss error rates.
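A sketch of the AUC computation mentioned in footnote 57, assuming a 0/1 response vector y2 for the 2-class problem and estimated class-2 probabilities phat from one of the classifiers in a)-i):

library(pROC)
r <- roc(response = y2, predictor = phat)
plot(r)        # the ROC curve
auc(r)         # area under the curve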

3. (5HW-20). Consider again the White Wines dataset of Problem 1 above,


but now the problem of predicting the class variable

y' = I[y ≥ 7]

Call a wine with rating 7 or better a "good" wine and this becomes a problem
of classi…cation of wines into "not good" and "good" ones.
a) Carry out the steps a) through i) in Problem 2 above (there referring to
the Glass-Identi…cation problem) for this wine classi…cation problem.
Consider the problem of combining basic classi…cation methodologies via
stacking/generalized stacking/meta-prediction/super-learning in the White Wines
classi…cation problem immediately above.
b) Use the outputs of your classi…ers developed in a) and the original input
variables (the 11 quality measures giving 9 + 11 "features" in total) as inputs to
a default random forest. (Where they are available, use estimated conditional
probabilities for class 2 rather than the classi…cation values assigned to the
training cases by the classi…ers.) What "training error rate" is produced for
0-1 loss? There is a nominal random forest "OOB error" rate associated with
your …nal "super-learner." Why should you NOT trust either of these numbers
as being indicative of the likely performance of the "tune 9 classi…ers and plug
their outputs into a default random forest" prediction methodology?
c) Say very clearly and carefully how (given plenty of computing power) you
would compute an honest assessment of the likely performance of the "super-
learner" described above.

57 You may, for example, use the pROC package to (plot the "ROC curve" and) compute this.

