Lecture 22: Bagging and Random Forest
Wenbin Lu
Department of Statistics
North Carolina State University
Fall 2019
Outline
Bagging Methods
Bagging Trees
Random Forest
Applications in Causal Inference/Optimal Treatment Decision
Bagging Trees
Ensemble Methods (Model Averaging)
A machine learning ensemble meta-algorithm designed to improve the
stability and accuracy of machine learning algorithms
widely used in statistical classification and regression.
can reduce variance and help to avoid overfitting.
usually applied to decision tree methods, but can be used with any
type of method.
Bootstrap Aggregation (Bagging)
Bagging is a special case of the model averaging approach.
Bootstrap aggregation = Bagging
Bagging leads to "improvements for unstable procedures" (Breiman,
1996), e.g. neural nets, classification and regression trees, and subset
selection in linear regression (Breiman, 1994).
On the other hand, it can mildly degrade the performance of stable
methods such as K-nearest neighbors (Breiman, 1996).
Basic Idea
Given a standard training set D of size n, bagging
generates B new training sets Di , each of size n′, by sampling from D
uniformly and with replacement. This kind of sample is known as a
bootstrap sample.
By sampling with replacement, some observations may be repeated in
each Di .
The B models are fitted using the B bootstrap samples and combined
by averaging the outputs (for regression) or voting (for classification).
If n′ = n, then for large n, the set Di is expected to contain the fraction
(1 − 1/e) ≈ 63.2% of the unique examples of D, the rest being
duplicates.
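A quick numerical check of the 63.2% figure (a minimal R sketch; the sample size
and number of replications below are arbitrary choices):

set.seed(1)
n <- 1000
## For each replicate, draw a bootstrap sample of indices and record the
## fraction of distinct original observations that it contains.
frac_unique <- replicate(200, length(unique(sample(n, n, replace = TRUE))) / n)
mean(frac_unique)   # close to 1 - exp(-1) = 0.632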
Bagging Procedures
Bagging uses the bootstrap to improve the estimate or prediction of a fit.
Given data Z = {(x1 , y1 ), ..., (xn , yn )}, we generate B bootstrap
samples Z∗b .
Empirical distribution P̂: put equal probability 1/n on each
(xi , yi ) (discrete).
Generate Z∗b = {(x1∗ , y1∗ ), ..., (xn∗ , yn∗ )} ∼ P̂, b = 1, ..., B.
Obtain fˆ∗b (x), b = 1, ..., B.
The Monte Carlo estimate of the bagging estimate is
    \hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
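A minimal sketch of this procedure in R with a regression tree as the base
learner, assuming a data frame dat with response y and a data frame newx of
points at which to predict (all names here are illustrative):

library(rpart)

B <- 200
preds <- sapply(1:B, function(b) {
  idx  <- sample(nrow(dat), replace = TRUE)   # draw the bootstrap sample Z*b
  fitb <- rpart(y ~ ., data = dat[idx, ])     # fit f^*b on the bootstrap sample
  predict(fitb, newdata = newx)               # evaluate f^*b at the new points
})
## preds is an m x B matrix when newx has m > 1 rows; average over the B fits
f_bag <- if (is.matrix(preds)) rowMeans(preds) else mean(preds)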
Properties of Bagging Estimates
Advantages:
Note fˆbag (x) → EP̂ fˆ∗ (x) as B → ∞;
fˆbag (x) typically has smaller variance than fˆ(x);
fˆbag (x) differs from fˆ(x) only when the latter is a nonlinear or adaptive
function of the data.
Bagging Classification Trees
In classification problems, there are two scenarios:
(1) fˆ∗b (x) is an indicator vector, with a single 1 and K − 1 0's (hard
classification);
(2) fˆ∗b (x) = (p̂1 (x), ..., p̂K (x)), the estimates of the class probabilities
P̂b (Y = k|X = x), k = 1, · · · , K (soft classification). The bagged
estimates are the average predictions at x from the B trees:
    \hat{f}_k^{bag}(x) = B^{-1} \sum_{b=1}^{B} \hat{f}_k^{*b}(x), \quad k = 1, \ldots, K.
Bagging Classification Trees (cont.)
There are two types of averaging:
(1) Use the majority vote: \arg\max_k \sum_{b=1}^{B} I\{\hat{f}^{*b}(x) = k\}.
(2) Use the averaged probabilities. The bagged classifier is
    \hat{G}_{bag}(x) = \arg\max_k \hat{f}_k^{bag}(x).
Note: the second method tends to produce estimates with lower variance
than the first method, especially for small B.
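A minimal R sketch of the two aggregation rules at a single point x, assuming
pred_b is a length-B vector of hard class predictions and prob_b is a B x K
matrix of class probabilities with the class labels as column names (both
objects are hypothetical):

## (1) Majority vote over the B hard predictions
vote_class <- names(which.max(table(pred_b)))

## (2) Average the class probabilities across trees, then classify
avg_prob   <- colMeans(prob_b)
prob_class <- names(which.max(avg_prob))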
Example
Sample size n = 30, two classes
p = 5 features, each having a standard Gaussian distribution with
pairwise correlation Corr(Xj , Xk ) = 0.95.
The response Y was generated according to
Pr(Y = 1|x1 ≤ 0.5) = 0.2, Pr(Y = 1|x1 > 0.5) = 0.8.
The Bayes error is 0.2.
A test sample of size 2, 000 was generated from the same population.
We fit
a classification tree to the training sample, and
classification trees to each of 200 bootstrap samples.
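A sketch of how such a training sample could be generated in R, using
MASS::mvrnorm for the equicorrelated Gaussian features (the object names are
illustrative):

library(MASS)

n <- 30; p <- 5
Sigma <- matrix(0.95, p, p); diag(Sigma) <- 1   # pairwise correlation 0.95
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)  # n x p Gaussian features
prob <- ifelse(X[, 1] <= 0.5, 0.2, 0.8)         # Pr(Y = 1 | x1)
Y <- rbinom(n, 1, prob)
train <- data.frame(Y = factor(Y), X)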
[Figure (Hastie, Tibshirani & Friedman, The Elements of Statistical Learning,
2001, Chapter 8, Figure 8.9): Bagging trees on the simulated dataset. The top
left panel shows the original tree; the remaining panels show five trees grown
on bootstrap samples, which split on different features (x.1, x.2, x.3, x.4)
and at different cutpoints.]
About Bagging Trees
The original tree and five bootstrap trees are all different:
with different splitting features
with different splitting cutpoints
The trees have high variance due to the correlation in the predictors.
Under squared-error loss, averaging reduces variance and leaves bias
unchanged.
Therefore, bagging will often decrease the MSE.
[Figure (Hastie, Tibshirani & Friedman, The Elements of Statistical Learning,
2001, Chapter 8, Figure 8.10): Error curves for the bagging example of
Figure 8.9. Shown is the test error of the original tree and of the bagged
trees as a function of the number of bootstrap samples (0 to 200). The green
points correspond to majority vote, the purple points to averaging the
probabilities; the Bayes error (0.2) is marked for reference.]
About Bagging
Bagging can dramatically reduce the variance of unstable procedures
like trees, leading to improved prediction.
Bagging smooths out this variance and hence reduces the test error.
Bagging can stabilize unstable procedures.
The simple structure in the model can be lost due to bagging:
a bagged tree is no longer a tree;
the bagged estimate is not easy to interpret.
Under 0-1 loss for classification, bagging may not help, due to the
nonadditivity of bias and variance.
Random Forest
Random Forests (Breiman, 2001):
A random forest is an ensemble classifier that consists of many decision
trees and outputs the class that is the mode of the classes output by the
individual trees.
Bagging: random subsampling of the training set.
Random subspace method (Ho, 1995, 1998): random subsampling of
the feature space (called "feature bagging"), used to reduce the
correlation between trees grown on ordinary bootstrap samples: if one
or a few features are very strong predictors for the response variable
(target output), these features will be selected in many of the trees,
causing the trees to become correlated.
Random forest: combines bagging with the random subspace method.
Learning Algorithm for Building A Tree
Denote the training size by n and the number of variables by p. Assume
m < p is the number of input variables to be used to determine the
decision at a node of the tree.
Randomly choose n samples with replacement (i.e., take a bootstrap
sample).
Use the remaining (out-of-bag) samples to estimate the error of the
tree by predicting their classes.
For each node of the tree, randomly choose m variables on which to
base the decision at that node, and calculate the best split based on
these m variables in the training set.
Each tree is fully grown and not pruned.
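Assuming a data frame dat with response y (an illustrative name), these
ingredients map onto a call to the randomForest package roughly as follows
(the mtry value shown is just the common default for classification):

library(randomForest)

p  <- ncol(dat) - 1                        # number of input variables
rf <- randomForest(y ~ ., data = dat,
                   ntree      = 500,             # number of bootstrap trees
                   mtry       = floor(sqrt(p)),  # m variables tried at each split
                   replace    = TRUE,            # bootstrap sampling of the n cases
                   importance = TRUE)            # keep variable-importance measures
rf                                               # prints the out-of-bag (OOB) error estimate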
Properties of Random Forest
Advantages:
highly accurate in many real examples; fast; handles a very large
number of input variables.
can estimate the importance of variables for classification (see the
example after this list).
generates an internal unbiased estimate of the generalization error as
the forest building progresses.
can impute missing data and maintain accuracy when a large proportion
of the data are missing.
provides an experimental way to detect variable interactions.
can balance error in unbalanced data sets.
computes proximities between cases, useful for clustering, detecting
outliers, and (by scaling) visualizing the data.
can be extended to unlabeled data, leading to unsupervised clustering,
outlier detection and data views.
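For instance (continuing the hypothetical rf fit above), variable importance
and case proximities are available directly from the randomForest object:

rf <- randomForest(y ~ ., data = dat, importance = TRUE, proximity = TRUE)
importance(rf)        # permutation and node-impurity importance of each variable
varImpPlot(rf)        # plot the importance measures
prox <- rf$proximity  # n x n matrix: fraction of trees in which two cases
                      # fall in the same terminal node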
Properties of Random Forest
Disadvantages:
Random forests are prone to overfitting for some data sets. This is
even more pronounced in noisy classification/regression tasks.
Random forests do not handle large numbers of irrelevant features well.
Implementation:
R packages: randomForest; randomForestSRC (for survival data); grf
(generalized random forest for causal effect estimation).
Asymptotics of generalized random forest: Athey et al. (2019)
Applications in Causal Inference/Optimal Treatment
Decision
Potential outcomes: Y ∗ (a), an outcome that would result if a patient
were given treatment a ∈ A.
Observed data: response, Y (larger value of Y indicates better
outcome); p-dimensional covariates, X ∈ X ; received treatment,
A ∈ A.
Consider binary treatment case: A = {0, 1}.
Average treatment effect (ATE): ∆ = E {Y ∗ (1) − Y ∗ (0)}.
Conditional treatment effect (CTE):
τ (x) = E {Y ∗ (1)|X = x} − E {Y ∗ (0)|X = x}.
Average treatment effect on the treated (ATT):
∆1 = E {Y ∗ (1)|A = 1} − E {Y ∗ (0)|A = 1}.
Assumptions for Causal Inference
A1. Consistency assumption:
    Y = Y ∗ (1)A + Y ∗ (0)(1 − A).
A2. No unmeasured confounders assumption (strong ignorability):
    {Y ∗ (1), Y ∗ (0)} ⊥⊥ A | X.
A3. Positivity assumption: 0 < P(A = 1|X ) < 1 for any X .
Estimation of Causal Effects
Note that ∆ ≠ E (Y |A = 1) − E (Y |A = 0). (why?)
In fact,
    E { I (A = 1) Y / π(X ) } = E {Y ∗ (1)},
where π(X ) = P(A = 1|X ).
Under assumptions A1-A2, we have
    τ (x) = E (Y |A = 1, X = x) − E (Y |A = 0, X = x).
Proof:
    E {Y ∗ (1)|X = x} = E {Y ∗ (1)|A = 1, X = x} = E {Y |A = 1, X = x}.
How about ∆1 ?
    E {Y ∗ (0)|A = 1} = E {Y ∗ (0)A}/P(A = 1) = E {Y ∗ (0)π(X )}/P(A = 1).
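The weighting identities above follow from iterated expectations together with
A1-A2 (A3 ensures the division by π(X) is well defined); for example, for the
first one:

\begin{align*}
E\left\{\frac{A}{\pi(X)}\,Y\right\}
  &= E\left\{\frac{A}{\pi(X)}\,Y^*(1)\right\}                   && \text{(consistency, A1)}\\
  &= E\left[\frac{Y^*(1)}{\pi(X)}\,E\{A \mid X, Y^*(1)\}\right] && \text{(iterated expectations)}\\
  &= E\left\{\frac{Y^*(1)}{\pi(X)}\,\pi(X)\right\}              && \text{(A2: } E\{A \mid X, Y^*(1)\} = \pi(X))\\
  &= E\{Y^*(1)\}.
\end{align*}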
Estimation of CTE
Two ways of using random forests:
Method I: Fit Y ∼ RF (X , A). Denote the resulting estimator by
µ̂(X , A). Then, τ̂ (x) = µ̂(x, 1) − µ̂(x, 0).
Method II: Fit Y ∼ RF (X ) separately for the A = 1 and A = 0 groups.
Denote the resulting estimators by µ̂1 (X ) and µ̂0 (X ), respectively.
Then, τ̂ (x) = µ̂1 (x) − µ̂0 (x).
Method II is usually better (more accurate) than Method I.
Under assumptions A1-A2, it can be shown that the optimal treatment
decision rule is estimated by d̂ opt (x) = I {τ̂ (x) > 0}.
The decision rule based on τ̂ (x) obtained by RF may be difficult to
interpret clinically. In practice, a simple and interpretable decision rule
may be preferred. We may consider a classification tree built on the
labels Zi ≡ I {τ̂ (Xi ) > 0} and features Xi .
More accurately, we should consider a weighted classification of Zi on
Xi with the weights wi = |τ̂ (Xi )|.
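A hedged sketch of this weighted classification step using rpart (here tau and
X are the CTE estimates and covariate matrix from the random forest fit; the
names are illustrative, and the weights could equally be passed to tree()):

library(rpart)

Z    <- factor(as.numeric(tau > 0))   # label: estimated benefit from treatment 1
w    <- abs(tau)                      # weight: magnitude of the estimated CTE
dat2 <- data.frame(Z = Z, X)
rule <- rpart(Z ~ ., data = dat2, weights = w, method = "class")
plot(rule); text(rule)                # an interpretable tree-based treatment rule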
Estimation of ATE
Inverse probability weighted (IPW) estimation:
    \hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \frac{A_i Y_i}{\hat{\pi}(X_i)}
                 - \frac{1}{n}\sum_{i=1}^{n} \frac{(1 - A_i) Y_i}{1 - \hat{\pi}(X_i)},
where π̂(X ) is the estimated propensity score, obtained using, e.g.,
logistic regression or a random forest.
The IPW estimator lacks efficiency (why?).
Augmented inverse probability weighted (AIPW) estimation:
    \hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n}\left[ \frac{A_i Y_i}{\hat{\pi}(X_i)}
                   - \left\{\frac{A_i}{\hat{\pi}(X_i)} - 1\right\}\hat{\mu}_1(X_i) \right]
                 - \frac{1}{n}\sum_{i=1}^{n}\left[ \frac{(1 - A_i) Y_i}{1 - \hat{\pi}(X_i)}
                   - \left\{\frac{1 - A_i}{1 - \hat{\pi}(X_i)} - 1\right\}\hat{\mu}_0(X_i) \right].
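Given vectors A, Y, the estimated propensity scores pihat, and fitted outcome
regressions mu1hat and mu0hat (e.g., from the logistic-regression or
random-forest fits mentioned above; all names are illustrative), both
estimators are simple sample averages:

## IPW estimate of the ATE
ate_ipw  <- mean(A * Y / pihat) - mean((1 - A) * Y / (1 - pihat))

## AIPW (doubly robust) estimate of the ATE
ate_aipw <- mean(A * Y / pihat - (A / pihat - 1) * mu1hat) -
            mean((1 - A) * Y / (1 - pihat) - ((1 - A) / (1 - pihat) - 1) * mu0hat)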
Estimation of ATE/ATT
The AIPW estimator is doubly robust: ∆̂ is consistent when either π̂(x) is
consistent or both µ̂0 (x) and µ̂1 (x) are consistent.
Alternative estimators of ATE (regression estimators):
    \hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \{\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\},
    \hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \left[ A_i\{Y_i - \hat{\mu}_0(X_i)\}
                   + (1 - A_i)\{\hat{\mu}_1(X_i) - Y_i\} \right].
Estimators of ATT:
    \hat{\Delta}_1 = \frac{\sum_{i=1}^{n} A_i Y_i}{\sum_{i=1}^{n} A_i}
                   - \frac{\sum_{i=1}^{n} \frac{(1 - A_i)\hat{\pi}(X_i)}{1 - \hat{\pi}(X_i)}\, Y_i}{\sum_{i=1}^{n} A_i},
    \hat{\Delta}_1 = \frac{\sum_{i=1}^{n} A_i\{Y_i - \hat{\mu}_0(X_i)\}}{\sum_{i=1}^{n} A_i}.
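Continuing with the illustrative vectors from the previous sketch, these
estimators are equally direct to compute:

## Regression (plug-in) estimates of the ATE
ate_reg1 <- mean(mu1hat - mu0hat)
ate_reg2 <- mean(A * (Y - mu0hat) + (1 - A) * (mu1hat - Y))

## IPW-type and regression-based estimates of the ATT
att_ipw <- sum(A * Y) / sum(A) -
           sum((1 - A) * pihat / (1 - pihat) * Y) / sum(A)
att_reg <- sum(A * (Y - mu0hat)) / sum(A)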
AIDS Data Example
AIDS Clinical Trials Group Study 175 (ACTG175)
HIV-infected patients with CD4 counts of 200-500 cells/mm3
Randomized to four treatment groups:
zidovudine alone (ZDV)
zidovudine plus didanosine (ZDV+ddI)
zidovudine plus zalcitabine (ZDV+zal)
didanosine alone (ddI)
12 baseline clinical covariates, such as age (years), weight (kg),
Karnofsky score (scale of 0-100), CD4 count (cells/mm3 ) at baseline
Primary endpoints of interest: (i) CD4 count at 20 ± 5 weeks
post-baseline; (ii) the first time that a patient had a decline in their
CD4 cell count of at least 50%, or an event indicating progression to
AIDS, or death.
Survival Curves
[Figure: Kaplan-Meier survival curves for the four treatment groups over
roughly 1200 days of follow-up. black: ZDV only; red: ZDV+ddI; blue: ZDV+zal;
green: ddI only.]
A Closer Look at the Top 2 Treatments
A=1: ZDV + ddI
A=0: ZDV + zal

        Age ≤ 34   Age > 34   Total
A=1       266        256       522
A=0       263        261       524
Total     529        517      1046
Note: Age 34 is the median age for the patients
Heterogeneous Treatment Effects
[Figure: Kaplan-Meier survival curves (survival probability versus days) for
trt 1 (ZDV+ddI) and trt 0 (ZDV+zal), shown separately for the Age ≤ 34 and
Age > 34 subgroups.]
Random Forest Estimation of CTE
library(survival)
a = read.table("C:\\Users\\wlu4\\all.dat",sep=" ")
b = read.table("C:\\Users\\wlu4\\events.dat")
ix = match(as.numeric(b[,1]), as.numeric(a[,1]))
trt = as.numeric(b[!is.na(ix),4])
###Kaplan-Meier curve
t.time = as.numeric(b[!is.na(ix),3])
delta = as.numeric(b[!is.na(ix),2])
km.fit = survfit(Surv(t.time,delta)~trt)
plot(km.fit,ylim=c(0.55,1),col = c("black","red","blue",
"green") )
legend(10, .7, c("black: ZDV only", "red: ZDV+ddI",
"blue: ZDV+zal","green: ddI only"))
Random Forest Estimation of CTE
####consider ZDV + ddI & ZDV + zal groups
ix1 = (trt == 1)|(trt==2)
Y = as.numeric(a[ix1,20])
A = as.numeric(trt[ix1]==1)
X = as.matrix(a[ix1,c(2,3,7,19,23,4,5,6,12,13,14,16)])
###random forest fit
library(randomForestSRC)
Y1 = Y[A==1]
X1 = X[A==1,]
data1 = data.frame(cbind(Y1,X1))
fit1 = rfsrc(Y1~.,data=data1)
Y0 = Y[A==0]
X0 = X[A==0,]
data0 = data.frame(cbind(Y0,X0))
fit0 = rfsrc(Y0~.,data=data0)
Random Forest Estimation of CTE
mu11 = predict(fit1)
mu10 = predict(fit1,data0)
mu01 = predict(fit0,data1)
mu00 = predict(fit0)
## $predicted.oob: out of bag prediction
tau1 = mu11$predicted.oob - mu01$predicted
tau0 = mu10$predicted - mu00$predicted.oob
tau = numeric(length(Y))
tau[A==1] = tau1
tau[A==0] = tau0
cr = sum((tau > 0)*(Y > 0))/length(Y) #correct decision rate
OTR Estimation Using Tree
## Tree classification for optimal treatment rule
library(tree)
Z = as.numeric(tau > 0)
data2 = data.frame(cbind(Z,X))
## fit the original tree with split="deviance"
tree_otr=tree(Z~., data=data2, split="deviance")
## tree summary
summary(tree_otr)
plot(tree_otr)
text(tree_otr)
OTR Estimation Using Tree
## 10-fold CV for parameter tuning
set.seed(11)
tree_cv = cv.tree(tree_otr,,prune.tree,K=10)
min_idx = which.min(tree_cv$dev)
min_idx
tree_cv$size[min_idx]
tree_otr_prune = prune.tree(tree_otr, best = 12)
plot(tree_otr_prune)
text(tree_otr_prune)
Original Tree Using Deviance
[Figure: the unpruned classification tree fitted with split="deviance". The
first split is on V2 < 35.5, with subsequent splits on V2, V3, V19, V23, and
V14; terminal nodes display the fitted proportions of Z = 1 (ranging from
about 0.11 to 0.95).]
After Parameter Tuning based on 10-CV
[Figure: the pruned tree (12 terminal nodes) selected by 10-fold
cross-validation. The first split is again on V2 < 35.5, with subsequent
splits on V2, V3, V19, V23, and V14; terminal nodes display the fitted
proportions of Z = 1.]