Probability Estimation in Random Forests
by

Chunyang Li

MASTER OF SCIENCE
in
Statistics

Approved:

John R. Stevens
Committee Member

2013
ABSTRACT
Random Forests is a useful ensemble approach that provides accurate predictions for classification, regression, and many other machine learning problems. Classification has been a particularly useful and popular application of Random Forests. However, it is often preferable to have an estimate of the probability of class membership rather than the simple knowledge of which group an observation belongs to. Voting and the regression method are the current probability estimation methods that have been developed for Random Forests. In this thesis, we introduce two new methods, proximity weighting and the out-of-bag method, aimed at improving on the current methods. Several different simulations are designed to evaluate the new methods and compare them with the old ones. Finally, we use real data sets from the UCI machine learning repository to further compare the methods.
DEDICATION

I dedicate this to my parents, who love me deeply and give me their greatest support.
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Adele Cutler, for her patience with me and the time she spent helping me with this research. Thank you for setting me an awesome example of academic excellence, enthusiasm, and work ethic, and for giving me advice on everything.

I would also like to thank my committee members, Dr. Sun and Dr. Stevens, for your help with the course work that made me better prepared for this research. Thank you for all your advice and for your time serving on my committee.

Finally, I would like to thank all my officemates and friends in the department, who helped me and gave me a lot of support, and my roommates for their great friendship.
Chunyang Li
CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
LIST OF FIGURES
1 Introduction
  1.1 Classification
  1.2 Probability Estimation
  1.3 Trees
  1.4 Random Forests
2 Probability Estimation Methods
  2.1 Voting
  2.2 Probability Machines
  2.3 Proximity Weighting
  2.4 Out-of-bag Method
  2.5 Software
3 Simulation
  3.1 XOR Model (XOR1 and XOR2)
  3.2 2-dimensional Circle Model (2D Circle)
  3.3 10-dimensional Model (Friedman)
  3.4 Bivariate Normal Model (Binorm)
  3.5 Normal Cluster Mixtures (Clusters)
  3.6 Method to Measure the Performance of Class Probability Estimators
    3.6.1 Mean Squared Loss Function
    3.6.2 Misclassification Error Rate
  3.7 Simulation Results
    3.7.1 Results for Mean Squared Loss
    3.7.2 Results for Misclassification Error Rate
    3.7.3 Normal Cluster Mixtures with Different Ntrees
4 Data Examples
  4.1 Brier Score
  4.2 10-fold Cross Validation
  4.3 Missing Values
    4.3.1 Na.roughfix
    4.3.2 RfImpute
  4.4 Results for Real Data
5 Conclusions
6 Discussion and Future Work
REFERENCES
LIST OF TABLES

3.7 Mean Squared Loss for Normal Cluster Mixtures with Different Ntrees
3.8 Misclassification Error Rate for Normal Cluster Mixtures with Different Ntrees
LIST OF FIGURES

3.2 Training data (left) and true probabilities of being in class 1 (right) for the 2-Dimensional Circle Model

CHAPTER 1
INTRODUCTION

1.1 Classification
Classification is the problem of predicting which class a new observation belongs to,
given a training set of data from K known classes. One way to think about classification is that it is like regression, except that the response variable is categorical rather than continuous.
The training data consist of pairs $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i$ is a vector of $M$ predictor values and $y_i \in \{1, \ldots, K\}$ indicates the class. The problem is to classify a new observation $x_{new}$ into one of the K classes, i.e., to estimate $\hat{y}_{new}$. No formal distributional assumptions are made, although there is an implicit assumption that observations with "nearby" values of x tend to belong to the same class.
1.2 Probability Estimation

Most machine learning approaches only provide a classification result. However, a class label alone is often not enough, and probabilities are essential in some applications, such as predicting diseases. It is important to estimate the probability of belonging to each of the groups rather than making a simple statement that a patient is in one group or another. Therefore, our research problem is to estimate these class membership probabilities.
More formally, let $f_k(x)$ denote the density function for observations in class $k$. Then
\[
f(x) = \sum_{k=1}^{K} \pi_k f_k(x),
\]
where $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$. We can view the data $(x_1, y_1), \ldots, (x_N, y_N)$ as being generated by first drawing a class label with
\[
P(y = k) = \pi_k
\]
and then drawing $x$ from the class-conditional density
\[
f(x \mid y = k) = f_k(x),
\]
so
\[
f(x) = \sum_k \pi_k f_k(x).
\]
The quantity we want to estimate for a new observation is
\[
p_k = P(y = k \mid X = x_{new}) = \frac{\pi_k f_k(x_{new})}{f(x_{new})}.
\]
1.3 Trees
A classification tree is grown by recursively splitting nodes, starting from a single node that contains all the observations. Observations in a node are sent to its descendant nodes using a "split" on a single predictor variable. The initial ("root") node contains the data of interest, and nodes that are not split further are called "terminal" nodes. At any stage of the tree-growing process, a node contains observations in some or all of the classes {1, . . . , K}, and the y-values of those observations determine the class composition of the node.
Considering every possible split on every predictor variable, a particular split of a node is chosen. The predictor and split combination giving the "best" value according to some criterion, such as the Gini index or entropy, is used to partition the node. In this report, we use the Gini index.
For continuous predictors, each split is of the form $x_j < c$ for some $c$ and $j \in \{1, \ldots, M\}$, where $x_j$ denotes the value of variable $j$. Observations that have $x_j < c$ go to the left descendant node, and the others go to the right. The values of $j$ and $c$ for each node are found by minimizing a measure of the "badness" of the split based on the Gini index. For a node containing $S$ observations, $n_k$ of which belong to class $k$, the Gini index is
\[
G(n) = \sum_{k \neq k'} \frac{n_k}{S} \frac{n_{k'}}{S}.
\]
A particular choice of $c$ and $j$ will give left and right nodes summarized by the class counts $n_L = n_L(c, j)$ and $n_R = n_R(c, j)$, with sizes $S_L$ and $S_R$ where $S_L + S_R = S$. The values of $c$ and $j$ are chosen to minimize
\[
S_L G(n_L) + S_R G(n_R).
\]
Denoting the sorted sample values of variable $j$ in the node by $x_{(1)j}, \ldots, x_{(S)j}$, the candidate split points are
\[
c \in \left\{ \left( x_{(i)j} + x_{(i+1)j} \right)/2, \; i = 1, \ldots, S - 1 \right\}.
\]
The minimization over both $j$ and $c$ is performed using an exhaustive search over all combinations of $c$ and $j$ as described above. For a given variable $j$, the Gini index is updated incrementally as $c$ moves through the sorted values. The trees are usually grown until a stopping criterion is met or the nodes contain a small number of observations.
To predict the class of xnew , we start from the root node and move down the tree. At
a node, if the jth component of xnew satisfies the appropriate condition, it goes to the left,
otherwise it goes to the right. In this way, xnew ends up in a terminal node. The predicted
class at the terminal node is the most popular class among the training data at the node.
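As a concrete illustration of growing and using a single classification tree, the short R sketch below uses the rpart package (not part of this report's software; shown purely for illustration) to fit a Gini-based tree and obtain both the predicted class and the class proportions of the terminal node.

    # Illustrative sketch only; rpart is not the software used in this report.
    library(rpart)

    # Fit a classification tree with the Gini split criterion
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 parms = list(split = "gini"))

    xnew <- iris[1, 1:4]                 # treat this row as a "new" observation
    predict(fit, xnew, type = "class")   # most popular class in its terminal node
    predict(fit, xnew, type = "prob")    # class proportions in that terminal node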
1.4 Random Forests

The Random Forest method was introduced by Leo Breiman in 2001 [1] and is a very
useful tool for machine learning. It is a combination of tree predictors such that each tree
depends on the information of a bootstrap sample from the original data (training set). It
can be used for classification and regression. Here, we are only considering Classification
Random Forests.
Trees like those described in Section 1.3 are used in Random Forests as described in the following. The parameters ntree and m (the number of randomly chosen predictors at each node) must be chosen. The value of ntree, the number of trees in the forest, can be chosen as large as desired. There is no penalty, in terms of fit, for choosing ntree too large, but the fit may be poor if it is too small. The default value of ntree in the R implementation is 500. The value of m, which controls the number of randomly chosen predictors at each node (as described in Section 1.3), is usually taken to be an integer close to $\sqrt{M}$.
To construct the forest, suppose we have a training set $(x_1, y_1), \ldots, (x_N, y_N)$ and let $A = \{1, \ldots, N\}$ denote the set of indices of the training observations. For each tree $t = 1, \ldots, ntree$:
1. Take a bootstrap sample from the data (sample N times, at random, with replacement).

2. Denote the set of observations appearing in the bootstrap sample for tree t by Bt.

3. Denote the set of observations not appearing in the bootstrap sample by Btc = A \ Bt.
4. Fit a classification tree (Section 1.3) to the bootstrap sample, splitting until all the
observations in each terminal node come from the same class (“pure”).
5. Find the predicted class at each terminal node (the class of members of Bt in the
node).
The predicted class for observations in the training set is the most frequent class in the trees
for which the observation is a member of Btc . This process is often described as “voting”
the trees. For new observations, all the trees in the forest are used in the voting. The
observations in Btc are said to be "out-of-bag" for tree t. The "out-of-bag" error rate is the error rate of the Random Forest predictions for the training set. These predictions are also called the out-of-bag predictions.
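The following minimal R sketch (our illustration, not the code written for this report) fits a classification Random Forest with the randomForest package on data simulated from the XOR model of Section 3.1 and inspects the out-of-bag error rate and the out-of-bag ("voted") predictions described above.

    library(randomForest)

    # Training data simulated from the XOR model of Section 3.1 (illustrative only)
    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y <- factor(as.numeric(train$x1 * train$x2 > 0))

    rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500)

    rf$err.rate[rf$ntree, "OOB"]   # out-of-bag error rate after all 500 trees
    head(rf$predicted)             # out-of-bag (voted) class predictions for the training set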
Let the proximity between the ith and jth observations be the proportion of the time observations i and j are in the same terminal node, where the proportion is taken over the trees for which both observations are out-of-bag:
\[
prox(i, j) = \frac{\sum_{t=1}^{ntree} I(i \in B_t^c)\, I(j \in B_t^c)\, I(q_t(i) = q_t(j))}{\sum_{t=1}^{ntree} I(i \in B_t^c)\, I(j \in B_t^c)},
\]
where $I$ denotes the indicator function and $q_t(i)$ denotes the terminal node of tree $t$ into which observation $i$ falls. To find the proximity between an observation from the training set and one from the test set, we use
\[
prox(i, j) = \frac{1}{ntree} \sum_{t=1}^{ntree} I(q_t(i) = q_t(j)).
\]
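In the randomForest package, proximities can be requested when the forest is built. A small sketch (assumed simulated data, for illustration): with oob.prox = TRUE the training-set proximities are accumulated only over trees for which both observations are out-of-bag, matching the first formula above.

    library(randomForest)

    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y <- factor(as.numeric(train$x1 * train$x2 > 0))

    rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500,
                       proximity = TRUE, oob.prox = TRUE)

    dim(rf$proximity)        # N x N matrix of prox(i, j)
    rf$proximity[1:3, 1:3]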
CHAPTER 2
PROBABILITY ESTIMATION METHODS
Let $p_{i,k} = P(Y = k \mid X = x_i)$. We compare four methods of estimating $p_{i,k}$. The first method, based on voting the trees, is the "standard" or "default" method. The second method is a regression method introduced by Malley et al. [10]. The third and fourth methods, labeled "Proximity Weighting" and "Out-of-bag," are new methods developed for this report.
2.1 Voting
The voting estimate of $p_{i,k}$ is the proportion of trees in the forest that predict class $k$ when $x_i$ is passed down the tree. For training data, the proportion is only taken over trees for which $x_i$ is out-of-bag.
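A brief sketch of how the voting estimates can be obtained in R (an illustration on an assumed simulated data set, not the report's code): for the training data the out-of-bag vote proportions are stored in rf$votes, and for new data predict(..., type = "prob") returns the proportion of trees voting for each class.

    library(randomForest)

    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y <- factor(as.numeric(train$x1 * train$x2 > 0))

    rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500)

    head(rf$votes)                       # out-of-bag vote proportions for the training set
    xnew <- data.frame(x1 = 0.4, x2 = -0.3)
    predict(rf, xnew, type = "prob")     # vote proportions over all trees for a new observation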
2.2 Probability Machines

This method only works for two-class problems (K = 2). The two classes are denoted
0 and 1 and Random Forests is run in regression mode. Random Forests for regression
are analogous to Random Forests for classification, except that the splits are chosen to
minimize the mean squared error instead of Gini, the predictions for a tree are the average
of the y-values of Bt in the nodes, and the predicted values for the forest are obtained
by averaging over the trees for which the observation is in Btc. Once the Random Forests regression predictions $\hat{y}^{reg}_1, \ldots, \hat{y}^{reg}_N$ are obtained, we set $\hat{p}_{i,1} = \hat{y}^{reg}_i$ and $\hat{p}_{i,0} = 1 - \hat{y}^{reg}_i$.
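A minimal sketch of the regression ("probability machine") method, assuming a simulated two-class data set: the 0/1 response is left numeric so that randomForest runs in regression mode, and the out-of-bag regression predictions are used directly as the estimates of $p_{i,1}$.

    library(randomForest)

    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y01 <- as.numeric(train$x1 * train$x2 > 0)   # numeric 0/1, not a factor

    # randomForest may warn that the response has few unique values;
    # regression mode is intended here.
    rf_reg <- randomForest(y01 ~ x1 + x2, data = train, ntree = 500)

    p1_hat <- rf_reg$predicted   # out-of-bag regression predictions = estimates of p_{i,1}
    p0_hat <- 1 - p1_hat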
2.3 Proximity Weighting

The motivation for this method comes from the fact that Random Forests can be thought of as a "nearest neighbor" classifier [4], and proximities are believed to correspond to the neighborhoods of the nearest neighbor classifier. The proximity weighting method estimates the probabilities by a proximity-weighted average of the class indicators:
\[
\hat{p}^{prox}_{i,k} = \frac{\sum_{j \in A,\, j \neq i} prox(i, j)\, I(y_j = k)}{\sum_{j \in A,\, j \neq i} prox(i, j)}.
\]
The weighted average means that observations that are “close” have more impact on
the probability estimate than observations that are not “close” (where “close” is defined
using the Random Forest proximities). For more information about proximities, see A. Cutler et al. [2].
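The proximity weighting estimates can be computed directly from the proximity matrix. The sketch below is one possible implementation (our illustration, on an assumed simulated data set), following the formula above.

    library(randomForest)

    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y <- factor(as.numeric(train$x1 * train$x2 > 0))

    rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500,
                       proximity = TRUE, oob.prox = TRUE)

    prox <- rf$proximity
    diag(prox) <- 0                                 # exclude j = i from the sums
    p_hat <- sapply(levels(train$y), function(k) {
      (prox %*% (train$y == k)) / rowSums(prox)     # proximity-weighted proportion of class k
    })
    head(p_hat)                                     # N x K matrix of estimates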
2.4 Out-of-bag Method

The motivation for this method comes from the fact that when fitting a classifier using a single tree, the terminal nodes are not pure, and the relative frequency of class k in a terminal node can be used to estimate the probability of class k for observations in that terminal node. In Random Forests, the terminal nodes are pure, so the information from members of Bt is not useful. However, the out-of-bag observations (those in Btc) do give useful information about the class probabilities.

For tree t, if a terminal node q contains members of Btc, we use the relative frequency of class k among those out-of-bag members as the node estimate $\hat{p}_{q,k}$, and we denote the set of nodes that do not contain members of Btc by $Q_{0t}$. The out-of-bag estimates of the probabilities are found by averaging over the nodes that contain out-of-bag observations:
\[
\hat{p}^{oob}_{i,k} = \frac{\sum_{t=1}^{ntree} I\!\left(q_t(i) \notin Q_{0t}\right) \hat{p}_{q_t(i),k}}{\sum_{t=1}^{ntree} I\!\left(q_t(i) \notin Q_{0t}\right)}.
\]
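One way to compute these estimates with the randomForest package is sketched below. This is an illustration only (it is not the code written for this report, which used Rcpp): keep.inbag = TRUE records which observations are in-bag for each tree, and predict(..., nodes = TRUE) returns the terminal node of each observation in each tree, so the node-level out-of-bag class frequencies can be averaged as in the formula above.

    library(randomForest)

    set.seed(1)
    train <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
    train$y <- factor(as.numeric(train$x1 * train$x2 > 0))
    N <- nrow(train)
    K <- nlevels(train$y)

    rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500, keep.inbag = TRUE)

    # terminal node id of each training observation in each tree (N x ntree),
    # and an indicator of which observations are out-of-bag for each tree
    nodes <- attr(predict(rf, train, nodes = TRUE), "nodes")
    oob   <- rf$inbag == 0

    num <- matrix(0, N, K)
    den <- numeric(N)
    for (t in 1:rf$ntree) {
      for (q in unique(nodes[, t])) {
        in_node     <- nodes[, t] == q
        oob_in_node <- in_node & oob[, t]
        if (!any(oob_in_node)) next      # node is in Q_0t: no out-of-bag members
        p_node <- prop.table(table(factor(train$y[oob_in_node], levels = levels(train$y))))
        num[in_node, ] <- num[in_node, ] + rep(p_node, each = sum(in_node))
        den[in_node]   <- den[in_node] + 1
      }
    }
    p_oob <- num / den     # N x K matrix of out-of-bag probability estimates
    head(p_oob)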
2.5 Software
We coded everything in R and used the R packages "randomForest" [9] and "Rcpp" [3].
CHAPTER 3
SIMULATION
The simulation designs are used to represent scenarios where a binary response variable is predicted from continuous predictors. We used five simulation models in order to compare the probability estimation methods. The first one is the XOR model. The second and the third are from Mease et al. [11], who considered a simple two-dimensional circle model and a 10-dimensional model. The fourth is a bivariate normal mixture model, and the last one is a mixture of normal clusters taken from The Elements of Statistical Learning [5]. We also plot the training data and the true probabilities for some of the models.
3.1 XOR Model (XOR1 and XOR2)

Let $X = (x_1, x_2)$ be uniformly distributed on the square $[-1, 1]^2$ and let $y$ be a categorical variable with values 0 or 1 defined as
\[
y = \begin{cases} 0 & x_1 x_2 \le 0; \\ 1 & x_1 x_2 > 0. \end{cases}
\]
The conditional probability of $Y = 1$ given $X$ is
\[
P(y = 1 \mid X = x) = \begin{cases} 0 & x_1 x_2 \le 0; \\ 1 & x_1 x_2 > 0. \end{cases}
\]
A second case of the XOR model has 10% overlap, based on the model described above. We sample 10% of the observations from class 0 without replacement and switch their labels with those of the same number of observations from class 1. The conditional probability of $Y = 1$ given $X$ is then
\[
P(y = 1 \mid X = x) = \begin{cases} 0.1 & x_1 x_2 \le 0; \\ 0.9 & x_1 x_2 > 0. \end{cases}
\]
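For illustration, one way to generate a training sample from the XOR model in R is sketched below; the overlap argument implements the 10% label swap that produces the second case (the function name and defaults are our own).

    # Sketch: simulate the XOR model; overlap = 0.1 gives the 10% overlapping case.
    sim_xor <- function(N, overlap = 0) {
      x <- matrix(runif(2 * N, -1, 1), ncol = 2)
      y <- as.numeric(x[, 1] * x[, 2] > 0)
      if (overlap > 0) {
        swap0 <- sample(which(y == 0), size = round(overlap * sum(y == 0)))
        swap1 <- sample(which(y == 1), size = length(swap0))
        y[c(swap0, swap1)] <- 1 - y[c(swap0, swap1)]   # switch the two groups of labels
      }
      data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))
    }

    set.seed(1)
    xor1 <- sim_xor(500)                  # XOR without overlap
    xor2 <- sim_xor(500, overlap = 0.1)   # XOR with 10% overlap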
3.2 2-dimensional Circle Model (2D Circle)

Let X be a two-dimensional random vector uniformly distributed on the square $[0, 50]^2$. The model is constructed so that the contours of $p(x)$ are concentric circles with center (25, 25): the conditional probability of $Y = 1$ is a function of $r(x)$, the distance from $x$ to the point (25, 25) in $R^2$. This is called the 2-Dimensional Circle model. The right panel of Figure 3.2 shows these probabilities for a hold-out sample that is a grid of 2000 evenly spaced points on $[0, 50]^2$. The greyness of the plot shows the probabilities: the lighter the color, the smaller the probability.

Figure 3.2. Training data (left) and true probabilities of being in class 1 (right) for the 2-Dimensional Circle Model
3.3 10-dimensional Model (Friedman)

Let X be a 10-dimensional random vector distributed according to $N_{10}(0, I)$ and let Y be a categorical variable with values 0 or 1 depending on X. The conditional probability of $Y = 1$ is defined through the log-odds as follows:
\[
\log \frac{p(x)}{1 - p(x)} = r\left[\,1 - x_1 + x_2 - x_3 + x_4 - x_5 + x_6\,\right] \sum_{j=1}^{6} x_j
\]
We choose r = 10, which mimics Friedman's simulation model exactly [6].
3.4 Bivariate Normal Model (Binorm)

Let X be a bivariate random vector whose distribution is a mixture of two bivariate normal distributions, one for each class, with means $\mu_0$ and $\mu_1$ and covariance matrices $\Sigma_0$ and $\Sigma_1$. Let Y be the class label. The conditional probability of $Y = 1$ given $X = x$ can be written as
\[
P(Y = 1 \mid X = x) = \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)},
\]
where $\pi_0$ and $\pi_1$ are the probabilities that $Y = 0$ and $Y = 1$ respectively, and $f_0(x)$ and $f_1(x)$ are the corresponding bivariate normal density functions.
We simulate 3 different representative cases for this model, labeled binorm1, binorm2 and binorm3. Binorm1 has class 0 with mean $\mu_0 = (0, 0)^T$ and covariance $\Sigma_0 = \begin{pmatrix} 1 & 0.55 \\ 0.55 & 1 \end{pmatrix}$, and class 1 with mean $\mu_1 = (0, 2.5)^T$ and covariance $\Sigma_1 = \begin{pmatrix} 1 & -0.55 \\ -0.55 & 1 \end{pmatrix}$. Binorm2 has class 0 with mean $\mu_0 = (0, 0)^T$ and covariance $\Sigma_0 = \begin{pmatrix} 1 & 0.47 \\ 0.47 & 1 \end{pmatrix}$, and class 1 with mean $\mu_1 = (1, 1)^T$ and covariance $\Sigma_1 = \begin{pmatrix} 1 & -0.47 \\ -0.47 & 1 \end{pmatrix}$. Binorm3 has class 0 with mean $\mu_0 = (0, 0)^T$ and covariance $\Sigma_0 = \begin{pmatrix} 1 & -0.8 \\ -0.8 & 1 \end{pmatrix}$, and class 1 with mean $\mu_1 = (1, 1)^T$ and covariance $\Sigma_1 = \begin{pmatrix} 1 & -0.8 \\ -0.8 & 1 \end{pmatrix}$. Class 0 is generated with probability $\pi_0 = 2/3$ for all three cases.
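A sketch of how the Binorm1 case can be simulated in R, together with the true conditional probabilities from the mixture formula above. The MASS and mvtnorm packages are used here for convenience; they are not part of the report's software, and the function name is our own.

    library(MASS)      # mvrnorm: multivariate normal sampling
    library(mvtnorm)   # dmvnorm: multivariate normal density

    sim_binorm1 <- function(N, pi0 = 2/3) {
      mu0 <- c(0, 0);   Sigma0 <- matrix(c(1,  0.55,  0.55, 1), 2, 2)
      mu1 <- c(0, 2.5); Sigma1 <- matrix(c(1, -0.55, -0.55, 1), 2, 2)
      y <- rbinom(N, 1, 1 - pi0)
      x <- t(sapply(y, function(yi)
        if (yi == 0) mvrnorm(1, mu0, Sigma0) else mvrnorm(1, mu1, Sigma1)))
      # true P(Y = 1 | X = x) from the mixture formula
      p1 <- (1 - pi0) * dmvnorm(x, mu1, Sigma1) /
            (pi0 * dmvnorm(x, mu0, Sigma0) + (1 - pi0) * dmvnorm(x, mu1, Sigma1))
      data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y), p1 = p1)
    }

    set.seed(1)
    binorm1 <- sim_binorm1(500)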
3.5 Normal Cluster Mixtures (Clusters)

For this model, 10 means $\{\mu_{01}, \ldots, \mu_{0,10}\}$ for class 0 are generated from a bivariate normal distribution $N_2((1, 0)^T, I)$, and 10 more means $\{\mu_{11}, \ldots, \mu_{1,10}\}$ for class 1 are drawn from another bivariate normal distribution $N_2((0, 1)^T, I)$. We generate a sample with N observations in total, with sample sizes $n_0$ and $n_1$ for class 0 and class 1. Then, for each observation, a mean $\mu$ is picked at random with probability 1/10 from $\{\mu_{01}, \ldots, \mu_{0,10}\}$ for class 0 or from $\{\mu_{11}, \ldots, \mu_{1,10}\}$ for class 1, and the observation is drawn from a bivariate normal distribution with the selected mean and covariance $\Sigma = I/5$. Y is the binary response variable, and the true conditional probabilities are given by the formula in the bivariate normal model above; however, $f_0(x)$ and $f_1(x)$ are not directly given, since each is the average of the pdf's of 10 different bivariate normal distributions.

In our simulation, class 0 is generated with probability $\pi_0 = 2/3$.
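The cluster mixture can be simulated along the following lines (an illustrative sketch with our own function name; MASS::mvrnorm is assumed for the bivariate normal draws).

    library(MASS)   # mvrnorm

    sim_clusters <- function(N, pi0 = 2/3) {
      mu0 <- mvrnorm(10, c(1, 0), diag(2))   # 10 component means for class 0
      mu1 <- mvrnorm(10, c(0, 1), diag(2))   # 10 component means for class 1
      y <- rbinom(N, 1, 1 - pi0)
      x <- t(sapply(y, function(yi) {
        mu <- if (yi == 0) mu0[sample(10, 1), ] else mu1[sample(10, 1), ]
        mvrnorm(1, mu, diag(2) / 5)          # covariance I/5 around the chosen mean
      }))
      data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))
    }

    set.seed(1)
    clusters <- sim_clusters(500)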
3.6 Method to Measure the Performance of Class Probability Estimators

3.6.1 Mean Squared Loss Function

We choose the mean squared loss function to quantify the accuracy of the probability estimates, comparing the estimated probabilities with the true probabilities, which are known for the simulated data.
3.6.2 Misclassification Error Rate

Another way we measure the accuracy of the probability estimates for the simulations is the misclassification error rate. We obtain classification results from the probability estimates, i.e., an observation is assigned to the class with the larger estimated probability.
3.7 Simulation Results

The trees are built on the simulated training data with sample size N. There are 500 trees in total for each case, i.e., ntree = 500. In our early experiments (Section 3.7.3), changing ntree did not change the squared loss much for any method; therefore, we only show the results for ntree = 500. We use test data from 100 independent samples, each with sample size 1000, to evaluate the probability estimates. The final mean squared loss is the average of the losses over the 100 simulations, and the misclassification error rate is computed in the same way.

3.7.1 Results for Mean Squared Loss

The values in the tables are all on a $10^{-3}$ scale, i.e., the actual mean squared loss is the value in the table times 0.001. All the simulated data sets have only two classes, i.e., a response variable with 2 categories. The bolded value is the smallest mean squared loss in a row.
3.7.2 Results for Misclassification Error Rate

The results given by the misclassification error rate are a little different from the ones measured by the mean squared loss function. See Chapter 5 for detailed conclusions.
3.7.3 Normal Cluster Mixtures with Different Ntrees

We set N = 500 as the sample size of the training data set used to build the trees. We again use test data from 100 independent samples, each with sample size 1000, to evaluate the probability estimates. The final mean squared loss is the average of the losses over the 100 simulations.
Table 3.7. Mean Squared Loss for Normal Cluster Mixtures with Different Ntrees
ntree vote prox regrf oob
100 139.75 150.10 133.81 118.46
500 135.38 144.36 130.67 115.27
1000 130.79 140.17 126.08 111.14
1500 134.41 144.23 129.58 113.97
2000 135.95 146.20 131.29 116.67
Table 3.8. Misclassification Error Rate for Normal Cluster Mixtures with Different Ntrees
ntree vote prox regrf oob
100 0.32301 0.30157 0.31923 0.30653
500 0.18684 0.17452 0.18284 0.14952
1000 0.29892 0.28013 0.29689 0.28419
1500 0.3092 0.28997 0.30695 0.29367
2000 0.29299 0.27547 0.29019 0.27854
CHAPTER 4
DATA EXAMPLES
4.1 Brier Score

The Brier score is calculated using the original definition, as given by Wikipedia [13]. It is defined as
\[
BS = \frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{K} \left( f_{ti} - o_{ti} \right)^2,
\]
where N is the number of cases and K is the number of classes. Here $f_{ti}$ is the estimated probability of class $i$ for case $t$, and $o_{ti}$ is the actual outcome (0 if class $i$ does not occur for case $t$ and 1 if it occurs). For binary response variables, there is an alternative definition that sums over only one of the two classes.
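The multi-class Brier score above is straightforward to compute in R; the small helper below is our own sketch. It assumes prob is an N x K matrix of estimated probabilities whose columns are named by the class labels (as returned by predict(..., type = "prob") or rf$votes) and y is the vector of observed classes.

    brier_score <- function(prob, y) {
      y <- factor(y, levels = colnames(prob))
      outcome <- model.matrix(~ y - 1)        # N x K matrix of 0/1 outcomes o_ti
      mean(rowSums((prob - outcome)^2))       # (1/N) * sum over cases and classes
    }

With this definition, a perfect set of probability forecasts gives a score of 0, and the worst possible forecast for a two-class problem gives 2.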
4.2 10-fold Cross Validation

Cross validation is a common way to evaluate the performance of a model or algorithm. The general idea of cross validation is to divide the data into two parts: one is the training data set and is used to build the model; the other is the testing data set and is used to evaluate it.

10-fold cross validation is the most commonly used form in machine learning. The original data set is randomly separated into 10 subsets of equal sample size N/10. One subset is chosen as the testing set and the other 9 form the training set. Each time we choose a different testing set and repeat the procedure, so after 10 repetitions all the observations are tested exactly once. The final estimate is obtained by combining the results from the 10 testing sets [12].
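Combining the two ideas, a 10-fold cross validation of the voting estimates might look like the following sketch (illustrative simulated data; it reuses the brier_score helper sketched in Section 4.1).

    library(randomForest)

    set.seed(1)
    dat <- data.frame(x1 = runif(500, -1, 1), x2 = runif(500, -1, 1))
    dat$y <- factor(as.numeric(dat$x1 * dat$x2 > 0))

    fold <- sample(rep(1:10, length.out = nrow(dat)))   # random fold assignment
    scores <- sapply(1:10, function(f) {
      rf <- randomForest(y ~ x1 + x2, data = dat[fold != f, ], ntree = 500)
      # Brier score of the voting estimates on the held-out fold
      brier_score(predict(rf, dat[fold == f, ], type = "prob"), dat$y[fold == f])
    })
    mean(scores)   # cross-validated Brier score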
4.3 Missing Values

There are many ways to deal with missing values; we replace them in two ways:
4.3.1 Na.roughfix
Na.roughfix is a function in the randomForest R package which replaces the NAs with column medians if the variables are numeric, or with the most frequent levels if the variables are categorical.
4.3.2 RfImpute
There is another function to deal with missing data, called "rfImpute," in the randomForest R package. The algorithm starts by imputing the NAs using na.roughfix. The data set then has no missing values, and randomForest is run on it to obtain the proximity matrix, which is used to update the imputation of the NAs. For continuous variables, the imputed value is the weighted average of the non-missing values, where the weights are the proximities. For categorical variables, the imputed value is the category with the largest average proximity.
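For illustration, both imputation approaches can be called as follows (a sketch on the built-in iris data with some artificially introduced NAs; this is not the report's code).

    library(randomForest)

    data(iris)
    iris_na <- iris
    iris_na[sample(150, 10), "Sepal.Width"] <- NA     # introduce some missing values

    filled1 <- na.roughfix(iris_na)                   # column medians / most frequent levels
    filled2 <- rfImpute(Species ~ ., data = iris_na)  # proximity-based imputation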
4.4 Results for Real Data

The methods are compared on 23 datasets from the UCI repository, using 10-fold cross validation and the Brier score. We used ntree = 500 for all the datasets. The regression method only works for a binary response variable; therefore, we put NAs in the tables when there are more than 2 classes.
CHAPTER 5
CONCLUSIONS
For the simulations, sample size has some effect on the performance of each method.
For the third case of the bivariate normal model: when N = 100, vote gives the smallest
mean squared loss; when N = 500, the regression method makes the mean squared loss
smallest; when N = 1000, the proximity weighting method turns out to be the best. For
the Friedman model, vote is the best probability estimation method when the sample size is
100 and 500, however, when the sample size increases to 1000, proximity weighting performs
better.
Regardless of the sample size effects, the out-of-bag method performs better than the other methods for the XOR model with 10% overlap, the 2-dimensional circle model, two of the bivariate normal models, and the normal cluster mixtures. Proximity weighting does best in the XOR model without overlap, and it does better than voting or regression in the 2-dimensional circle model.
The misclassification error rate criterion gives somewhat different results. When N = 100 and 500, the out-of-bag method is slightly better than the proximity weighting method for the XOR model without overlap; when N = 500 and 1000, the proximity method, rather than the out-of-bag method, gives the smallest misclassification error rate for the 2-dimensional circle model and Binorm1; and the proximity weighting method performs best, instead of the out-of-bag method, for Binorm2 when N = 100 and for the normal cluster mixtures when N = 1000.
Considering only the misclassification error rate, the out-of-bag method usually gives the smallest misclassification error rate. However, as the sample size of the training data set increases, the proximity method tends to be better. Both methods clearly perform better than the other methods; however, neither of them beats the other consistently.
Testing all the methods on real datasets, however, no method did consistently better than all the others. When we replace missing values with the rfImpute method, proximity weighting does best on more of the data sets. When we replace missing values with the na.roughfix method, none of the three methods vote, proximity weighting, and oob is consistently better than the others. The regression method does not perform well when it is applicable, no matter which method we use to deal with missing data.
CHAPTER 6
DISCUSSION AND FUTURE WORK
The two new methods, proximity weighting and the out-of-bag method, outperform the current methods in almost all the simulations. However, in the experiments on the real data sets, the out-of-bag method does not perform as well as it does in the simulations, and proximity weighting is not significantly better than voting, based on the 23 data sets from the UCI machine learning repository. I would like to experiment on more data sets to obtain a more thorough comparison.
In the simulations, sample size has a very significant influence on the third case of the bivariate normal model according to the mean squared loss measurement. I would like to further explore why this happens by running more simulations for the binorm model and thinking about it theoretically. Sample size has even more influence according to the misclassification error rate measurement, and it is worthwhile to find out why. I would also like to figure out why proximity weighting performs better than the out-of-bag method in the XOR model without overlap. The 10-dimensional model is another simulation I would like to do more theoretical research on, to figure out why the out-of-bag method does not perform as well there.
The current version of the code for the out-of-bag method only works for numeric predictors, so we have to convert non-numeric predictors into numeric ones. Hence, we would like to extend the code to handle categorical predictors directly. One interesting feature of the real data comparisons is that the choice of missing value imputation method affects which probability estimation method performs best.
REFERENCES
[3] D. Eddelbuettel and R. Francois, Rcpp: Seamless R and C++ Integration, http://cran.r-project.org/web/packages/Rcpp/index.html, 2013.
library/randomForest/html/rfImpute.html.
randomForest/html/na.roughfix.html.
[9] ———, randomForest: Breiman and Cutler's Random Forests for Classification and Regression, http://cran.r-project.org/web/packages/randomForest/index.html, 2012.
[11] D. Mease, A. J. Wyner, and A. Buja, Boosted classification trees and class probability/quantile estimation, Journal of Machine Learning Research, 8 (2007).