Ensemble learning from theory to practice
Lecture 2: Bagging and Random Forests
Myriam Tami
[email protected]
January - March, 2023
1 Motivations
2 Bagging
3 Random forests
4 Extremely randomized trees
Reminder
In the first course, we saw
Goal of ensemble learning methods:
combine multiple weak learners in order to improve robustness and prediction performance
Decision trees are an example of weak learners
Figure: source: Hands-On Machine Learning with Scikit-Learn and TensorFlow, A. Géron
Why combine? An intuition
Binary classification: Y ∈ {−1, 1}, input variables X ∈ R^d
We have a set of B independent initial classification methods (f_b)_{1≤b≤B} such that, for all b,
P{f_b(X) ≠ Y} = ε, with ε < 1/2
By aggregating these methods and predicting
F(X) = sign( Σ_{b=1}^{B} f_b(X) )
By the Hoeffding inequality (not on the course program), the probability of error of F satisfies
P{F(X) ≠ Y} ≤ exp( −(1/2) B (2ε − 1)² )
which tends towards zero exponentially fast in B
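As a rough numeric illustration of this bound (the error rate ε = 0.3 and the values of B below are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical common error rate of the independent base classifiers (< 1/2)
eps = 0.3

# Hoeffding upper bound on the error of the majority vote as B grows
for B in (1, 10, 50, 100):
    bound = np.exp(-0.5 * B * (2 * eps - 1) ** 2)
    print(f"B = {B:3d}  ->  P(F(X) != Y) <= {bound:.4f}")
```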
Why combine? Another intuition
Assume that X_1, …, X_B are B i.i.d. random variables with mean µ = E[X_1] and variance σ² = V[X_1] = E[(X_1 − µ)²]
Consider the new variable/estimator (empirical mean)
X̄ := (1/B) Σ_{b=1}^{B} X_b
The expectation does not change: E[X̄] = µ (no bias, i.e., E[X̄ − µ] = 0; the expected value of the estimator matches the parameter)
The variance is reduced thanks to the independence of the random variables: V[X̄] = σ²/B
This is of interest for decision trees: we have seen (cf. Course 1) that large, unpruned trees have a small bias but a large variance
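A minimal Monte Carlo check of this variance reduction (the Gaussian distribution and the constants below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_repeats, sigma = 50, 100_000, 2.0

# Draw n_repeats independent samples of B i.i.d. variables and average each one
X = rng.normal(loc=0.0, scale=sigma, size=(n_repeats, B))
X_bar = X.mean(axis=1)

print("empirical   V[X_bar]:", X_bar.var())      # close to sigma**2 / B
print("theoretical sigma^2/B:", sigma ** 2 / B)
```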
Bagging (Bootstrap AGGregatING)
Introduced by Breiman [Breiman, 1996]
Based on two key points: bootstrap and aggregation
We know that aggregating independent initial predictive methods (base learners) leads to a significant reduction in prediction error and variance
⇒ We want initial methods that are as independent as possible
Naive idea: train our "base learners" (e.g., CART) on disjoint subsets of observations from the training set
Problem: the training set is not infinite → the "base learners" will
have too little data and poor performance
That is where bootstrapping is useful
Bagging idea
Bagging creates training subsets using bootstrap sampling [Tibshirani and Efron, 1993]
Bootstrapping
To create a new "base learner" f_b,
1 we randomly draw, with replacement, a dataset D_b of n_train observations from the training set
2 we learn the method (e.g., CART) on it
→ the "base learner" f_b is obtained
Note: each Db has the same size as the original training set
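A minimal sketch of the bootstrap draw of step 1 (the helper name bootstrap_sample and the toy data are purely illustrative):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw n_train observations with replacement from the training set."""
    n_train = X.shape[0]
    idx = rng.integers(0, n_train, size=n_train)   # indices drawn with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
X_b, y_b = bootstrap_sample(X, y, rng)             # D_b, same size as the training set
```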
Bagging idea
Bagging
Consists in
1 Bootstrapping B times, producing
▸ B bootstrap datasets D_b
▸ then B predictors ("base learners") f_b, one for each of these datasets
2 Aggregating the predictors
▸ In the regression case (average),
f_bag(x) = (1/B) Σ_{b=1}^{B} f_b(x)   (1)
▸ In the classification case (majority vote over trees),
f_bag(x) = argmax_{1≤c≤C} (1/B) Σ_{b=1}^{B} 1{f_b(x)=c}   (2)
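A minimal regression sketch of this procedure, written with scikit-learn's CART as base learner; the function names and constants are illustrative (in practice, sklearn.ensemble.BaggingRegressor implements the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagging(X, y, B=100, seed=0):
    """Fit B CART regressors, each on its own bootstrap sample D_b."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap indices
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagging(trees, X):
    """Aggregate by averaging the B predictions, as in Eq. (1)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```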
Bagging diagram
Figure: Illustration of the bagging principle (with L_n^{Θ_b} = D_b, ĥ(·, Θ_b) = f̂_b and ĥ_bag = f̂_bag)
Random forests
Method introduced by Leo Breiman in 2001 [Breiman, 2001]
Based on older ideas: bagging [Breiman, 1996] and CART decision trees [Breiman et al., 1984]
Convergence proofs in [Biau et al., 2008]
The random forests method belongs to the family of ensemble methods
Random forests (notations)
D = {(x_1, y_1), …, (x_n, y_n)} the learning set; each (x_i, y_i) is an independent realization of the random variables (X, Y)
X ∈ R^d the input variables; Y ∈ 𝒴 the output variable, with 𝒴 = R for regression and 𝒴 = {1, …, C} for classification
Goal: build a predictor f̂ : R^d → 𝒴
Random forests idea
{f̂_b(·, Θ_b), 1 ≤ b ≤ B} a set of decision tree predictors, where (Θ_b)_{1≤b≤B} characterises the b-th tree in terms of split variables, cut-points at each node and terminal-node values
Random forests predictor f̂ obtained by aggregating the set of trees
for regression,
f̂(x) := (1/B) Σ_{b=1}^{B} f̂_b(x, Θ_b)   (3)
for classification,
f̂(x) := argmax_{1≤c≤C} (1/B) Σ_{b=1}^{B} 1{f̂_b(x, Θ_b)=c}   (4)
Random forests
Random forests consist of growing a large number (e.g., 400) of randomly constructed decision trees and then aggregating them
In statistical terms, if the trees are decorrelated, this reduces the
variance of the predictions
Naïve example with 2 trees: averaging these two regression trees gives the prediction (24.7 + 23.3)/2 = 24
source: Arbres de décision et forêts aléatoires, P. Gaillard, 2014
The problem of correlation between trees
Bagging idea: aggregate many noisy but approximately unbiased¹ tree models to reduce the variance
However, there is necessarily some overlap between
bootstrapped datasets
⇒ the trees corresponding to each of them are correlated
Intuition: if the B trees f_b(x) are identically distributed, with variance σ² and pairwise correlation coefficient ρ = Corr(f_b(x), f_b′(x)) for all b ≠ b′, the variance of their mean is
V[f_bag(X)] = ρσ² + (1 − ρ)σ²/B   (5)
Thus, the variance cannot be shrunk below ρσ²
⇒ Disadvantage of bagging
¹ if sufficiently deep
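A small numeric sketch of Eq. (5), with illustrative values of ρ and B, showing that the variance of the average plateaus at ρσ²:

```python
sigma2 = 1.0  # variance of a single tree (illustrative)

for rho in (0.0, 0.3, 0.7):                               # pairwise correlation between trees
    for B in (10, 100, 1000):
        var_bag = rho * sigma2 + (1 - rho) * sigma2 / B   # Eq. (5)
        print(f"rho = {rho:.1f}  B = {B:4d}  V[f_bag] = {var_bag:.4f}")
```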
Creating weakly correlated trees
We saw the disadvantage of
Bootstrapping: rather than using all the data to build the trees, we randomly choose, for each tree, a subset (with possible repetitions) of the training data.
Let us introduce the improvement proposed by random forests: lowering the correlation between trees (without increasing the variance too much) through an additional randomization step,
a random choice of the candidate input features considered to split each node during the tree-growing process
Definition of Random forests
Definition (Random forests)
A random forest is a set of trees, each grown on a bootstrapped learning data set, where, before each split, a set of m ≤ d input variables (or features) is randomly selected as candidates for splitting
Note: m is the same for all the nodes of all the trees of the forest but, of course, the variables considered at each node for the choice of the best split change randomly
Algorithm of Random forests
1: Require: a dataset D = {(x_i, y_i)}_{1≤i≤n}, the size B of the ensemble, the number m of candidate features for splitting
2: for b = 1 to B do
3: Draw a bootstrap dataset Db of size n from the original training set D
4: Grow a random tree f̂b using the bootstrapped dataset:
5: repeat
6: for all terminal nodes do
7: Select m variables among d, at random
8: Pick the best variable and split-point couple among the m candidates
9: Split the node into two child nodes
10: end for
11: until the stopping criterion is met (e.g., minimum number of samples per node reached)
12: end for
return: the ensemble of B trees
Algorithm 1: Pseudo-code to build random forests for regression or classification
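A minimal sketch of Algorithm 1 built on scikit-learn's CART, where the max_features argument delegates the per-node random selection of m candidate features (lines 6-8) to the tree itself; the helper names and integer-coded class labels are assumptions, and in practice sklearn.ensemble.RandomForestClassifier wraps all of this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=400, m="sqrt", seed=0):
    """Grow B fully developed CARTs, each on a bootstrap sample,
    with m random candidate features considered at every split."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))       # line 3: bootstrap D_b
        tree = DecisionTreeClassifier(max_features=m)    # lines 6-8: random candidates
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_forest(forest, X):
    """Majority vote over the B trees, as in Eq. (4); assumes integer labels."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```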
Random forests in practice
Intuitively, reducing m will reduce the correlation between any pair of trees in the ensemble
⇒ reduce the variance of the average (cf. Eq. (5))
However, the corresponding hypothesis space will be smaller,
leading to an increased bias
Heuristics:
for regression, choose m = ⌊d/3⌋ and a minimum node size of 5
for classification, choose m = ⌊√d⌋ and a minimum node size of 1
For further information about random forests, you can refer to
[Hastie et al., 2009] (Chap. 15) that provides a bias-variance analysis
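These heuristics map directly onto scikit-learn's hyper-parameters; a hedged sketch, using min_samples_leaf as an approximation of the minimum node size:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification heuristic: m = floor(sqrt(d)), minimum node size of 1
clf = RandomForestClassifier(n_estimators=400, max_features="sqrt",
                             min_samples_leaf=1)

# Regression heuristic: m = floor(d / 3), minimum node size of 5
reg = RandomForestRegressor(n_estimators=400, max_features=1 / 3,
                            min_samples_leaf=5)
```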
OOB-error (Out-Of-Bag error)
OOB error (computed on the out-of-bag samples, i.e., the observations left out of each bootstrap sample) - the principle
To predict yi , we only aggregate the predictors f̂b (., Θb ) built on
bootstrap samples not containing (xi , yi )
for regression,
OOB-error = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
where ŷ_i is obtained by aggregating the f̂_b whose bootstrap samples D_b do not contain (x_i, y_i)
for classification,
OOB-error = (1/n) Σ_{i=1}^{n} 1{y_i ≠ ŷ_i}
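A short usage sketch of this estimate with scikit-learn (the synthetic dataset is illustrative); oob_score=True makes each observation be predicted only by the trees whose bootstrap sample did not contain it:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=400, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", rf.oob_score_)     # OOB-error = 1 - OOB accuracy
```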
OOB-error
Many advantages,
An OOB-error estimate is almost identical to that obtained by
N-fold cross-validation
Random forests can be fit in one sequence, with cross-validation
being performed along the way
Once the OOB-error stabilizes, the training can be terminated and the value of B is thereby obtained/tuned
Variable Importance (VI)
Random forests (RF) make it possible to rank the explanatory variables in order of importance for the prediction
In the RF framework, permutation importance indices are preferred to the total decrease of node impurity measures already introduced in Breiman et al. (1984)
Scikit-learn's default feature importance is based on the decrease of node impurity
Variable Importance based on impurity decrease
For one tree,
the Variable Importance (VI) of X_j is calculated as the sum of the decreases in error/impurity over the splits made on X_j
e.g., if X_j is used twice to split a node in the tree → you sum these two decreases in Gini index (or cross-entropy, etc.) to obtain its VI
The relative importance is the VI divided by the highest VI value
(normalization)
⇒ Values are bounded between 0 and 1
In the case of RF,
the decreases in impurity are averaged over the trees
Variable Importance based on impurity decrease
Pros,
Fast calculation
Easy to obtain via Scikit-learn: one command
feature_importances_
Cons,
Biased approach: it has a tendency to inflate the importance of
continuous features or high-cardinality categorical variables
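A short sketch of the one-command access mentioned above (the dataset choice is illustrative; note that scikit-learn normalizes the importances to sum to 1 rather than dividing by the largest value):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=400, random_state=0)
rf.fit(data.data, data.target)

# Mean decrease in impurity, averaged over the trees of the forest
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, vi in ranked[:5]:
    print(f"{name:25s} {vi:.3f}")
```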
Variable Importance based on permutation
This approach directly measures the importance of an input variable X_j by observing how randomly permuting its values (which preserves the marginal distribution of the variable while breaking its link with the output) influences model performance
Goal: measure the prediction strength of each variable
Figure: Illustration of the permutation of the values of one variable (source: Arbres CART et Forêts aléatoires - Importance et sélection de variables, R. Genuer and J-M. Poggi, 2016)
Variable Importance based on permutation
The process is the following,
1 Grow the RF on the learning set
2 Record the OOB-error E
3 Randomly permute the values of the j-th variable in these data
4 Pass this modified dataset to the RF again to obtain predictions
5 Compute the OOB-error on this modified dataset
6 The VI of X_j is the difference between the benchmark score E and the one obtained on the modified (permuted) dataset
⇒ The larger the increase in OOB error, the more important the variable
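In practice, scikit-learn's permutation_importance follows the same recipe but evaluates the score drop on a held-out set rather than on the OOB samples described above; a minimal sketch (dataset and constants are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X_tr, y_tr)

# Each feature is permuted n_repeats times; the VI is the mean drop in score
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```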
Variable Importance based on permutation
Pros,
Reasonably efficient
Reliable technique
No need to re-train the model at each modification of the dataset
Cons,
More computationally expensive than the default
feature_importances_
Permutation importance overestimates the importance of
correlated predictors [Strobl et al., 2008]
Anomaly detection
RFs are well suited to detecting outliers [Liu et al., 2008]
These are indeed quickly isolated in a separate leaf
The anomaly score of an observation x_i is determined approximately by the average length of the path from the root to the leaf containing x_i, over the trees of the forest
The shorter the path, the more likely the observation is atypical
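A minimal sketch of this idea with scikit-learn's IsolationForest (the data and the injected outlier are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 2)), [[8.0, 8.0]]])   # one obvious anomaly

iso = IsolationForest(random_state=0).fit(X)

# score_samples is low when the average path length is short, i.e. for anomalies
scores = iso.score_samples(X)
print("most anomalous index:", scores.argmin())            # expected: the last point
```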
Pros and cons of Random forests
Pros
no overfitting as the number of trees B grows
usually: better performance than decision trees
direct computation of the "Out-of-Bag" error: cross-validation not required
hyper-parameters (B, m) easy to tune
Cons
black box: difficult to interpret
slower training
Extremely randomized trees
Randomization can be pushed further with extremely randomized
forests
Method introduced by [Geurts et al., 2006]
It is an RF with two differences:
▸ m < d of the input variables are selected at random and, for each of these variables, a split-point is chosen at random
▸ The full learning set D is used to grow each tree (instead of a bootstrapped learning set (D_b)_{1≤b≤B})
Impact on correlation and bias
These two differences:
1 Using the full learning set
⇒ achieves a lower bias
But the price is an increased variance, which should be compensated by the randomization of split-points:
2 choosing the split-point at random as well
⇒ reduces the correlation between trees, and hence the variance of the ensemble average, more strongly
Algorithm of Extremely Randomized Forests
1: Require: a dataset D = {(x_i, y_i)}_{1≤i≤n}, the size B of the ensemble, the number m of candidate features for splitting
2: for b = 1 to B do
3: Grow a random tree using the original dataset D:
4: repeat
5: for all terminal nodes do
6: Select m variables among d, at random
7: for all sampled variables do
8: Select a split at random
9: end for
10: Pick the best variable and split-point couple among the m candidates
11: Split the node into two child nodes
12: end for
13: until the stopping criterion is met (e.g., minimum number of samples per node reached)
14: end for
return: the ensemble of B trees
Algorithm 2: Pseudo-code describing the Extremely Randomized Forest approach
Extremely Randomized Forest
Advantages of this approach,
Empirically, it often provides better results than RFs
Lower computational complexity compared to RFs (one chooses
the split among the m randomly drawn split-points)
Disadvantages
Black box: difficult to interpret
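A minimal comparison sketch with scikit-learn's implementation (the synthetic dataset is illustrative, and the scores it produces are not a general claim about which method wins); note that ExtraTreesClassifier uses bootstrap=False by default, i.e. each tree is grown on the full learning set, with split-points drawn at random:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

et = ExtraTreesClassifier(n_estimators=400, random_state=0)
rf = RandomForestClassifier(n_estimators=400, random_state=0)

print("Extra-Trees  :", cross_val_score(et, X, y, cv=5).mean())
print("Random forest:", cross_val_score(rf, X, y, cv=5).mean())
```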
References I
Biau, G., Devroye, L., and Lugosi, G. (2008).
Consistency of random forests and other averaging classifiers.
Journal of Machine Learning Research, 9(Sep):2015–2033.
Breiman, L. (1996).
Bagging predictors.
Machine learning, 24(2):123–140.
Breiman, L. (2001).
Random forests.
Machine learning, 45(1):5–32.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984).
Classification and regression trees.
CRC press.
Geurts, P., Ernst, D., and Wehenkel, L. (2006).
Extremely randomized trees.
Machine learning, 63(1):3–42.
Hastie, T., Tibshirani, R., and Friedman, J. (2009).
The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media.
Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008).
Isolation forest.
In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE.
References II
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008).
Conditional variable importance for random forests.
BMC bioinformatics, 9(1):307.
Tibshirani, R. J. and Efron, B. (1993).
An introduction to the bootstrap.
Monographs on statistics and applied probability, 57:1–436.