
Ensemble learning from theory to practice

Lecture 2: Bagging and Random Forests

Myriam Tami

[email protected]

January - March, 2023



1 Motivations

2 Bagging

3 Random forests

4 Extremely randomized trees


Reminder
In the first course, we saw:
Goal of ensemble learning methods: combine multiple weak learners in order to improve robustness and prediction performance
Decision trees are an example of weak learners

Figure: source: Hands-On Machine Learning with Scikit-Learn and TensorFlow, A. Géron


Why combine? An intuition

Binary classification: Y ∈ {−1, 1}, input variables X ∈ R^d
We have a set of B independent initial classification methods (f_b)_{1≤b≤B} such that, for all b,
\[ P\{f_b(X) \neq Y\} = \varepsilon \]
By aggregating these methods and predicting
\[ F(X) = \operatorname{sign}\left( \sum_{b=1}^{B} f_b(X) \right) \]
the Hoeffding inequality (not on the course program) bounds the probability of error of F:
\[ P\{F(X) \neq Y\} \leq \exp\left( -\tfrac{1}{2}\, B\, (2\varepsilon - 1)^2 \right) \]
which tends towards zero exponentially in B (provided ε < 1/2)
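To make this bound concrete, here is a small simulation (a sketch, not part of the original slides): it draws B independent classifiers that each err with probability ε and checks how often the sign-aggregated vote errs, alongside the Hoeffding bound above. The values of ε and B are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
eps = 0.4          # error rate of each base classifier (assumed < 1/2)
n_trials = 20_000  # Monte-Carlo repetitions

for B in [1, 5, 25, 101]:   # odd B avoids ties in the vote
    # each classifier is wrong independently with probability eps;
    # the aggregated vote errs when more than half of them are wrong
    wrong_votes = rng.binomial(B, eps, size=n_trials)
    vote_error = np.mean(wrong_votes > B / 2)
    hoeffding = np.exp(-0.5 * B * (2 * eps - 1) ** 2)
    print(f"B={B:4d}  empirical error={vote_error:.4f}  Hoeffding bound={hoeffding:.4f}")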



Why combine? Another intuition

Assume that X_1, ..., X_B are B i.i.d. random variables of mean µ = E[X_1] and of variance σ² = V[X_1] = E[(X_1 − µ)²]
Consider the new variable/estimator (empirical mean)
\[ \bar{X} := \frac{1}{B} \sum_{b=1}^{B} X_b \]
The expectation does not change: E[X̄] = µ (no bias, i.e., E[X̄ − µ] = 0: the expected value of the estimator matches the parameter)
The variance is reduced thanks to the decorrelation of the random variables: V[X̄] = σ²/B
This is of interest for decision trees: we have seen (cf. Course 1) that large and unpruned trees have a small bias but a large variance

Bagging (Bootstrap AGGregatING)

Introduced by Breiman [Breiman, 1996]


Based on two key points: bootstrap and aggregation
We know that the aggregation of independent initial predictive methods (base learners) leads to a significant reduction in prediction error and variance
⇒ Get initial methods as independent as possible
Naive idea: train our "base learners" (ex: CART) on disjoint subsets of observations of the training set
Problem: the training set is not infinite → the "base learners" will have too little data and poor performance
That is where bootstrapping is useful


Bagging idea

Bagging creates training subsets using bootstrap sampling [Tibshirani and Efron, 1993]
Bootstrapping
To create a new "base learner" f_b,
1 we randomly draw, with replacement, a dataset D_b of n_train observations from the training set
2 we learn the method (ex: CART) on it
→ the "base learner" f_b is obtained

Note: each D_b has the same size as the original training set


Bagging idea
Bagging
Consists of
1 Bootstrapping B times, producing
  ▶ B bootstrap datasets D_b
  ▶ then B predictors ("base learners") f_b, one for each of these datasets
2 Aggregating the predictors
  ▶ In the regression case (average),
\[ f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x) \qquad (1) \]
  ▶ In the classification case (majority vote over trees),
\[ f_{bag}(x) = \underset{1 \leq c \leq C}{\operatorname{argmax}} \; \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}_{\{f_b(x) = c\}} \qquad (2) \]
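As an illustration (a minimal sketch, not from the slides), here is bagging of a CART regressor done by hand with scikit-learn; the arrays X and y and the value of B are placeholders to be supplied by the reader.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    """Train B CART trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # draw n indices with replacement
        tree = DecisionTreeRegressor()         # fully grown tree: low bias, high variance
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagging_predict(trees, X):
    """Average the predictions of the individual trees (Eq. (1))."""
    return np.mean([t.predict(X) for t in trees], axis=0)

scikit-learn's BaggingRegressor and BaggingClassifier implement the same idea directly.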


Bagging diagram

Figure: Illustration of the bagging principle (with L_n^{Θ_b} = D_b, ĥ(·, Θ_b) = f̂_b and ĥ_bag = f̂_bag)


Random forests

Method introduced by Leo Breiman in 2001 [Breiman, 2001]
Based on older ideas: bagging [Breiman, 1996] and CART decision trees [Breiman et al., 1984]
Proofs of convergence in [Biau et al., 2008]
The random forests method belongs to the family of ensemble methods


Random forests (notations)


D = {(x_1, y_1), ..., (x_n, y_n)} the learning set; each (x_i, y_i) is an independent realization of the random variables (X, Y)
X ∈ R^d the input variables; Y ∈ Y the output variable, with Y = R for regression and Y = {1, ..., C} for classification
Goal: build a predictor f̂ : R^d → Y
Random forests idea
{f̂_b(·, Θ_b), 1 ≤ b ≤ B} is a set of decision tree predictors,
(Θ_b)_{1≤b≤B} characterises the b-th tree in terms of split variables, cutpoints at each node and terminal-node values,
the random forests predictor f̂ is obtained by aggregating the set of trees
for regression,
\[ \hat{f}(x) := \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x, \Theta_b) \qquad (3) \]
for classification,
\[ \hat{f}(x) := \underset{1 \leq c \leq C}{\operatorname{argmax}} \; \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}_{\{\hat{f}_b(x, \Theta_b) = c\}} \qquad (4) \]

Random forests
Random forests consist of growing a large number (ex: 400) of
randomly constructed decision trees before aggregating them
In statistical terms, if the trees are decorrelated, this reduces the
variance of the predictions
Naïve example with 2 trees:
Averaging these two regression trees gives the prediction (24.7 + 23.3)/2 = 24

source: Arbres de décision et forêts aléatoires, P. Gaillard, 2014



The problem of correlation between trees

Bagging idea: aggregate many noisy but approximately unbiased¹ tree models to reduce the variance
However, there is necessarily some overlap between bootstrapped datasets
⇒ the trees corresponding to each of them are correlated
Intuition: if the B trees f_b(x) are identically distributed, of variance σ², with a pairwise correlation coefficient ρ = Corr(f_b(x), f_{b'}(x)), ∀ b ≠ b', the variance of their mean is
\[ V\!\left[ f_{bag}(X) \right] = \rho \sigma^2 + \frac{(1 - \rho)\, \sigma^2}{B} \qquad (5) \]
Thus, the variance cannot be shrunk below ρσ²
⇒ Disadvantage of bagging

¹ if sufficiently deep
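A quick numerical check of Eq. (5) (a sketch, not from the slides): simulate B equicorrelated variables with correlation ρ and compare the empirical variance of their mean with the formula. The values of B, σ and ρ are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
B, sigma, rho, n_trials = 50, 1.0, 0.3, 100_000

# Build equicorrelated Gaussians: X_b = sigma * (sqrt(rho)*Z + sqrt(1-rho)*E_b)
Z = rng.normal(size=(n_trials, 1))            # shared component -> pairwise correlation rho
E = rng.normal(size=(n_trials, B))            # independent components
X = sigma * (np.sqrt(rho) * Z + np.sqrt(1 - rho) * E)

empirical = X.mean(axis=1).var()
formula = rho * sigma**2 + (1 - rho) * sigma**2 / B
print(f"empirical variance of the mean: {empirical:.4f}")
print(f"Eq. (5):                        {formula:.4f}")   # ~0.314 for these values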

Create low correlated trees

We saw the disadvantage of relying on bootstrapping alone (the bootstrapped trees remain correlated): rather than using all the data to build the trees, we randomly choose for each tree a subset (with possible repetition) of the training data

Let us now introduce the improvement proposed by random forests: lowering the correlation between trees (without increasing the variance too much) through an additional randomization step,
a random choice of the candidate input features used to split each node during the tree-growing process


Definition of Random forests

Definition (Random forests)


A random forest is a set of trees grown on bootstrapped learning data sets where, before each split, a set of m ≪ d input variables (or features) is randomly selected as candidates for splitting

Note: m is the same for all the nodes of all the trees of the forest but, of course, the variables considered at each node for the choice of the best split change randomly


Algorithm of Random forests


1: Require: a dataset D = {(x_i, y_i)}_{1≤i≤n}, the size B of the ensemble, the number m of candidate features for splitting
2: for b = 1 to B do
3:   Draw a bootstrap dataset D_b of size n from the original training set D
4:   Grow a random tree f̂_b using the bootstrapped dataset:
5:   repeat
6:     for all terminal nodes do
7:       Select m variables among the d, at random
8:       Pick the best (variable, split-point) couple among the m
9:       Split the node into two child nodes
10:    end for
11:  until the stopping criterion is met (e.g., minimum number of samples per node reached)
12: end for
return: the ensemble of B trees
Algorithm 1: Pseudo-code to build random forests for regression or classification
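For reference (a sketch, not part of the slides), the same procedure is available off the shelf in scikit-learn, where max_features plays the role of m and n_estimators the role of B; the dataset below is a placeholder.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# placeholder dataset, just to make the sketch runnable
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=400,     # B: number of trees
    max_features="sqrt",  # m: number of candidate features per split
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))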


Random forests in practice

Intuitively, reducing m will reduce the correlation between any pair of trees in the ensemble
⇒ reduce the variance of the average (cf. Eq. (5))
However, the corresponding hypothesis space will be smaller, leading to an increased bias
Heuristics:
for regression, choose m = ⌊d/3⌋ and a minimum node size of 5
for classification, choose m = ⌊√d⌋ and a minimum node size of 1
For further information about random forests, you can refer to [Hastie et al., 2009] (Chap. 15), which provides a bias-variance analysis
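In scikit-learn these heuristics correspond roughly to the following settings (a sketch; min_samples_leaf is used here as an approximation of the "minimum node size", and a float max_features selects int(fraction * d) features).

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# classification heuristic: m = floor(sqrt(d)), minimum node size of 1
clf = RandomForestClassifier(max_features="sqrt", min_samples_leaf=1, n_estimators=400)

# regression heuristic: m = floor(d/3), minimum node size of 5
reg = RandomForestRegressor(max_features=1/3, min_samples_leaf=5, n_estimators=400)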


OOB-error (Out-Of-Bag error)

OOB error (computed on the out-of-bag samples) - the principle

To predict y_i, we only aggregate the predictors f̂_b(·, Θ_b) built on bootstrap samples not containing (x_i, y_i)
for regression,
\[ \text{OOB-error} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
where ŷ_i is the aggregation of the f̂_b whose bootstrap sample D_b does not contain (x_i, y_i)
for classification,
\[ \text{OOB-error} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y_i \neq \hat{y}_i\}} \]
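In scikit-learn the OOB estimate can be obtained directly when fitting the forest (a sketch on a placeholder dataset; for a classifier, oob_score_ reports OOB accuracy, so the OOB error defined above is one minus that value).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data

rf = RandomForestClassifier(n_estimators=400, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the OOB accuracy; the OOB error of the slide is its complement
print("OOB error:", 1 - rf.oob_score_)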


OOB-error

Many advantages,
An OOB-error estimate is almost identical to that obtained by
N-fold cross-validation
Random forests can be fit in one sequence, with cross-validation
being performed along the way
Once the OOB-error stabilizes, the training can be terminated and the value of B is obtained/tuned


Variable Importance (VI)

Random Forests (RF) make it possible to rank the explanatory variables in order of importance for the prediction
In the RF framework, permutation importance indices are preferred to the total decrease of node impurity measures already introduced in Breiman et al. (1984)
Scikit-learn's default feature importances are based on the decrease of node impurity


Variable Importance based on impurity decrease


For one tree,
the Variable Importance (VI) of X_j is calculated as the sum of the decreases in error/impurity over the splits made on the variable X_j
e.g., if X_j is used 2 times to split a node in the tree → you sum these two decreases in Gini index (or cross-entropy, etc.) to obtain its VI

The relative importance is the VI divided by the highest VI value (normalization)
⇒ Values are bounded between 0 and 1

In the case of RF,
the decrease in impurity is averaged over the trees
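A minimal sketch of retrieving these impurity-based importances with scikit-learn (the dataset is a placeholder; note that scikit-learn normalizes the importances to sum to 1 rather than dividing by the highest value).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X, y)

# impurity-based importances, averaged over trees
importances = rf.feature_importances_
for j in np.argsort(importances)[::-1]:
    print(f"X_{j}: {importances[j]:.3f}")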

Variable Importance based on impurity decrease

Pros,
Fast calculation
Easy to obtain via Scikit-learn: one command
feature_importances_
Cons,
Biased approach: it has a tendency to inflate the importance of
continuous features or high-cardinality categorical variables


Variable Importance based on permutation


This approach directly measures the importance of an input variable X_j by observing how a random permutation of its values (which preserves the marginal distribution of the variable but breaks its link with the output) influences model performance
Goal: measure the prediction strength of each variable

Figure: Illustration of the values permutation from one variable (source: Arbres CART et
Forêts aléatoires - Importance et sélection de variables, R. Genuer and J-M. Poggi, 2016)

Variable Importance based on permutation

The process is the following:

1 Grow the RF on the learning set
2 Record the OOB-error E
3 Permute at random the values of the j-th variable in these data
4 Pass this modified dataset to the RF again to obtain predictions
5 Compute the OOB-error on this modified dataset
6 The VI of X_j is the difference between the error on the permuted dataset and the benchmark score E

⇒ The larger the increase in OOB error, the more important the variable
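scikit-learn provides this measure via sklearn.inspection.permutation_importance (a sketch; note it scores on a held-out set rather than on the OOB samples described above, and the dataset is a placeholder).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X_train, y_train)

# permute each feature n_repeats times and record the drop in validation score
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
for j, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"X_{j}: {mean:.3f} +/- {std:.3f}")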


Variable Importance based on permutation

Pros,
Reasonably efficient
Reliable technique
No need to re-train the model at each modification of the dataset
Cons,
More computationally expensive than the default
feature_importances_
Permutation importance overestimates the importance of
correlated predictors [Strobl et al., 2008]


Anomalies detection

RFs are well suited to detecting outliers [Liu et al., 2008]
These are indeed quickly isolated in a separate leaf
The anomaly score of an observation x_i is determined approximately by the average length of the path from the root to the leaf isolating x_i, over the trees of the forest
The shorter the path, the more likely the observation is atypical
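This idea is implemented in scikit-learn's IsolationForest (a sketch on placeholder data).

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_in = rng.normal(size=(500, 2))                     # placeholder inliers
X_out = rng.uniform(low=-6, high=6, size=(20, 2))    # placeholder outliers
X_all = np.vstack([X_in, X_out])

iso = IsolationForest(n_estimators=200, random_state=0).fit(X_all)
labels = iso.predict(X_all)        # +1 for inliers, -1 for detected anomalies
scores = iso.score_samples(X_all)  # the lower the score, the more anomalous
print("detected anomalies:", int((labels == -1).sum()))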


Pros and cons of Random forests

Pros
robust to over-fitting (increasing the number B of trees does not cause over-fitting)
usually: better performance than a single decision tree
direct computation of the "Out-of-Bag" error: cross-validation not required
hyper-parameters (B, m) easy to tune

Cons
black box: difficult to interpret
slower training


Extremely randomized trees

Randomization can be pushed further with extremely randomized forests
Method introduced by [Geurts et al., 2006]
It is an RF with two differences (see the sketch after this list),
  ▶ m < d input variables are selected at random and, for each of these variables, a split point is chosen at random
  ▶ The full learning set D is used to grow each tree (instead of a bootstrapped learning set (D_b)_{1≤b≤B})
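Both differences are reflected in scikit-learn's ExtraTreesClassifier / ExtraTreesRegressor, which by default use the full training set (bootstrap=False) and random split thresholds (a sketch on a placeholder dataset).

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data

# bootstrap=False is the default: each tree sees the full learning set D
et = ExtraTreesClassifier(n_estimators=400, max_features="sqrt", random_state=0)
rf = RandomForestClassifier(n_estimators=400, max_features="sqrt", random_state=0)

print("extra trees   :", cross_val_score(et, X, y, cv=5).mean())
print("random forest :", cross_val_score(rf, X, y, cv=5).mean())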


Impact on correlation and bias

These two differences:

1 Using the full learning set
⇒ achieves a lower bias
But the price is an increased variance

This should be compensated by the randomization of split-points:

2 choosing the split-point at random as well
⇒ reduces the correlation between trees, so as to reduce the variance of the average of the ensemble more strongly


Algorithm of Extremely Randomized Forest

1: Require: a dataset D = {(x_i, y_i)}_{1≤i≤n}, the size B of the ensemble, the number m of candidates for splitting
2: for b = 1 to B do
3:   Grow a random tree using the original dataset D:
4:   repeat
5:     for all terminal nodes do
6:       Select m variables among the d, at random
7:       for all sampled variables do
8:         Select a split point at random
9:       end for
10:      Pick the best (variable, split-point) couple among the m candidates
11:      Split the node into two child nodes
12:    end for
13:  until the stopping criterion is met (e.g., minimum number of samples per node reached)
14: end for
return: the ensemble of B trees
Algorithm 2: Pseudo-code describing the Extremely Randomized Forest approach

Extremely Randomized Forest

Advantages of this approach,


Empirically, it often provides better results than RFs
Lower computational complexity compared to RFs (one chooses
the split among the m randomly drawn split-points)
Disadvantages
Black box: difficult to interpret


References I
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.

Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE.


References II

Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307.

Tibshirani, R. J. and Efron, B. (1993). An introduction to the bootstrap. Monographs on Statistics and Applied Probability, 57:1–436.
