Valentina Alto
Dec 31, 2019
Ensemble Methods for Decision Trees
Decision Trees are popular Machine Learning algorithms used for
both regression and classification tasks. Their popularity mainly
arises from their interpretability and ease of representation, as they
mimic the way the human brain makes decisions.
However, this interpretability comes at a price in terms of prediction
accuracy. To overcome this limitation, several techniques have been
developed with the goal of building strong and robust models
starting from 'weak' ones. These techniques are known as
'ensemble' methods and, in this article, I'm going to talk about three
of them: Bagging, Random Forest and Boosting.
Bagging
The idea of bagging is that, if we were able to train multiple trees on
different datasets and then use the average of their outputs (or, in the
case of classification, the majority vote) to predict the label of
a new observation, we would get more accurate results. Namely,
imagine we train 4 decision trees, on four different datasets drawn
from the same population, to classify an e-mail as spam or not spam.
Then a new e-mail arrives and three of them classify it as spam,
while one classifies it as not spam.
Thanks to the distribution of the outputs, we can be almost sure that
the e-mail is spam. However, if we had trained only one tree, perhaps
on the dataset which returned 'not spam', we would have made a
wrong prediction.
Nevertheless, it is often not feasible to have several different datasets
available. A solution to this problem is bootstrapping: given an
initial dataset, we draw from it (with replacement) B new samples of
the same size, and then train a new tree on each of them.
The procedure is then the same as described above: when a new
test observation arrives, we feed it to all the trees and take the
average response (for regression) or the class with the majority vote
(for classification).
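To make this concrete, here is a minimal sketch of the bootstrap-and-vote procedure in Python. The use of scikit-learn's DecisionTreeClassifier and a synthetic dataset from make_classification are my own illustrative assumptions, not part of the original example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative binary classification data (assumption: any dataset would do).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

B = 100                           # number of bootstrapped samples
rng = np.random.default_rng(0)
trees = []

for _ in range(B):
    # Draw a bootstrap sample of the same size as the training set (with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregate by majority vote across the B trees (labels are 0/1 here).
all_preds = np.stack([t.predict(X_test) for t in trees])   # shape (B, n_test)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Bagged accuracy:", (majority_vote == y_test).mean())
```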
With this procedure, we can also estimate the test error in order to
evaluate our full model. Indeed, when we proceed with
bootstrapping, it turns out that, on average, each bootstrapped
sample contains about 2/3 of the distinct observations of the original
dataset. Hence, since the goal of test error estimation is feeding the
model with data it has never seen before, for each bootstrapped
sample we can use the remaining, so-called 'out of bag' (OOB)
portion of the dataset to test the corresponding tree. So, if n is the
number of observations in the original dataset, each decision tree is
trained on roughly 2n/3 distinct observations and tested on the
remaining n/3.
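As a hedged sketch of how the OOB estimate can be obtained in practice, scikit-learn's BaggingClassifier (whose default base learner is a decision tree) exposes an oob_score option; the dataset below is again a synthetic placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True evaluates each tree on the observations left out of its
# bootstrap sample (roughly 1/3 of the data), giving a test-error estimate
# without a separate validation set.
bag = BaggingClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", bag.oob_score_)
```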
Random Forest
Random Forest relies on the same intuition as bagging, but with
an additional constraint. Every time a tree is grown from a
bootstrapped sample, the algorithm allows each split to consider
only a random subset of size m of the entire covariate space of size p
(with m < p). By doing so, the trees become far less correlated with
one another.
The reason why it might be more accurate than bagging can be
grasped with the following example. Imagine we are using the bagging
algorithm and that, among the predictors, there is one whose split
greatly decreases the Gini index (in other words, it makes the nodes
more 'pure', so it moves quickly towards the final output). If this is
the case, all the trees built on the B samples will probably elect that
predictor as the root of the tree. As a result, all the trees will be
similar to each other and the final prediction might be less accurate.
Now, if the same task is performed using Random Forest, there will
be splits where that predictor is excluded from the candidate subset,
hence the resulting trees will differ from one another. So, thanks to
this constraint, the model is able to explore many more possible
combinations of predictors, without being 'anchored' to the
strongest ones.
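As a rough illustration (again on a synthetic dataset, which is my own assumption), the constraint m < p corresponds to the max_features parameter of scikit-learn's RandomForestClassifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_features controls m, the number of predictors considered at each split;
# 'sqrt' (m ~ sqrt(p)) is a common choice for classification.
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
```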
Boosting
With boosting, the strategy is a bit different. The idea behind
boosting is to build a sequence of trees, each of which is an updated
version of the previous one. We can say that, with boosting, we
capture the very essence of Machine Learning algorithms, since each
tree is built by learning from the errors of its predecessors.
Basically, we start with a very naive model, called f, which predicts 0
for every input, so that the residuals equal the actual values (indeed,
residuals are the difference between actual and fitted values, and the
fitted values are 0). Then, for each iteration 1, 2, … up to a previously
set limit B:
• we fit a tree f(b) on the residuals rather than on the output
variable Y. We can choose as many nodes as we fancy;
however, in this phase, to keep things simple, it is a rule of
thumb to choose only one split (hence two terminal nodes).
Such a tree is called a stump.
• we update the previous model by adding a shrunken version of
the tree obtained above:
f = f + lambda*f(b)
where lambda is a shrinkage parameter whose aim is to let the
algorithm learn slowly, which leads to higher accuracy.
• we update the residuals as well:
residuals = residuals - lambda*f(b)
By doing so iteratively, at each step the final model is updated with
a new stump fitted on the updated residuals, which become smaller
and smaller.
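A minimal sketch of this loop is shown below, assuming a regression setting, scikit-learn's DecisionTreeRegressor for the stumps, and a synthetic dataset (all of these are my illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

B = 500          # number of boosting iterations
lam = 0.01       # shrinkage parameter lambda

# Start from the naive model f(x) = 0, so the residuals equal the actual values.
prediction = np.zeros_like(y, dtype=float)
residuals = y.astype(float)
stumps = []

for _ in range(B):
    # Fit a stump (one split, two terminal nodes) on the current residuals.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    stumps.append(stump)
    update = lam * stump.predict(X)
    prediction += update          # f = f + lambda * f(b)
    residuals -= update           # residuals = residuals - lambda * f(b)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))
```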
The advantage of this procedure is that, since it does not involve
building a large and complex tree for each sample, it can prevent
overfitting (provided that the number of iterations B is not too
large). Plus, as it learns slowly (with lambda normally set between
0.001 and 0.01), it guarantees a good fit without the risk of
over-parametrizing the model, so again it prevents the tree from
overfitting.
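If one prefers a library implementation, scikit-learn's GradientBoostingRegressor exposes the same knobs; the particular values below (small learning rate, stumps, a few hundred iterations) are illustrative, not a recommendation from the original text:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# learning_rate plays the role of lambda; a small value combined with a
# moderate number of iterations (n_estimators = B) keeps the model learning slowly.
gbr = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.01,
    max_depth=1,        # stumps: one split, two terminal nodes
    random_state=0,
).fit(X, y)

print("Training R^2:", gbr.score(X, y))
```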
Conclusion
Ensemble methods are powerful techniques that can largely improve
the predictive accuracy of decision trees. Their drawback, however,
is that they make the final results a bit less easy to present and
interpret. Indeed, as we said at the very beginning, the most
appreciated feature of decision trees is their interpretability and ease
of understanding, which, in a world of algorithms that look like
'black boxes', is an important value. Nevertheless, some visual
alternatives are available and, if ensembling multiple trees improves
accuracy that much, it is certainly worth it.