ICS 603: Advanced Machine Learning
Lecture 2 Part 1
Ensemble Learning: Boosted Trees
Dr. Caroline Sabty
[email protected] Faculty of Informatics and Computer Science
German International University in Cairo
Acknowledgment
The course and the slides are based on the slides of Dr. Seif Eldawlatly and
based on the course created by Prof. Jose Portilla
Boosting
● We’ve learned about single Decision Trees and have sought to
improve upon them with Random Forest models.
● Let’s now explore another methodology for improving on the
single decision tree, known as boosting.
Boosting
● Outline:
○ Boosting and Meta-Learning
○ AdaBoost (Adaptive Boosting) Theory
○ Example of AdaBoost
○ Gradient Boosting Theory
● Related Reading:
○ ISLR: Section 8.2.3
○ Relevant Wikipedia Articles:
■ wikipedia.org/wiki/Boosting_(machine_learning)
■ wikipedia.org/wiki/AdaBoost
Boosting
• Boosting is an iterative procedure used to adaptively change the distribution
of training examples so that the base classifiers will focus on examples that
are hard to classify
• Initially, the examples are assigned equal weights, 1/N, so that they are
equally likely to be chosen for training
• A sample is drawn according to the sampling distribution of the training
examples to obtain a new training set
• Next, a classifier is induced from the training set and used to classify all the
examples in the original data
• The weights of the training examples are updated at the end of each
boosting round
Boosting
• Examples that are classified incorrectly will have their weights increased
• Those that are classified correctly will have their weights unchanged or
decreased
• This forces the classifier to focus on examples that are difficult to classify in
subsequent iterations
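● A minimal sketch of this procedure, assuming scikit-learn stumps as the
base classifiers and an illustrative up-weighting factor (the exact factor
comes from the AdaBoost formulas later):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_round_sketch(X, y, weights, rng):
    N = len(y)
    # Draw a new training set according to the current sampling distribution
    idx = rng.choice(N, size=N, replace=True, p=weights)
    clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    # Classify ALL original examples and up-weight the misclassified ones
    wrong = clf.predict(X) != y
    new_weights = weights * np.where(wrong, 2.0, 1.0)  # illustrative factor
    return clf, new_weights / new_weights.sum()         # re-normalize to sum to 1

# Usage: start from equal weights and call this once per boosting round, e.g.
#   weights = np.ones(len(y)) / len(y); rng = np.random.default_rng(0)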
Boosting
[Diagram: Bagging combines its classifiers by majority voting to produce a
prediction; Boosting combines its classifiers by weighted voting.]
Boosting
● The concept of boosting is not actually a machine learning algorithm;
it is a methodology applied to an existing machine learning algorithm,
most commonly the decision tree.
● Let’s explore this idea of a meta-learning algorithm by reviewing a
simple application and formula.
● Main formula for boosting (the overall boosted meta-learning model,
i.e., the process of aggregating a bunch of weak learners into a strong
ensemble learner):

F(x) = α1·h1(x) + α2·h2(x) + … + αT·hT(x) = Σt αt·ht(x)

● Implies that a combination of estimators, each with an applied
coefficient αt, can act as an effective ensemble estimator.
● Note h(x) can in theory be any machine learning algorithm
(estimator/learner), e.g., a decision tree.
● Can an ensemble of weak learners (very simple versions of the model)
be a strong learner when combined?
● For decision tree models, we can use simple trees in place of h(x)
and combine them with the coefficients on each model.
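● As a toy illustration of this weighted sum (hypothetical ±1 votes and
coefficients, not values from the slides):

import numpy as np

# Hypothetical +1/-1 votes of three weak learners h_t(x) on four points
h = np.array([[ 1,  1, -1, -1],   # h_1(x)
              [ 1, -1, -1,  1],   # h_2(x)
              [ 1,  1,  1, -1]])  # h_3(x)
alpha = np.array([0.9, 0.4, 0.6])  # coefficients alpha_t (assumed values)

# F(x) = sum_t alpha_t * h_t(x); the ensemble class is its sign
F = alpha @ h
print(F)           # weighted sum for each point: [ 1.9  1.1 -0.7 -1.1]
print(np.sign(F))  # ensemble prediction:         [ 1.  1. -1. -1.]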
AdaBoost
● AdaBoost (Adaptive Boosting) works by using an ensemble of weak
learners and then combining them through the use of a weighted
sum.
● AdaBoost adapts by using the results of previously created weak
learners to re-weight misclassified instances for the next weak
learner (trees are not created in parallel; they are adapted in series).
AdaBoost
● What is a weak learner?
○ A weak model is a model that is too simple to perform well on
its own.
○ The weakest decision tree possible is a stump: a single decision
node with two leaves (the weak learner).
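● A minimal stump sketch in scikit-learn (assuming a toy dataset from
make_classification, which is not part of the slides): a depth-1 tree
picks a single feature and threshold, so on its own it is only a weak
predictor.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# A decision stump: one decision node, two leaves
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print("stump depth:", stump.get_depth())     # 1
print("stump accuracy:", stump.score(X, y))  # modest accuracy on its own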
AdaBoost
● Unlike a single decision tree which fits to all the data at once (fitting
the data hard), AdaBoost aggregates multiple weak learners,
allowing the overall ensemble model to learn slowly (tree by tree)
from the features.
● Let’s first understand how this works from a data perspective
● Imagine a classification task with two features, X1 and X2.
● What would a stump classification look like?
[Figure sequence: the data plotted on the X1–X2 plane. A single stump
splits on just one feature, so its decision boundary is a single vertical
line (a threshold on X1) or a single horizontal line (a threshold on X2).]
● How can we combine stumps? How to improve performance with
an ensemble (meta-learning)?
● AdaBoost Process:
○ Main Formulas
○ Algorithmic Steps
○ Visual Walkthrough of Algorithm
● AdaBoost Process: Main Formulas

H(x) = sign( Σt=1..T αt·ht(x) )

○ T is the number of weak learners (trees) in the sum.
○ αt is the weighting factor for weak learner ht (we want it high if
the weak learner is good at predicting).
● AdaBoost Process:

εt = Σi wi · [ ht(xi) ≠ yi ]

○ εt is the error term we want to minimize: the (weighted) sum of
training errors made by weak learner ht.
● AdaBoost Process:

wi ← wi · exp( −αt · yi · ht(xi) ) / Zt

○ Update the weight of each data point: misclassified points get
larger weights, correctly classified points get smaller weights,
and Zt normalizes the weights so they sum to 1.
● AdaBoost Process: Algorithm Steps
○ Initialize all example weights to wi = 1/N.
○ For t = 1, …, T:
■ Fit a weak learner (stump) ht to the weighted data.
■ Compute its weighted error εt.
■ Compute its influence αt.
■ Update and normalize the example weights.
○ Output the weighted ensemble H(x) = sign( Σt αt·ht(x) ).
● AdaBoost Process:

αt = ½ · ln( (1 − εt) / εt )

○ We calculate the actual influence of this classifier in classifying
the data points using this formula.
○ Recall: alpha is how much influence this stump will have in the
final classification.
● AdaBoost Process:
○ The new weak learner is added to the set of all weak learners
already created.
○ The updated weights are used by the next weak learner.
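● Putting the formulas together, a minimal NumPy/scikit-learn sketch of
the AdaBoost loop (labels assumed to be encoded as −1/+1; the helper names
are made up, and this is a simplified illustration, not a reference
implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must be encoded as -1/+1. Returns the stumps and their alphas."""
    N = len(y)
    w = np.ones(N) / N                              # equal weights 1/N
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)            # weak learner on weighted data
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))               # weighted training error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)        # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)       # influence of this stump
        w = w * np.exp(-alpha * y * pred)           # up-weight the mistakes
        w = w / w.sum()                             # normalize (Z_t)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # H(x) = sign( sum_t alpha_t * h_t(x) )
    F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(F)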
● AdaBoost Process: Visual Walkthrough
[Figure sequence: a first stump splits on X1 and receives weight αt. The
example weights are updated, and the updated set of weights is used to
build another weak learner: a second stump that splits on X2, with weight
αt+1. The weights are adjusted again based on the new predictions, and a
third stump (splitting on X1 at a different threshold) is built with
weight αt+2. Finally, we sum all the weighted stump predictions together
to form the ensemble prediction.]
● AdaBoost uses an ensemble of weak learners that learn slowly in
series.
● Certain weak learners have more “say” in the final output than
others due to the multiplied alpha parameter.
● Each subsequent weak learner t is built using the data set reweighted
by weak learner t−1.
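● In practice the whole loop is available off the shelf. A hedged
scikit-learn sketch on assumed toy data (the default base estimator of
AdaBoostClassifier is already a depth-1 decision tree, i.e., a stump):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of weak learners built in series
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))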
● Intuition of Adaptive Boosting:
○ Each stump essentially represents the strength of a feature to
predict.
○ Building these stumps in series and adding in the alpha
parameter allows us to intelligently combine the importance
of each feature together.
● Notes on Adaptive Boosting:
○ Unlike Random Forest, it is possible to overfit with AdaBoost;
however, it takes many trees to do so.
○ Usually the error has already stabilized well before enough trees
have been added to cause overfitting.
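● One way to see this (a hedged sketch on assumed toy data, using
scikit-learn's staged_score to evaluate the ensemble after each boosting
round):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Test accuracy after each boosting round: it typically flattens out
# long before all 200 stumps have been added.
for i, acc in enumerate(ada.staged_score(X_test, y_test), start=1):
    if i % 50 == 0:
        print(f"{i:3d} stumps -> accuracy {acc:.3f}")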
Gradient Boosting
● Gradient Boosting is a very similar idea to AdaBoost, where weak
learners are created in series in order to produce a strong ensemble
model.
● Gradient Boosting makes use of the residual error for learning.
● Gradient Boosting vs. AdaBoost:
○ Larger trees are allowed in Gradient Boosting (not just stumps).
○ The learning rate coefficient is the same for all weak learners.
○ The gradual series learning is based on training on the residuals
of the previous model.
● Gradient Boosting Regression Example

Area m2 | Bedrooms | Bathrooms | Price
200     | 3        | 2         | $500,000
190     | 2        | 1         | $462,000
230     | 3        | 3         | $565,000
● Train a decision tree on data
● Note - not just a stump!
● Get predicted ŷ value
● Get predicted ŷ value

y        | ŷ
$500,000 | $509,000
$462,000 | $509,000
$565,000 | $509,000
● Calculate residual: e = y − ŷ

y        | ŷ        | e
$500,000 | $509,000 | -$9,000
$462,000 | $509,000 | -$47,000
$565,000 | $509,000 | $56,000
● Create new model to predict the error

y        | ŷ        | e        | f1
$500,000 | $509,000 | -$9,000  | -$8,000
$462,000 | $509,000 | -$47,000 | -$50,000
$565,000 | $509,000 | $56,000  | $50,000
● The new model is trained on the original features to predict the
error e; its predictions form the f1 column above.

Area m2 | Bedrooms | Bathrooms
200     | 3        | 2
190     | 2        | 1
230     | 3        | 3
● Update prediction using error prediction
● We can continue this process in series

y        | ŷ        | e        | f1       | F1 = ŷ + f1
$500,000 | $509,000 | -$9,000  | -$8,000  | $501,000
$462,000 | $509,000 | -$47,000 | -$50,000 | $459,000
$565,000 | $509,000 | $56,000  | $50,000  | $559,000
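● The arithmetic from the table above, reproduced as a quick NumPy check
(the f1 values are the ones shown on the slide):

import numpy as np

y     = np.array([500_000, 462_000, 565_000])  # true prices
y_hat = np.full(3, 509_000)                    # initial prediction (the mean)
e     = y - y_hat                              # residuals
f1    = np.array([-8_000, -50_000, 50_000])    # new model's prediction of e
F1    = y_hat + f1                             # updated prediction
print(e)   # [ -9000 -47000  56000]
print(F1)  # [501000 459000 559000]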
● Gradient Boosting Process

Fm(x) = Fm-1(x) + η·fm(x)

○ η is the learning rate, and it is the same across all the models.
● Gradient Boosting Process
○ Create initial model: f0
○ Train another model on the error
■ e = y − f0(x)
○ Create new prediction
■ F1(x) = f0(x) + η·f1(x)
○ Repeat as needed
■ Fm(x) = Fm-1(x) + η·fm(x)
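● A minimal sketch of these steps with scikit-learn regression trees
(toy usage assumed and helper names made up; the initial model f0 is taken
to be the mean of y, as in the worked example):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, eta=0.1, max_depth=3):
    f0 = np.mean(y)                      # initial model: a constant
    F = np.full(len(y), f0)              # current ensemble prediction
    trees = []
    for _ in range(n_trees):
        residual = y - F                                  # e = y - F_{m-1}(x)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F = F + eta * tree.predict(X)                     # F_m = F_{m-1} + eta*f_m
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, eta=0.1):
    F = np.full(len(X), f0)
    for tree in trees:
        F = F + eta * tree.predict(X)
    return F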
● Note, for classification we can use the logit, log( p / (1 − p) ),
as the error metric.
● Note, the learning rate is the same for each new model in the
series; it is not unique to each subsequent model (unlike
AdaBoost’s alpha coefficient).
● Gradient Boosting is fairly robust to overfitting, allowing the
number of estimators to be set high by default (~100).
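● The corresponding off-the-shelf estimator in scikit-learn, shown as a
hedged sketch on assumed toy data (its defaults already reflect the points
above: n_estimators=100 and a single shared learning_rate):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# learning_rate is shared by every tree; n_estimators defaults to 100
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=1)
gbr.fit(X_train, y_train)
print("test R^2:", gbr.score(X_test, y_test))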
● Gradient Boosting Intuition
○ We optimize the series of trees by learning on the
residuals, forcing subsequent trees to attempt to correct
for the error in the previous trees.
● Gradient Boosting Intuition
○ The trade-off is training time.
○ The learning rate is between 0 and 1; a very low value means each
subsequent tree has little “say”, so more trees need to be created,
which increases the computational training time.
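● A small sketch of that trade-off (assumed toy data): with a lower
learning_rate the test score typically keeps improving for many more
trees, so more estimators, and more training time, are needed.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for lr in (0.5, 0.05):
    gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=lr,
                                    random_state=2).fit(X_train, y_train)
    # Test R^2 after each tree is added to the series
    scores = [r2_score(y_test, pred) for pred in gbr.staged_predict(X_test)]
    print(f"learning_rate={lr}: R^2 after 50 trees = {scores[49]:.3f}, "
          f"after 300 trees = {scores[299]:.3f}")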