
ICS 603: Advanced Machine Learning

Lecture 2 Part 1
Ensemble Learning: Boosted Trees

Dr. Caroline Sabty


[email protected]
Faculty of Informatics and Computer Science
German International University in Cairo
Acknowledgment

The course and the slides are based on the slides of Dr. Seif Eldawlatly and on the course created by Prof. Jose Portilla
Boosting

● We’ve learned about single Decision Trees and have sought to improve upon them with Random Forest models.
● Let’s now explore another methodology for improving on the single decision tree, known as boosting.
Boosting

● Outline:
○ Boosting and Meta-Learning
○ AdaBoost (Adaptive Boosting) Theory
○ Example of AdaBoost
○ Gradient Boosting Theory
● Related Reading:
○ ISLR: Section 8.2.3
○ Relevant Wikipedia Articles:
■ wikipedia.org/wiki/Boosting_(machine_learning)
■ wikipedia.org/wiki/AdaBoost
Boosting

• Boosting is an iterative procedure used to adaptively change the distribution of training examples so that the base classifiers will focus on examples that are hard to classify

• Initially, the examples are assigned equal weights, 1/N, so that they are equally likely to be chosen for training

• A sample is drawn according to the sampling distribution of the training examples to obtain a new training set

• Next, a classifier is induced from the training set and used to classify all the examples in the original data

• The weights of the training examples are updated at the end of each boosting round
Boosting

• Examples that are classified incorrectly will have their weights increased

• Those that are classified correctly will have their weights unchanged or
decreased

• This forces the classifier to focus on examples that are difficult to classify in
subsequent iterations
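
● As a rough, illustrative sketch of one such boosting round (not from the slides; the ×2 / ×0.5 reweighting factors below are placeholders, and the exact AdaBoost update is given later in this lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_round(X, y, weights, rng):
    """One boosting round: weighted sampling, fit a base classifier, reweight hard examples."""
    n = len(y)
    # Draw a training sample according to the current weight distribution
    idx = rng.choice(n, size=n, replace=True, p=weights)
    clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

    # Classify ALL examples in the original data and flag the misclassified ones
    wrong = clf.predict(X) != y

    # Increase the weights of hard examples, decrease the rest, then renormalize
    new_weights = weights * np.where(wrong, 2.0, 0.5)   # illustrative factors only
    return clf, new_weights / new_weights.sum()

# Usage: start from equal weights 1/N on a toy dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
weights = np.full(len(y), 1 / len(y))
clf, weights = boosting_round(X, y, weights, rng)
```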
Boosting

[Figure: Bagging vs. Boosting. Both train multiple classifiers; bagging combines their outputs by majority voting to form the prediction, while boosting combines them by weighted voting.]
Boosting

● The concept of boosting is not actually a machine learning algorithm; it is a methodology applied to an existing machine learning algorithm, most commonly the decision tree.
● Let’s explore this idea of a meta-learning algorithm by reviewing a simple application and formula.
● Main formula for boosting (the overall boosted meta-learning model, i.e., the process of aggregating a bunch of weak learners into a strong ensemble learner):

H(x) = α1h1(x) + α2h2(x) + ... + αThT(x)

● This implies that a combination of estimators, each with an applied coefficient, could act as an effective ensemble estimator.
● Note that h(x) can in theory be any machine learning algorithm (estimator/learner), e.g., a decision tree.
● Can an ensemble of weak learners (very simple versions of a model) be a strong learner when combined?
● For decision tree models, we can use simple trees in place of h(x) and combine them with the coefficients αt on each model.
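
● A minimal sketch (my own illustration, not from the slides) of what this weighted combination looks like in code, assuming a list of already-fitted weak learners whose predictions are in {-1, +1}:

```python
import numpy as np

def ensemble_predict(weak_learners, alphas, X):
    """Weighted vote H(x) = sign(sum_t alpha_t * h_t(x)) for labels in {-1, +1}."""
    scores = sum(alpha * h.predict(X) for h, alpha in zip(weak_learners, alphas))
    return np.sign(scores)
```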
AdaBoost

● AdaBoost (Adaptive Boosting) works by using an ensemble of weak learners and then combining them through the use of a weighted sum.
● AdaBoost adapts by using previously created weak learners in order to adjust for misclassified instances in the next created weak learner (the trees are not created in parallel; they are adapted in series).
AdaBoost

● What is a weak learner?
○ A weak model is a model that is too simple to perform well on its own.
○ The weakest decision tree possible would be a stump: one node and two leaves (the weak learner).
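
● In scikit-learn terms (an aside, not from the slides), a stump is simply a depth-1 decision tree:

```python
from sklearn.tree import DecisionTreeClassifier

# A "stump": a single decision node splitting into two leaves
stump = DecisionTreeClassifier(max_depth=1)
```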
AdaBoost

● Unlike a single decision tree which fits to all the data at once (fitting
the data hard), AdaBoost aggregates multiple weak learners,
allowing the overall ensemble model to learn slowly (tree by tree)
from the features.
● Let’s first understand how this works from a data perspective

● Imagine a classification task with two features, X1 and X2.
● What would a stump classification look like?

[Figure: a 2-D scatter of the two classes in the X1-X2 plane, with candidate stump splits drawn as single decision boundaries on X1 or on X2]

● How can we combine stumps? How can we improve performance with an ensemble (meta-learning)?
● AdaBoost Process:
○ Main Formulas
○ Algorithmic Steps
○ Visual Walkthrough of Algorithm
● AdaBoost Process: Main Formulas
○ The final model is a weighted vote over T weak learners, where T is the number of learners (trees):
H(x) = sign( Σt=1..T αt ht(x) )
○ αt is the weighting factor for weak learner t (we want it high if the weak learner is good at predicting):
αt = (1/2) ln( (1 - εt) / εt )
○ εt is the error term we want to minimize: the sum of training errors made by ht, weighted by the current sample weights:
εt = Σi wi · [ht(xi) ≠ yi]
○ At the end of each round we update the weights for each point (misclassified points get larger weights):
wi ← wi · exp(-αt yi ht(xi)) / Zt, where Zt normalizes the weights so they sum to 1
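
● As a quick numeric check of the αt formula (my own example, not from the slides): a weak learner with error εt = 0.1 gets αt = (1/2) ln(0.9 / 0.1) ≈ 1.10, one with εt = 0.5 (no better than random guessing) gets αt = 0, and one with εt > 0.5 would get a negative weight.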
● AdaBoost Process: Algorithm Steps
○ Start with all sample weights equal to 1/N and fit a weak learner (stump) to the weighted data.
○ We calculate the actual influence of this classifier in classifying the data points using the αt formula above. Recall: alpha is how much influence this stump will have in the final classification.
○ The new weak learner is added to the set of all weak learners already created.
○ The updated weights are then used by the next weak learner.
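
● To make these steps concrete, here is a compact from-scratch sketch of the discrete AdaBoost loop described above, using scikit-learn stumps as the weak learners (an illustration with my own function names, not the library's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost for labels y in {-1, +1}, with decision stumps as weak learners."""
    y = np.asarray(y)
    n = len(y)
    weights = np.full(n, 1.0 / n)              # start with equal weights 1/N
    stumps, alphas = [], []

    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)

        # Weighted training error of this weak learner (clipped to avoid log(0))
        eps = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)

        # Influence (alpha) of this stump in the final vote
        alpha = 0.5 * np.log((1 - eps) / eps)

        # Increase weights of misclassified points, decrease the rest, renormalize
        weights = weights * np.exp(-alpha * y * pred)
        weights = weights / weights.sum()

        stumps.append(stump)
        alphas.append(alpha)

    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```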
● AdaBoost Process: Visual Walkthrough

[Figure: a sequence of stump decision boundaries in the X1-X2 plane, one per boosting round]

○ Fit the first stump (a split on X1) and give it influence αt.
○ Use the updated set of weights to proceed and build another weak learner (a split on X2), with influence αt+1.
○ Adjust the weights based on the new predictions and build a third weak learner (another split on X1), with influence αt+2.
○ We sum all the weighted predictions together to form the final classification.
● AdaBoost uses an ensemble of weak learners that learn slowly in series.
● Certain weak learners have more “say” in the final output than others, due to the multiplied alpha parameter.
● Each subsequent weak learner (round t) is built using the data set as reweighted by the previous (round t-1) weak learner.
● Intuition of Adaptive Boosting:
○ Each stump essentially represents the predictive strength of a single feature.
○ Building these stumps in series and adding in the alpha parameter allows us to intelligently combine the importance of each feature.
● Notes on Adaptive Boosting:
○ Unlike Random Forest, it is possible to overfit with AdaBoost; however, it takes many trees to do this.
○ Usually the error has already stabilized well before enough trees are added to cause overfitting.
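
● In practice (an aside, not from the slides), scikit-learn provides this ensemble directly; X_train, y_train, and X_test below are placeholders for your own data:

```python
from sklearn.ensemble import AdaBoostClassifier

# By default, AdaBoostClassifier uses depth-1 decision stumps as its weak learners
ada = AdaBoostClassifier(n_estimators=50)
# ada.fit(X_train, y_train)
# predictions = ada.predict(X_test)
```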
Gradient Boosting

● Gradient Boosting is a very similar idea to AdaBoost, where weak learners are created in series in order to produce a strong ensemble model.
● Gradient Boosting makes use of the residual error for learning.
● Gradient Boosting vs. AdaBoost:
○ Larger trees are allowed in Gradient Boosting.
○ The learning rate coefficient is the same for all weak learners.
○ The gradual series learning is based on training on the residuals of the previous model.
● Gradient Boosting Regression Example

Area (m2) | Bedrooms | Bathrooms | Price
200 | 3 | 2 | $500,000
190 | 2 | 1 | $462,000
230 | 3 | 3 | $565,000
● Train a decision tree on the data (note: not just a stump!) and get the predicted ŷ value for each row:

y | ŷ
$500,000 | $509,000
$462,000 | $509,000
$565,000 | $509,000
● Calculate the residual: e = y - ŷ

y | ŷ | e
$500,000 | $509,000 | -$9,000
$462,000 | $509,000 | -$47,000
$565,000 | $509,000 | $56,000
● Create a new model, f1, trained on the same features (Area (m2), Bedrooms, Bathrooms) to predict the error e:

y | ŷ | e | f1
$500,000 | $509,000 | -$9,000 | -$8,000
$462,000 | $509,000 | -$47,000 | -$50,000
$565,000 | $509,000 | $56,000 | $50,000
● Update the prediction using the error prediction: F1 = ŷ + f1
● We can continue this process in series.

y | ŷ | e | f1 | F1 = ŷ + f1
$500,000 | $509,000 | -$9,000 | -$8,000 | $501,000
$462,000 | $509,000 | -$47,000 | -$50,000 | $459,000
$565,000 | $509,000 | $56,000 | $50,000 | $559,000
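
● The arithmetic in these tables can be reproduced in a few lines (a sketch using the slide’s numbers; the $509,000 initial prediction and the f1 values are taken from the example, not computed here):

```python
import numpy as np

y     = np.array([500_000, 462_000, 565_000])   # true prices
y_hat = np.array([509_000, 509_000, 509_000])   # first model's predictions
e = y - y_hat                                   # residuals: [-9000, -47000, 56000]

f1 = np.array([-8_000, -50_000, 50_000])        # second model's predicted residuals
F1 = y_hat + f1                                 # updated predictions: [501000, 459000, 559000]
```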
● Gradient Boosting Process (the learning rate η is the same across all the models)
○ Create initial model: f0
○ Train another model on the error
■ e = y - f0
○ Create a new prediction
■ F1 = f0 + ηf1
○ Repeat as needed
■ Fm = Fm-1 + ηfm
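
● A compact from-scratch sketch of this loop (my own illustration using scikit-learn regression trees; names such as n_rounds and lr are mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1, max_depth=3):
    """Gradient boosting for squared error: each tree is fit to the current residuals."""
    y = np.asarray(y, dtype=float)
    f0 = y.mean()                              # initial model: a constant prediction
    residuals = y - f0
    trees = []

    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # train on the error of the current ensemble
        residuals -= lr * tree.predict(X)      # F_m = F_{m-1} + lr * f_m
        trees.append(tree)

    return f0, trees

def gradient_boost_predict(f0, trees, X, lr=0.1):
    """Sum the constant model and the learning-rate-scaled tree predictions."""
    return f0 + lr * sum(tree.predict(X) for tree in trees)
```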
● Note: for classification, we can use the logit (log-odds), ln( p / (1 - p) ), as the basis of the error metric.
● Note: the learning rate is the same for each new model in the series; it is not unique to each subsequent model (unlike AdaBoost’s alpha coefficient).
● Gradient Boosting is fairly robust to overfitting, allowing the number of estimators to be set high by default (~100).
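
● In scikit-learn (an aside; X_train, y_train, and X_test are placeholders), these two knobs map directly to constructor arguments:

```python
from sklearn.ensemble import GradientBoostingRegressor

# learning_rate is shared by every tree in the series; n_estimators defaults to 100
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
# gbr.fit(X_train, y_train)
# predictions = gbr.predict(X_test)
```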
● Gradient Boosting Intuition
○ We optimize the series of trees by learning on the residuals, forcing subsequent trees to attempt to correct for the error in the previous trees.
○ The trade-off is training time.
○ The learning rate is between 0 and 1; a very low value means each subsequent tree has little “say”, so more trees need to be created, causing a longer computational training time.
