Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email:
[email protected]
Website: https://sites.google.com/site/ttdat88/
Contents
Introduction
Voting
Bagging
Boosting
Stacking and Blending
Introduction
Definition
An ensemble of classifiers is a set of classifiers whose individual decisions
are combined in some way (typically, by weighted or un-weighted voting)
to classify new examples
Ensembles are often much more accurate than the individual classifiers that
make them up.
Learning Ensembles
Learn multiple alternative definitions of a concept using different training
data or different learning algorithms.
Combine decisions of multiple definitions, e.g. using voting.
[Diagram] Training Data → Data 1, Data 2, …, Data K → Learner 1, Learner 2, …, Learner K → Model 1, Model 2, …, Model K → Model Combiner → Final Model
Necessary and Sufficient Condition
For the idea to work, the classifiers should be
Accurate
Diverse
Accurate: Has an error rate better than random guessing on new instances
Diverse: They make different errors on new data points
Why they Work?
Suppose there are 25 base classifiers
Each classifier has an error rate $\varepsilon = 0.35$
Assume classifiers are independent
Probability that the ensemble classifier makes a wrong prediction (i.e. that at least 13 of the 25 base classifiers err):
$P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
Marquis de Condorcet (1785): the majority vote of independent, better-than-chance voters is wrong with a probability of this binomial form, which shrinks as the number of voters grows.
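This calculation can be checked with a short Python snippet (a minimal sketch of the binomial formula above, using the values 25 and 0.35 from this slide):

from math import comb

eps = 0.35   # error rate of each base classifier
n = 25       # number of independent base classifiers
# the majority vote is wrong when 13 or more of the 25 classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # approximately 0.06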
Value of Ensembles
When combining multiple independent and diverse decisions, each of which is at
least more accurate than random guessing, random errors cancel each other out
and correct decisions are reinforced.
Human ensembles are demonstrably better
How many jelly beans in the jar?: Individual estimates vs. group average.
A Motivating Example
Suppose that you are a patient with a set of symptoms
Instead of taking the opinion of just one doctor (classifier), you decide to take
the opinions of several doctors!
Is this a good idea? Indeed it is.
By consulting many doctors and combining their diagnoses, you can get a fairly
accurate picture of your condition.
The Wisdom of Crowds
The collective knowledge of a diverse and independent body of people
typically exceeds the knowledge of any single individual and can be
harnessed by voting
When Ensembles Work?
Ensemble methods work better with ‘unstable classifiers’
Classifiers that are sensitive to minor perturbations in the training set
Examples:
Decision trees
Rule-based
Artificial neural networks
Ensembles
Homogeneous Ensembles : all individual models are obtained with the same learning
algorithm, on slightly different datasets
Use a single, arbitrary learning algorithm but manipulate training data to make it
learn multiple models.
Data 1, Data 2, …, Data K (slightly different datasets)
Learner 1 = Learner 2 = … = Learner K (the same algorithm)
Different methods for changing training data:
Bagging: Resample training data
Boosting: Reweight training data
Heterogeneous Ensembles : individual models are obtained with different algorithms
Stacking and Blending
The combining mechanism: the outputs of the base classifiers (level-0 classifiers) are used as training data
for another classifier (the level-1 classifier)
Methods of Constructing Ensembles
1. Manipulate training data set
2. Cross-validated Committees
3. Weighted Training Examples
4. Manipulating Input Features
5. Manipulating Output Targets
6. Injecting Randomness
Methods of Constructing Ensembles - 1
1. Manipulate training data set
Bagging (bootstrap aggregation)
On each run, Bagging presents the learning algorithm with a training set
drawn randomly, with replacement, from the original training data. This
process is called bootstrapping.
Each bootstrap sample contains, on average, 63.2% of the original training
examples, with several examples appearing multiple times
Methods of Constructing Ensembles - 2
2. Cross-validated Committees
Construct training sets by leaving out disjoint subsets of the training data
Idea similar to k-fold cross validation
3. Maintain a set of weights over the training examples. At each iteration
the weights are changed to place more emphasis on misclassified
examples (Adaboost)
Methods of Constructing Ensembles - 3
4. Manipulating Input Features
Works if the input features are highly redundant (e.g., down sampling FFT
bins)
5. Manipulating Output Targets
6. Injecting Randomness
Variance and Bias
Bias is due to differences
between the model and the
true function.
Variance represents the
sensitivity of the model to
individual data points
Bias-Variance tradeoff
Voting
Simple Ensemble Techniques
Max Voting: multiple models are used to make predictions for each data point.
The predictions by each model are considered as a ‘vote’. The predictions which
we get from the majority of the models are used as the final prediction.
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import VotingClassifier

model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)
# 'hard' voting: each model casts one vote and the majority class wins
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
Simple Ensemble Techniques
Averaging: multiple predictions are made for each data point in averaging. In this method, we
take an average of predictions from all the models and use it to make the final prediction.
Averaging can be used for making predictions in regression problems or while calculating
probabilities for classification problems.
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)
# average the predicted class probabilities of the three models
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
finalpred = (pred1 + pred2 + pred3) / 3
Simple Ensemble Techniques
Weighted Average: All models are assigned different weights defining
the importance of each model for prediction
# weights reflect each model's importance (here they sum to 1)
finalpred = (pred1*0.3 + pred2*0.3 + pred3*0.4)
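Weighted soft voting can also be done directly with scikit-learn's VotingClassifier. A hedged sketch, assuming the same x_train, y_train, x_test, y_test arrays used in the snippets above:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# voting='soft' averages predicted probabilities; weights give each model a different say
model = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=1)),
                ('knn', KNeighborsClassifier()),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft', weights=[0.3, 0.3, 0.4])
model.fit(x_train, y_train)
model.score(x_test, y_test)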
Bagging and Boosting
Bagging and Boosting
Bagging and Boosting aggregate multiple hypotheses generated by the same
learning algorithm invoked over different distributions of training data
Bagging and Boosting produce a combined classifier with a smaller error on the training
data, since they aggregate multiple hypotheses that individually have larger errors
Bagging : reduce variance
Boosting : reduce bias
Bagging and Boosting
Bagging replicates training sets by sampling with replacement from the
training instances
Boosting uses all instances but weights them and therefore produces different
classifiers
Classifiers are then combined by voting to create a composite classifier
Bagging: classifiers have equal vote. Majority wins
Boosting: each classifier's vote depends on its accuracy, so more accurate classifiers
get extra weight
Bagging
[Diagram] The learning algorithm (ML) is run on T bootstrap samples, producing classifiers f1, …, fT that are combined into f.
Boosting
[Diagram] The learning algorithm (ML) is run on the original training sample and then on successively reweighted samples, producing f1, …, fT that are combined into f.
Principle of Adaboost
Failure is the mother of success
[Diagram] A strong classifier is built as a weighted combination of weak classifiers applied to the feature vector.
Bagging
Bagging
Create ensembles by repeatedly randomly resampling the training data (Breiman, 1996).
Given a training set of size N, create K samples, each of size N, by drawing N examples from the
original data, with replacement.
Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are
duplicates.
Combine the K resulting models using simple majority vote.
Decreases error by reducing the variance that comes from unstable learners:
algorithms (like decision trees) whose output can change dramatically when the training data is
slightly changed.
Bagging
• Also known as bootstrap aggregation
Original Data:      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):  1  8  5  10 5  5  9  6  3  7
• Sampling uniformly with replacement
• Build classifier on each bootstrap sample
• Each bootstrap sample Di contains approximately 63.2% of the original training
data
• The remaining ~36.8% (the out-of-bag examples) can be used as a test set
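The 63.2% figure can be checked empirically with a quick simulation (a minimal sketch, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
N = 100000
bootstrap = rng.integers(0, N, size=N)        # sample N indices with replacement
unique_fraction = len(np.unique(bootstrap)) / N
print(unique_fraction)   # about 0.632; the other ~36.8% are the out-of-bag examples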
Bagging
• Decision stump: a single-level binary decision tree
• Accuracy at most 70%
Bagging
Accuracy of ensemble classifier: 100% ☺
Bagging- Final Points
Works well if the “base classifiers” are unstable
Increased accuracy because it reduces the variance of the individual
classifier
Does not focus on any particular instance of the training data
Therefore, less susceptible to model over-fitting when applied to noisy
data
Bagging Algorithm: Training Phase
1. Initialize the parameters:
D = {} (the empty ensemble)
K = number of classifiers to train
2. For k = 1 to K:
Take a bootstrap sample Sk from the training data
Build a classifier Dk using Sk as the training set
Add the classifier to the ensemble: D = D ∪ {Dk}
3. Return D
Bagging Algorithm: Classification Phase
4. Run D1,...,DK on the input data x
5. The class with the maximum number of votes is chosen as the label for x
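A minimal NumPy/scikit-learn sketch of the two phases above (assuming integer class labels and NumPy arrays X and y; the function names are illustrative only):

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, K=25, base=DecisionTreeClassifier(), seed=0):
    """Training phase: build K classifiers on bootstrap samples."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample S_k
        ensemble.append(clone(base).fit(X[idx], y[idx]))  # classifier D_k added to D
    return ensemble

def bagging_classify(ensemble, X):
    """Classification phase: run D_1,...,D_K on x and take the majority vote."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)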
Why Bagging Works
The main sources of error in learning are bias and variance.
Bias is due to differences between the model and the true function.
Variance represents the sensitivity of the model to individual data points.
Does bagging minimize these errors? Yes.
Averaging over bootstrap samples can reduce error from variance especially in
unstable classifiers
When is bagging useful?
Bagging is bad if models are very similar (not independent enough)
This happens if the learning algorithm is stable
That is, the model does not usually change much after changing a few training instances
Other methods
Bagging meta-estimator
Random Forest
Extremely Randomized Trees
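These variants are all available in scikit-learn. A hedged sketch of how they might be compared, assuming the x_train, y_train, x_test, y_test arrays from the earlier snippets:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# BaggingClassifier uses a decision tree as its default base estimator
models = [BaggingClassifier(n_estimators=100, random_state=1),
          RandomForestClassifier(n_estimators=100, random_state=1),
          ExtraTreesClassifier(n_estimators=100, random_state=1)]
for m in models:
    m.fit(x_train, y_train)
    print(type(m).__name__, m.score(x_test, y_test))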
Summary of Bagging
Individual models trained on bootstrap-sampled instances, predictions
are aggregated
Bagging is useful when the algorithm to learn individual models is:
Relatively accurate
Relatively unstable (high variance)
The aggregated model is then usually better than the original model
trained on full dataset
Boosting
Boosting
Originally developed by computational learning theorists to guarantee
performance improvements on fitting training data for a weak learner
that only needs to generate a hypothesis with a training accuracy greater
than 0.5 (Schapire, 1990).
Revised to be a practical algorithm, AdaBoost, for building ensembles
that empirically improves generalization performance (Freund & Schapire, 1996).
Boosting
Instances are given weights. At each iteration, a new hypothesis is learned and
the instances are reweighted to focus on instances that the most-recently-
learned classifier got wrong.
Initially, all N instances are assigned equal weights
Unlike bagging, weights may change at the end of a boosting round
Boosting
Equal weights are assigned to each training instance (1/N for round 1)
After a classifier Mi is learned, the weights are adjusted to allow the
subsequent classifier Mi+1 to "pay more attention" to instances that were
misclassified by Mi.
Final boosted classifier M* combines the votes of each individual classifier
Weight of each classifier’s vote is a function of its accuracy
Adaboost – popular boosting algorithm
Adaboost
AdaBoost (adaptive boosting) is an ensemble learning algorithm that can
be used for classification or regression.
AdaBoost creates the strong learner by iteratively adding weak learners
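In scikit-learn this is available as AdaBoostClassifier. A minimal sketch; by default the weak learner is a depth-1 decision tree, and the usual x_train/y_train arrays from earlier snippets are assumed:

from sklearn.ensemble import AdaBoostClassifier

# 50 boosting rounds; the default base estimator is a decision stump
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(x_train, y_train)
ada.score(x_test, y_test)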
Toy Example – taken from Antonio Torralba @MIT
Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$.
The weak learners come from the family of lines.
$h \Rightarrow p(\text{error}) = 0.5$: each candidate line on its own is at chance.
Toy example
Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$.
This one (the line selected in the figure) seems to be the best.
This is a 'weak classifier': it performs slightly better than chance.
Toy example
Each data point has a class label $y_t \in \{+1, -1\}$.
We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$
We set a new problem for which the previous weak classifier performs at chance again.
Toy example
(Figures repeated for several boosting rounds: in each round the weights are updated with $w_t \leftarrow w_t \exp\{-y_t H_t\}$ and a new weak classifier is trained on the reweighted problem, for which the previous classifiers are again at chance.)
Toy example
[Diagram] The four weak linear classifiers f1, f2, f3, f4.
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
Adaboost Strategy
At each stage of the algorithm, Adaboost trains a new classifier using a data set
in which the weighting coefficients have been adjusted according to the
performance of the previously trained classifier so as to give greater weight to
misclassified instances
Finally, when the desired number of base classifiers have been trained, their
results are combined to form a committee, giving different weights to different
classifiers
Adaboost: Initialization
Given a set of input vectors {x1,x2,...,xN} along with binary target values
{t1,t2,...,tN}.
That is, $t_n \in \{-1, +1\}$
Each instance is given a weight wn
Initially, set wn = 1/N, for all n
Assume that we have a procedure to train the base (weak) classifier. (Say, a
Perceptron)
Boosting Framework
Weight Progression
Each base classifier $y_m(x)$ is trained on a weighted form of the training set.
The weights depend upon the performance of the previous base classifier $y_{m-1}(x)$.
Adaboost Algorithm
1. Initialize the data weighting coefficients {wn} by setting
$w_n^{(1)} = \frac{1}{N}$ for $n = 1, 2, \ldots, N$
2. For m = 1, ..., M
a. Fit a classifier $y_m(x)$ to the training data by minimizing the weighted error function
$J_m = \sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(x_n) \neq t_n\big)$
where $I\big(y_m(x_n) \neq t_n\big) = 1$ when $y_m(x_n) \neq t_n$, and 0 otherwise
Indicator Function
The I above is called the indicator function
Notice that I = 1 when an instance is misclassified
$J_m$ is the "error" function of the m-th classifier: it sums the weights
associated with the misclassified training instances.
The quantity $\varepsilon_m$ can be thought of as the weighted "error rate" of each base
classifier on the data set.
Epsilon & Alpha
b. Evaluate the quantities
$\varepsilon_m = \dfrac{\sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(x_n) \neq t_n\big)}{\sum_{n=1}^{N} w_n^{(m)}}$
$\alpha_m = \ln\left\{\dfrac{1 - \varepsilon_m}{\varepsilon_m}\right\}$
Weight Update & Prediction
c. Update the data weighting coefficients
$w_n^{(m+1)} = w_n^{(m)} \exp\big\{\alpha_m \, I\big(y_m(x_n) \neq t_n\big)\big\}$
3. Make predictions using the final model
$Y_M(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m \, y_m(x)\right)$
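The algorithm above can be written out almost line for line in Python. The sketch below uses decision stumps as the base classifiers and assumes binary targets t coded as -1/+1; the function names are illustrative, not from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M=50):
    N = len(X)
    w = np.full(N, 1.0 / N)                     # step 1: w_n^(1) = 1/N
    classifiers, alphas = [], []
    for m in range(M):
        ym = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)  # step 2a
        miss = (ym.predict(X) != t)             # indicator I(y_m(x_n) != t_n)
        eps = np.sum(w * miss) / np.sum(w)      # step 2b: epsilon_m
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # guard against division by zero in this sketch
        alpha = np.log((1 - eps) / eps)         #          alpha_m
        w = w * np.exp(alpha * miss)            # step 2c: raise the weights of misclassified points
        classifiers.append(ym)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    # step 3: Y_M(x) = sign( sum_m alpha_m * y_m(x) )
    scores = sum(a * ym.predict(X) for a, ym in zip(alphas, classifiers))
    return np.sign(scores)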
Comments
Note that the first base classifier is trained using weights $w_n^{(1)}$ that are all equal.
In subsequent iterations these weights are increased for data instances that are
misclassified and decreased/unchanged for those correctly classified.
The alphas eventually give greater weight to the more accurate classifiers
Experimental Results on Ensembles
(Freund & Schapire, 1996; Quinlan, 1996)
Ensembles have been used to improve generalization accuracy on a
wide variety of problems.
On average, Boosting provides a larger increase in accuracy than
Bagging.
Boosting on rare occasions can degrade accuracy.
Boosting is particularly subject to over-fitting when there is
significant noise in the training data.
Issues in Ensembles
Parallelism in Ensembles: Bagging is easily parallelized, Boosting is not.
Variants of Boosting to handle noisy data.
How “weak” should a base-learner for Boosting be?
Combining Boosting and Bagging.
Adaboost
AdaBoost.M1 and AdaBoost.M2 – original algorithms for binary and multiclass
classification
LogitBoost – binary classification (for poorly separable classes)
Gentle AdaBoost or GentleBoost – binary classification (for use with multilevel
categorical predictors)
RobustBoost – binary classification (robust against label noise)
LSBoost – least squares boosting (for regression ensembles)
Improving
Gradient boosting (GBoosting)
Stochastic Gradient Boosting
Penalized Gradient Boosting
Extreme Gradient Boosting (XGBoost)
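Gradient boosting is available out of the box in scikit-learn (XGBoost is a separate library). A hedged example, again assuming the x_train/y_train arrays from the earlier snippets:

from sklearn.ensemble import GradientBoostingClassifier

# subsample < 1.0 turns this into stochastic gradient boosting
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.8, random_state=1)
gb.fit(x_train, y_train)
gb.score(x_test, y_test)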
Summary:
Can combine many weak classifiers/regressors into a stronger classifier (voting, averaging, bagging):
if weak classifiers/regressors are better than random.
if there is sufficient de-correlation (independence) amongst the weak
classifiers/regressors.
Can combine many (high-bias) weak classifiers/regressors into a strong classifier (boosting):
if weak classifiers/regressors are chosen and combined using knowledge of how
well they and others performed on the task on training data.
The selection and combination encourages the weak classifiers to be
complementary, diverse and de-correlated.
Stacking and Blending
Stacking
Both bagging and boosting assume we have a single “base learning”
algorithm.
But what if we want to ensemble an arbitrary set of classifiers?
E.g., combine the outputs of an SVM, a naïve Bayes model, and a nearest-neighbor
model?
Stacking
[Diagram] The outputs of the base (level-0) models are fed as inputs to a meta-model (level-1 model).
When does stacking work?
Stacking works best when the base models have complementary strengths
and weaknesses.
For example: combining k-nearest neighbor models with different k-
values, Naïve Bayes, and logistic regression. Each of these models has
different underlying assumptions so (hopefully) they will be
complementary.
Stacked learners: first attempt
Stacking
EX:
Step 1: The train set is split into 10 parts.
Step 2: A base model (suppose a
decision tree) is fitted on 9 parts and
predictions are made for the 10th part.
This is repeated for each part, giving
out-of-fold predictions for the whole train set.
Stacking
Step 3: Using this model, predictions
are made on the test set
Step 4: Steps 2 to 3 are repeated for
another base model (say knn)
resulting in another set of
predictions for the train set and test
set.
Stacking
Step 5: The predictions obtained on the train set are
used as features to build a new model (e.g.,
logistic regression)
Step 6: This new model is used to make final
predictions on the test-set meta-features (the base
models' test predictions).
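The six steps above can be condensed with cross_val_predict, which produces the out-of-fold predictions directly. A sketch under the assumption that x_train, y_train, x_test, y_test are NumPy arrays; scikit-learn also ships a ready-made StackingClassifier:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

# Steps 1-4: out-of-fold predictions on the train set, full-fit predictions on the test set
train_meta = np.column_stack(
    [cross_val_predict(m, x_train, y_train, cv=10) for m in base_models])
test_meta = np.column_stack(
    [m.fit(x_train, y_train).predict(x_test) for m in base_models])

# Steps 5-6: the level-1 model is trained on the meta-features
meta_model = LogisticRegression()
meta_model.fit(train_meta, y_train)
meta_model.score(test_meta, y_test)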
Blending
Blending follows the same approach as stacking but uses only a holdout
(validation) set from the train set to make predictions.
In other words, unlike stacking, the predictions are made on the holdout
set only.
The holdout set and the predictions are used to build a model which is
run on the test set.
Blending
Step 1: The train set is split into training and
validation sets.
Step 2: Model(s) are fitted on the training set.
Step 3: The predictions are made on the
validation set and the test set.
Step 4: The validation set and its predictions are
used as features to build a new model.
Step 5: This model is used to make final
predictions on the test set using its meta-features.
Blending
EX: simple code
import pandas as pd
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Level-0 model 1: decision tree
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = pd.DataFrame(model1.predict(x_val))
test_pred1 = pd.DataFrame(model1.predict(x_test))

# Level-0 model 2: k-nearest neighbours
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = pd.DataFrame(model2.predict(x_val))
test_pred2 = pd.DataFrame(model2.predict(x_test))

# Meta-features: original features plus the level-0 predictions
df_val = pd.concat([x_val, val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test, test_pred1, test_pred2], axis=1)

# Level-1 model trained on the validation meta-features
model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Netflix challenge - 1 million USD (2006-2009)
Netflix, an online DVD-rental and online video streaming service
Task: predict user ratings to films from ratings by other users
Goal: improve existing method by 10%
Winner’s solution: ensemble with over 500 heterogeneous models,
aggregated with gradient boosted decision trees
Ensembles based on blending/stacking were key approaches used in the
Netflix competition
Conclusions
Ensemble methods combine several hypotheses into one prediction.
They work better than the best individual hypothesis from the same class
because they reduce bias or variance (or both).
Bagging is mainly a variance-reduction technique, useful for complex
hypotheses.
Boosting focuses on harder examples, and gives a weighted vote to the
hypotheses.
Boosting works by reducing bias and increasing classification margin.
Stacking is a generic approach to ensemble various models and performs very
well in practice.
Exercise
1) Implement the ensemble-model techniques for the diabetes prediction problem
(Diabetes Predictions)
All patients here are females at least 21 years
old of Pima Indian heritage
Number of Instances: 768
Number of Attributes: 8 plus class
Missing Attribute Values: None
Class Distribution (class value 1 is interpreted as "tested positive for diabetes"):
0 (tested_negative): 500 instances
1 (tested_positive): 268 instances
Exercise
❖ Apply the following techniques:
Voting
Hard voting
Soft voting
Weighted voting
Comparison Ensemble models
Bagging
Boosting
Voting
Diabetes Predictions
https://www.openml.org/d/37
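The dataset can be pulled straight from OpenML. A minimal loading sketch (data_id=37 corresponds to the URL above; the split parameters are arbitrary choices, not from the slides):

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

data = fetch_openml(data_id=37, as_frame=True)   # Pima Indians diabetes dataset
X, y = data.data, data.target                    # y: tested_negative / tested_positive
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)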
Voting
Base models (scikit-learn)
KNN
RandomForest
Logistic regression
Voting
Hard voting: 0.7835497835497836
Soft voting: 0.7878787878787878
Weighted voting: 0.7922077922077922
Comparison Ensemble models
Base models:
rf = RandomForestClassifier()
et = ExtraTreesClassifier()
knn = KNeighborsClassifier()
svc = SVC()
rg = RidgeClassifier()
❖ Comparison Ensemble models
▪ Bagging
▪ Boosting
▪ Voting
Reference
https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/