
Trịnh Tấn Đạt

Faculty of Information Technology (Khoa CNTT) – Saigon University


Email: [email protected]
Website: https://sites.google.com/site/ttdat88/
Contents
 Introduction
 Voting
 Bagging
 Boosting
 Stacking and Blending
Introduction
Definition
 An ensemble of classifiers is a set of classifiers whose individual decisions
are combined in some way (typically, by weighted or un-weighted voting)
to classify new examples
 Ensembles are often much more accurate than the individual classifiers that
make them up.
Learning Ensembles
 Learn multiple alternative definitions of a concept using different training
data or different learning algorithms.
 Combine decisions of multiple definitions, e.g. using voting.

Training Data → Data 1, Data 2, …, Data K
Data k → Learner k → Model k (for k = 1, …, K)
Model 1, …, Model K → Model Combiner → Final Model

Necessary and Sufficient Condition
 For the idea to work, the classifiers should be
 Accurate
 Diverse
 Accurate: Has an error rate better than random guessing on new instances
 Diverse: They make different errors on new data points
Why Do They Work?
 Suppose there are 25 base classifiers
 Each classifier has an error rate, ε = 0.35
 Assume classifiers are independent
 Probability that the ensemble classifier makes a wrong prediction (i.e., at least 13 of the 25 classifiers are wrong):

P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \epsilon^{i} (1-\epsilon)^{25-i} \approx 0.06

Marquis de Condorcet (1785): the majority vote is wrong with exactly this probability.
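A quick numerical check of this calculation (a minimal sketch; variable names are illustrative):

from math import comb

eps = 0.35   # error rate of each base classifier
n = 25       # number of independent base classifiers

# The majority vote is wrong when 13 or more of the 25 classifiers are wrong
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # prints approximately 0.06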
Value of Ensembles
 When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
 Human ensembles are demonstrably better
 How many jelly beans in the jar?: Individual estimates vs. group average.
A Motivating Example
 Suppose that you are a patient with a set of symptoms
 Instead of taking the opinion of just one doctor (classifier), you decide to take the opinions of several doctors!
 Is this a good idea? Indeed it is.
 By consulting many doctors and combining their diagnoses, you can get a fairly accurate idea of the diagnosis.
The Wisdom of Crowds
 The collective knowledge of a diverse and independent body of people
typically exceeds the knowledge of any single individual and can be
harnessed by voting
When Ensembles Work?
 Ensemble methods work better with ‘unstable classifiers’
 Classifiers that are sensitive to minor perturbations in the training set
 Examples:
 Decision trees
 Rule-based
 Artificial neural networks
Ensembles
 Homogeneous Ensembles : all individual models are obtained with the same learning
algorithm, on slightly different datasets
 Use a single, arbitrary learning algorithm but manipulate training data to make it
learn multiple models.
 Data 1 ≠ Data 2 ≠ … ≠ Data K
 Learner 1 = Learner 2 = … = Learner K
 Different methods for changing training data:
 Bagging: Resample training data
 Boosting: Reweight training data

 Heterogeneous Ensembles : individual models are obtained with different algorithms


 Stacking and Blending
 combining mechanism is that the output of the classifiers (Level 0 classifiers) will be used as training data
for another classifier (Level 1 classifier)
Methods of Constructing Ensembles
1. Manipulate training data set
2. Cross-validated Committees
3. Weighted Training Examples
4. Manipulating Input Features
5. Manipulating Output Targets
6. Injecting Randomness
Methods of Constructing Ensembles - 1
1. Manipulate training data set
 Bagging (bootstrap aggregation)
 On each run, Bagging presents the learning algorithm with a training set drawn randomly, with replacement, from the original training data. This process is called bootstrapping.
 Each bootstrap sample contains, on average, 63.2% of the original training data, with several examples appearing multiple times
Methods of Constructing Ensembles - 2
2. Cross-validated Committees
 Construct training sets by leaving out disjoint subsets of the training data
 Idea similar to k-fold cross validation
3. Maintain a set of weights over the training examples. At each iteration
the weights are changed to place more emphasis on misclassified
examples (Adaboost)
Methods of Constructing Ensembles - 3
4. Manipulating Input Features
 Works if the input features are highly redundant (e.g., down sampling FFT
bins)
5. Manipulating Output Targets
6. Injecting Randomness
Variance and Bias
 Bias is due to differences
between the model and the
true function.
 Variance represents the
sensitivity of the model to
individual data points
Bias-Variance tradeoff
Voting
Simple Ensemble Techniques
 Max Voting: multiple models are used to make predictions for each data point.
The predictions by each model are considered as a ‘vote’. The predictions which
we get from the majority of the models are used as the final prediction.

from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import VotingClassifier

model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)
# Hard voting: the majority class among the base models' predictions wins
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
Simple Ensemble Techniques
 Averaging: multiple predictions are made for each data point in averaging. In this method, we
take an average of predictions from all the models and use it to make the final prediction.
Averaging can be used for making predictions in regression problems or while calculating
probabilities for classification problems.

from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# Average of the predicted class probabilities
finalpred = (pred1 + pred2 + pred3) / 3
Simple Ensemble Techniques
 Weighted Average: each model is assigned a different weight that defines its importance in the final prediction

finalpred=(pred1*0.3+pred2*0.3+pred3*0.4)
Bagging and Boosting
Bagging and Boosting
 Bagging and Boosting aggregate multiple hypotheses generated by the same
learning algorithm invoked over different distributions of training data
 Bagging and Boosting generate a classifier with a smaller error on the training data because they combine multiple hypotheses which individually have a larger error

 Bagging : reduce variance


 Boosting : reduce bias
Bagging and Boosting
 Bagging replicates training sets by sampling with replacement from the
training instances
 Boosting uses all instances but weights them and therefore produces different
classifiers
 Classifiers are then combined by voting to create a composite classifier
 Bagging: classifiers have equal vote. Majority wins
 Boosting: each vote is weighted by the classifier's accuracy, giving extra weight to the opinions of more accurate classifiers
Bagging
Each learner (ML) is trained on a bootstrap sample, producing models f1, f2, …, fT, which are combined into the final model f.
Boosting
The first learner (ML) is trained on the original training sample (f1); each subsequent learner is trained on a reweighted sample (f2, …, fT); the models are combined into the final model f.
Principle of Adaboost
 Failure is the mother of success: a strong classifier is built as a weighted combination of weak classifiers applied to the feature vector.
Bagging
Bagging
 Create ensembles by repeatedly randomly resampling the training data (Breiman, 1996).
 Given a training set of size N, create K samples, each of size N, by drawing N examples from the
original data, with replacement.
 Each bootstrap sample will on average contain 63.2% of the unique training examples, the rest are
replicates.
 Combine the K resulting models using simple majority vote.
 Decreases error by decreasing the variance in the results due to unstable learners,
algorithms (like decision trees) whose output can change dramatically when the training data is
slightly changed.
Bagging
• Also known as bootstrap aggregation
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

• Sampling uniformly with replacement


• Build classifier on each bootstrap sample
• Each bootstrap sample Di contains approx. 63.2% of the original training
data
• The remaining instances (≈36.8%) can be used as a test set
Bagging
• Base classifier: decision stump (a single-level binary decision tree)
• Accuracy of a single stump: at most 70%
• Accuracy of the bagged ensemble classifier: 100% ☺

Bagging- Final Points
 Works well if the “base classifiers” are unstable
 Increased accuracy because it reduces the variance of the individual
classifier
 Does not focus on any particular instance of the training data
 Therefore, less susceptible to model over-fitting when applied to noisy
data
Bagging Algorithm: Training Phase
1. Initialize the parameters
   a. D = {} (the empty ensemble)
   b. K = number of classifiers to train
2. For k = 1 to K
   a. Take a bootstrap sample Sk from the training data
   b. Build a classifier Dk using Sk as the training set
   c. Add the classifier to the ensemble: D = D ∪ {Dk}
3. Return D
Bagging Algorithm: Classification Phase
4. Run D1, ..., DK on the input data x
5. The class with the maximum number of votes is chosen as the label for x
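A minimal sketch of these two phases with scikit-learn decision trees (assuming x_train, y_train, x_test are numpy arrays and class labels are non-negative integers; names are illustrative):

import numpy as np
from sklearn import tree

K = 10                                   # number of classifiers to train
N = len(x_train)
ensemble = []                            # D = {}

# Training phase
for k in range(K):
    idx = np.random.choice(N, size=N, replace=True)                # bootstrap sample S_k
    clf = tree.DecisionTreeClassifier().fit(x_train[idx], y_train[idx])
    ensemble.append(clf)                                           # D = D ∪ {D_k}

# Classification phase: run D_1, ..., D_K on x and take the majority vote
votes = np.array([clf.predict(x_test) for clf in ensemble])        # shape (K, n_test)
y_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)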
Why Bagging Works
 The main sources of error in learning are bias and variance.
 Bias is due to differences between the model and the true function.
 Variance represents the sensitivity of the model to individual data points.
 Does bagging minimize these errors? Yes
 Averaging over bootstrap samples can reduce the error from variance, especially in unstable classifiers
When is bagging useful?
 Bagging is bad if the models are very similar (not independent enough)
 This happens if the learning algorithm is stable
 That is, the model does not usually change much after changing a few instances
Other methods
 Bagging meta-estimator

 Random Forest

 Extremely Randomized Trees
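A sketch of how these three variants look in scikit-learn (x_train, y_train, x_test, y_test as in the earlier snippets; hyperparameters are illustrative):

from sklearn import tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# Bagging meta-estimator: bag any base estimator, here a decision tree
bagging = BaggingClassifier(tree.DecisionTreeClassifier(), n_estimators=100)
# Random Forest: bagged trees plus random feature selection at each split
rf = RandomForestClassifier(n_estimators=100)
# Extremely Randomized Trees: split thresholds are also chosen at random
et = ExtraTreesClassifier(n_estimators=100)

for model in (bagging, rf, et):
    model.fit(x_train, y_train)
    print(model.__class__.__name__, model.score(x_test, y_test))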


Summary of Bagging
 Individual models trained on bootstrap-sampled instances, predictions
are aggregated
 Bagging is useful when the algorithm to learn individual models is:
 Relatively accurate
 Relatively unstable (high variance)
 The aggregated model is then usually better than the original model
trained on full dataset
Boosting
Boosting
 Originally developed by computational learning theorists to guarantee
performance improvements on fitting training data for a weak learner
that only needs to generate a hypothesis with a training accuracy greater
than 0.5 (Schapire, 1990).
 Revised to be a practical algorithm, AdaBoost, for building ensembles
that empirically improves generalization performance (Freund & Schapire, 1996).
Boosting
 Instances are given weights. At each iteration, a new hypothesis is learned and
the instances are reweighted to focus on instances that the most-recently-
learned classifier got wrong.
 Initially, all N instances are assigned equal weights
 Unlike bagging, weights may change at the end of a boosting round
Boosting
 Equal weights are assigned to each training instance (1/N for round 1)
 After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to “pay more attention” to instances that were misclassified by Mi.
 Final boosted classifier M* combines the votes of each individual classifier
 Weight of each classifier’s vote is a function of its accuracy
 Adaboost – popular boosting algorithm
Adaboost
 AdaBoost (adaptive boosting) is an ensemble learning algorithm that can
be used for classification or regression.
 AdaBoost creates the strong learner by iteratively adding weak learners
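For example, with scikit-learn (its AdaBoostClassifier uses decision stumps as the default weak learners; x_train etc. as in the earlier snippets):

from sklearn.ensemble import AdaBoostClassifier

# Iteratively adds 100 weak learners, reweighting the training data after each round
model = AdaBoostClassifier(n_estimators=100, random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)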
Toy Example – taken from Antonio Torralba @MIT
Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1.
Weak learners come from the family of lines; h ⇒ p(error) = 0.5, i.e., it is at chance.
Toy example
Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1.
This one seems to be the best.
This is a ‘weak classifier’: it performs slightly better than chance.
Toy example
Each data point has a class label y_t ∈ {+1, −1}.
We update the weights: w_t ← w_t · exp{−y_t H_t(x_t)}
We set a new problem for which the previous weak classifier performs at chance again.
Toy example (subsequent rounds)
After each round the weights are updated again, w_t ← w_t · exp{−y_t H_t(x_t)}, and a new problem is set for which the previous weak classifier performs at chance.
Toy example
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.
Adaboost Strategy
 At each stage of the algorithm, Adaboost trains a new classifier using a data set
in which the weighting coefficients have been adjusted according to the
performance of the previously trained classifier so as to give greater weight to
misclassified instances
 Finally, when the desired number of base classifiers has been trained, their results are combined to form a committee, using different weights for different classifiers
Adaboost: Initialization
 Given a set of input vectors {x1,x2,...,xN} along with binary target values
{t1,t2,...,tN}.
 That is, t_n ∈ {-1, +1}
 Each instance is given a weight wn
 Initially, set wn = 1/N, for all n
 Assume that we have a procedure to train the base (weak) classifier. (Say, a
Perceptron)
Boosting Framework
Weight Progression
 Each base classifier y_m(x) is trained on a weighted form of the training set (blue arrows)
 The weights depend upon the performance of the previous base classifier y_{m-1}(x) (green arrows)
Adaboost Algorithm
1. Initialize the data weighting coefficients {w_n} by setting
   w_n^{(1)} = 1/N for n = 1, 2, ..., N
2. For m = 1, ..., M
   a. Fit a classifier y_m(x) to the training data by minimizing the weighted error function

      J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \ne t_n)

   where I(y_m(x_n) ≠ t_n) = 1 when y_m(x_n) ≠ t_n, and 0 otherwise


Indicator Function
 The I above is called the indicator function
 Notice that I = 1 when an instance is misclassified

 J_m is the “error” function of the m-th classifier: it identifies the weights associated with each misclassified training instance and adds them up
 The quantity ε_m can be thought of as the “error rate” of each base classifier on the data set
Epsilon & Alpha
b. Evaluate the quantities

   \epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \ne t_n)}{\sum_{n=1}^{N} w_n^{(m)}}

   \alpha_m = \ln\left\{\frac{1-\epsilon_m}{\epsilon_m}\right\}
Weight Update & Prediction
c. Update the data weighting coefficients

   w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \ne t_n)\}

3. Make predictions using the final model

   Y_M(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)
Comments
 Note that the first base classifier is trained using weights w_n^{(1)} that are all equal
 In subsequent iterations these weights are increased for data instances that are
misclassified and decreased/unchanged for those correctly classified.
 The alphas eventually give greater weight to the more accurate classifiers
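A minimal from-scratch sketch of steps 1–3 above, using decision stumps as the base classifiers (assuming numpy arrays X and targets t in {−1, +1}, with 0 < ε_m < 0.5 in every round; names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M=50):
    N = len(t)
    w = np.full(N, 1.0 / N)                    # step 1: w_n^(1) = 1/N
    learners, alphas = [], []
    for m in range(M):
        # step 2a: fit y_m(x) to the weighted training data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, t, sample_weight=w)
        miss = (stump.predict(X) != t)         # indicator I(y_m(x_n) != t_n)
        # step 2b: weighted error rate epsilon_m and classifier weight alpha_m
        eps = np.sum(w * miss) / np.sum(w)
        alpha = np.log((1 - eps) / eps)
        # step 2c: increase the weights of misclassified instances
        w = w * np.exp(alpha * miss)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # step 3: Y_M(x) = sign( sum_m alpha_m * y_m(x) )
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(scores)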
Experimental Results on Ensembles
(Freund & Schapire, 1996; Quinlan, 1996)
 Ensembles have been used to improve generalization accuracy on a
wide variety of problems.
 On average, Boosting provides a larger increase in accuracy than
Bagging.
 Boosting on rare occasions can degrade accuracy.
 Boosting is particularly subject to over-fitting when there is
significant noise in the training data.
Issues in Ensembles
 Parallelism in Ensembles: Bagging is easily parallelized, Boosting is not.
 Variants of Boosting to handle noisy data.
 How “weak” should a base-learner for Boosting be?
 Combining Boosting and Bagging.
Adaboost
 AdaBoost.M1 and AdaBoost.M2 – original algorithms for binary and multiclass
classification
 LogitBoost – binary classification (for poorly separable classes)
 Gentle AdaBoost or GentleBoost – binary classification (for use with multilevel
categorical predictors)
 RobustBoost – binary classification (robust against label noise)
 LSBoost – least squares boosting (for regression ensembles)
Improving
 Gradient boosting (GBoosting)
 Stochastic Gradient Boosting
 Penalized Gradient Boosting
 Extreme Gradient Boosting (XGBoost)
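A brief illustration of two of these: scikit-learn's GradientBoostingClassifier (subsample < 1 gives stochastic gradient boosting), and XGBoost if the separate xgboost package is installed; parameter values are illustrative:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, subsample=0.8)
gb.fit(x_train, y_train)
gb.score(x_test, y_test)

# Extreme Gradient Boosting (requires the separate xgboost package)
# from xgboost import XGBClassifier
# xgb = XGBClassifier(n_estimators=100, learning_rate=0.1)
# xgb.fit(x_train, y_train)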
Summary:
 Can combine many weak classifiers/regressors into a stronger classifier;
voting, averaging, bagging
 if weak classifiers/regressors are better than random.
 if there is sufficient de-correlation (independence) amongst the weak
classifiers/regressors.
 Can combine many (high-bias) weak classifiers/regressors into a strong
classifier; boosting
 if weak classifiers/regressors are chosen and combined using knowledge of how
well they and others performed on the task on training data.
 The selection and combination encourages the weak classifiers to be
complementary, diverse and de-correlated.
Stacking and Blending
Stacking
 Both bagging and boosting assume we have a single “base learning”
algorithm.
 But what if we want to ensemble an arbitrary set of classifiers?
 E.g., combine the outputs of an SVM, naïve Bayes, and a nearest-neighbor model?
Stacking
A meta-model is trained on the outputs of the base (Level-0) models.
When does stacking work?
 Stacking works best when the base models have complementary strengths
and weaknesses.
 For example: combining k-nearest neighbor models with different k-
values, Naïve Bayes, and logistic regression. Each of these models has
different underlying assumptions so (hopefully) they will be
complementary.
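scikit-learn offers this directly via StackingClassifier; a sketch combining the models mentioned above (names and parameters are illustrative):

from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Level-0 classifiers with complementary assumptions
estimators = [('knn3', KNeighborsClassifier(n_neighbors=3)),
              ('knn9', KNeighborsClassifier(n_neighbors=9)),
              ('nb', GaussianNB())]

# Level-1 (meta) classifier trained on cross-validated Level-0 predictions
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stack.fit(x_train, y_train)
stack.score(x_test, y_test)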
Stacked learners: first attempt
Stacking
 EX:
 Step 1: The train set is split into 10 parts.
 Step 2: A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part.
Stacking
 Step 3: Using this model, predictions are made on the test set.
 Step 4: Steps 2 to 3 are repeated for another base model (say knn), resulting in another set of predictions for the train set and test set.
Stacking
 Step 5: The predictions from the train set are used as features to build a new model (e.g., logistic regression).
 Step 6: This model is used to make final predictions on the test prediction set, as sketched below.
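A compact sketch of steps 1–6, using cross_val_predict to produce the out-of-fold train-set predictions (x_train, y_train, x_test, y_test as before; 10 folds as in step 1; numeric class labels assumed):

import numpy as np
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

dt, knn = tree.DecisionTreeClassifier(), KNeighborsClassifier()

# Steps 1-4: out-of-fold predictions on the train set, full-fit predictions on the test set
train_meta = np.column_stack([cross_val_predict(dt, x_train, y_train, cv=10),
                              cross_val_predict(knn, x_train, y_train, cv=10)])
test_meta = np.column_stack([dt.fit(x_train, y_train).predict(x_test),
                             knn.fit(x_train, y_train).predict(x_test)])

# Steps 5-6: meta-model (logistic regression) built on the stacked predictions
meta = LogisticRegression().fit(train_meta, y_train)
meta.score(test_meta, y_test)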
Blending
 Blending follows the same approach as stacking but uses only a holdout
(validation) set from the train set to make predictions.
 In other words, unlike stacking, the predictions are made on the holdout
set only.
 The holdout set and the predictions are used to build a model which is
run on the test set.
Blending
 Step 1: The train set is split into training and validation sets.
 Step 2: Model(s) are fitted on the training set.
 Step 3: Predictions are made on the validation set and the test set.
 Step 4: The validation set and its predictions are used as features to build a new model.
 Step 5: This model is used to make final predictions on the test set using its meta-features.
Blending
 EX: simple code
import pandas as pd
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Base model 1: decision tree
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = pd.DataFrame(model1.predict(x_val))
test_pred1 = pd.DataFrame(model1.predict(x_test))

# Base model 2: k-nearest neighbours
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = pd.DataFrame(model2.predict(x_val))
test_pred2 = pd.DataFrame(model2.predict(x_test))

# Meta-features: validation/test sets augmented with the base models' predictions
df_val = pd.concat([x_val, val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test, test_pred1, test_pred2], axis=1)

# Meta-model: logistic regression trained on the validation meta-features
model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Netflix challenge - 1 million USD (2006-2009)
 Netflix, an online DVD-rental and online video streaming service
 Task: predict user ratings to films from ratings by other users
 Goal: improve existing method by 10%
 Winner’s solution: ensemble with over 500 heterogeneous models,
aggregated with gradient boosted decision trees

 Ensembles based on blending/stacking were key approaches used in the Netflix competition
Conclusions
 Ensemble methods combine several hypotheses into one prediction.
 They work better than the best individual hypothesis from the same class
because they reduce bias or variance (or both).
 Bagging is mainly a variance-reduction technique, useful for complex
hypotheses.
 Boosting focuses on harder examples, and gives a weighted vote to the
hypotheses.
 Boosting works by reducing bias and increasing classification margin.
 Stacking is a generic approach to ensemble various models and performs very
well in practice.
Exercises
1) Implement the ensemble-model techniques for the diabetes prediction problem (Diabetes Prediction)

All patients are females at least 21 years old of Pima Indian heritage
Number of instances: 768
Number of attributes: 8 plus class
Missing attribute values: none
Class distribution (class value 1 is interpreted as "tested positive for diabetes"):
0 (tested_negative): 500 instances
1 (tested_positive): 268 instances
Exercises
❖ Apply the following techniques
 Voting
 Hard voting
 Soft voting
 Weighted voting

 Comparison of ensemble models
 Bagging
 Boosting
 Voting
Diabetes Predictions
 https://www.openml.org/d/37
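The dataset can be loaded directly from OpenML, for example (data_id=37 corresponds to the URL above):

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Pima Indians diabetes dataset (OpenML data id 37)
data = fetch_openml(data_id=37, as_frame=True)
X, y = data.data, data.target     # y: 'tested_negative' / 'tested_positive'
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)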
Voting
 Base models (scikit-learn)
 KNN
 RandomForest
 Logistic regression

 Voting
 Hard voting 0.7835497835497836
 Soft voting 0.7878787878787878
 Weighted voting 0.7922077922077922
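A sketch of the three voting variants with these base models (exact scores depend on the train/test split, preprocessing, and random seeds; the weights shown are illustrative):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

estimators = [('knn', KNeighborsClassifier()),
              ('rf', RandomForestClassifier(random_state=1)),
              ('lr', LogisticRegression(max_iter=1000))]

hard = VotingClassifier(estimators, voting='hard')
soft = VotingClassifier(estimators, voting='soft')
weighted = VotingClassifier(estimators, voting='soft', weights=[1, 2, 2])

for name, clf in [('hard', hard), ('soft', soft), ('weighted', weighted)]:
    clf.fit(x_train, y_train)
    print(name, clf.score(x_test, y_test))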
Comparison Ensemble models
 Base models:
 rf = RandomForestClassifier()
 et = ExtraTreesClassifier()
 knn = KNeighborsClassifier()
 svc = SVC()
 rg = RidgeClassifier()

❖ Comparison of ensemble models
▪ Bagging
▪ Boosting
▪ Voting
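A sketch of such a comparison with 5-fold cross-validation (X, y as loaded from OpenML above, assuming numeric features; default hyperparameters throughout):

from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier, VotingClassifier,
                              RandomForestClassifier, ExtraTreesClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

models = {
    'Bagging': BaggingClassifier(),
    'Boosting': AdaBoostClassifier(),
    'Voting': VotingClassifier([('rf', RandomForestClassifier()),
                                ('et', ExtraTreesClassifier()),
                                ('knn', KNeighborsClassifier()),
                                ('svc', SVC()),
                                ('rg', RidgeClassifier())], voting='hard'),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())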
Reference
 https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
 https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
