Ensemble Learning Methods

Ensemble Learning

• An ensemble is a composite model that combines a series of low-performing classifiers with the aim of creating an improved classifier.
• Each individual classifier votes, and the final prediction label returned is the one that wins the majority vote.
• Ensembles generally offer better accuracy than any individual (base) classifier.
• Ensemble methods can be parallelized by allocating each base learner to a different machine.
• In short, ensemble learning methods are meta-algorithms that combine several machine learning models into a single predictive model to increase performance.
• Ensemble methods can decrease variance using a bagging approach, decrease bias using a boosting approach, or improve predictions using a stacking approach.
Real life examples

• Let's take a real example to build the intuition.
• Suppose you want to invest in a company XYZ, but you are not sure about its performance.
• So, you look for advice on whether the stock price will increase by more than 6% per annum or not.
• You decide to approach various experts having diverse domain experience:
The survey prediction

• Employee of Company XYZ:
  – In the past, he has been right 70% of the time.
• Financial Advisor of Company XYZ:
  – In the past, he has been right 75% of the time.
• Stock Market Trader:
  – In the past, he has been right 70% of the time.
• Employee of a competitor:
  – In the past, he has been right 60% of the time.
• Market Research team in the same segment:
  – In the past, they have been right 75% of the time.
• Social Media Expert:
  – In the past, he has been right 65% of the time.
Conclusion

• Given the broad spectrum of access you have, you can probably combine all the information and make an informed decision.
• In a scenario where all 6 experts/teams verify that it's a good decision (assuming all the predictions are independent of each other), you will get a combined accuracy rate of 1 − (0.30 × 0.25 × 0.30 × 0.40 × 0.25 × 0.35) = 1 − 0.0007875 = 99.92125% (a quick check of this arithmetic is sketched below).
• The assumption used here that all the predictions are completely independent is slightly extreme, as they are expected to be correlated. However, you can see how confident we can be by combining various forecasts together.
• Well, ensemble learning is no different.
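
A quick arithmetic check of this figure in Python; the error rates below are simply 1 minus each expert's historical accuracy from the list above.

# Error rates (probability of being wrong) for the six experts
error_rates = [0.30, 0.25, 0.30, 0.40, 0.25, 0.35]

# Probability that all six are wrong at once, assuming independence
p_all_wrong = 1.0
for e in error_rates:
    p_all_wrong *= e

print(p_all_wrong)      # 0.0007875
print(1 - p_all_wrong)  # 0.9992125, i.e. 99.92125%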
Ensembling

• An ensemble is the art of combining a diverse set of learners (individual models) together to improve the stability and predictive power of the model.
• In our example, the way we combine all the predictions collectively is termed ensemble learning.
• Moreover, ensemble-based models can be used in both scenarios, i.e., when data is of large volume and when data is too little.
Ensemble Schematic
Basic Ensemble Structure
Summary

• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
  – Algorithms
  – Hyperparameters
  – Representations (Modalities)
  – Training sets
How are the models different?

• Difference in population
• Difference in hypothesis
• Difference in modeling technique
• Difference in initial seed
Why ensembles?

• There are two main, related reasons to use an ensemble over a single model:
– Performance: An ensemble can make better
predictions and achieve better performance than
any single contributing model.
– Robustness: An ensemble reduces the spread or
dispersion of the predictions and model
performance.
Model Error

• The error emerging from any machine learning model can be broken down mathematically into three components:

  Total Error = Bias² + Variance + Irreducible Error

• Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing essential trends.
• Variance, on the other side, quantifies how different the predictions made on the same observation are from each other. A high-variance model will over-fit your training population and perform poorly on any observation beyond the training data.
Bias and Variance
Bias–Variance Tradeoff

• Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point.
• As you continue to make your model more complex, you end up over-fitting it, and hence your model will start suffering from high variance.
• A champion model should maintain a balance between these two types of errors. This is known as the bias-variance trade-off. Ensemble learning is one way to manage this trade-off.
Bias–Variance Tradeoff
Ensemble Creation Approaches

• Unweighted voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting), expertise, etc.
• Stacking – learn the combination function
Boosting vs Bagging
Ensemble Learning Methods

• Bagging
• Boosting
• Stacking
Bootstrap

• The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.
• Let's assume we have a sample of 100 values (x) and we'd like to get an estimate of the mean of the sample.
• We can calculate the mean directly from the sample as:
  mean(x) = 1/100 * sum(x)
Bootstrap

• We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:
• Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
• Calculate the mean of each sub-sample.
• Calculate the average of all of our collected means and use that as our estimated mean for the data.
Bootstrap

• For example, let's say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these, we would take the estimated mean of the data to be 3.367.
• This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients.
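
A minimal sketch of this procedure with NumPy; the sample data below is made up purely for illustration.

import numpy as np

rng = np.random.RandomState(42)
x = rng.normal(loc=50, scale=10, size=100)   # original sample of 100 values

# Draw many bootstrap sub-samples (with replacement) and record their means
boot_means = [np.mean(rng.choice(x, size=len(x), replace=True))
              for _ in range(1000)]

# The average of the collected means is the bootstrap estimate of the mean
print("Sample mean:   ", np.mean(x))
print("Bootstrap mean:", np.mean(boot_means))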
Bagging

• Bagging stands for bootstrap aggregation.
• It combines multiple learners in a way that reduces the variance of the estimates.
• For example, a random forest trains M decision trees: you can train M different trees on different random subsets of the data and perform voting for the final prediction (a minimal code sketch follows below).
• Examples:
  – Random Forest
  – Extra Trees
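
As a concrete sketch, a bagging ensemble of decision trees can be built with scikit-learn's BaggingClassifier; the synthetic dataset and settings here are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Each tree is trained on a bootstrap sample of the data;
# predictions are combined by voting
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42,
)
print(cross_val_score(bagging, X, y, cv=5).mean())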
Bagging
Random Forest

• Random forest is a supervised machine learning algorithm based on ensemble learning.
• Ensemble learning is a type of learning where you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model.
• The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest".
• The random forest algorithm can be used for both regression and classification tasks.
How it works?

• Pick N random records from the dataset.
• Build a decision tree based on these N records.
• Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
• In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. In case of a classification problem, each tree in the forest predicts the category to which the new record belongs, and the new record is assigned to the category that wins the majority vote.
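
A minimal scikit-learn sketch of these steps; the synthetic data and the number of trees are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random subset of
# features considered at each split; the forest predicts by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))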

Majority Voting

Source: medium.com
Regressor Output

Source: medium.com
Boosting

• Boosting algorithms combine a set of low-accuracy classifiers to create a highly accurate classifier.
• A low-accuracy classifier (or weak classifier) offers accuracy only slightly better than flipping a coin.
• This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is reached.
• A highly accurate classifier (or strong classifier) offers an error rate close to 0. The boosting algorithm keeps track of which models failed to predict accurately.
• Boosting algorithms are less affected by the overfitting problem.
Boosting Models

• Models that are typically used in the Boosting technique are:
  – XGBoost (Extreme Gradient Boosting)
  – GBM (Gradient Boosting Machine)
  – AdaBoost (Adaptive Boosting)
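
For instance, scikit-learn's GradientBoostingClassifier implements the GBM idea: shallow trees are added sequentially, each fitted to the errors of the ensemble so far. The dataset and hyperparameters below are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Trees are added one at a time; each new tree corrects the residual
# errors of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
print(cross_val_score(gbm, X, y, cv=5).mean())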
Adaboost Summary

• Initially, AdaBoost selects a training subset randomly.
• It iteratively trains the AdaBoost machine learning model, selecting the training set based on the accuracy of the previous training iteration.
• It assigns a higher weight to wrongly classified observations so that in the next iteration these observations get a higher probability of being selected.
• It also assigns a weight to the trained classifier in each iteration according to the accuracy of the classifier: the more accurate classifier gets the higher weight.
• This process iterates until the complete training data fits without any error or until the specified maximum number of estimators is reached.
• To classify, perform a "vote" across all of the learners you built.
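
A minimal AdaBoostClassifier sketch along these lines, using depth-1 decision trees (stumps) as the weak learners on an illustrative synthetic dataset.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Decision stumps are the classic AdaBoost weak learner; misclassified
# samples are up-weighted before each new stump is trained
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())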
How Adaboost Works?
Adaboost Classifier

Weighting function
Adaboost regressor

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression data: a noisy sine wave
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# A single decision tree vs. an AdaBoost ensemble of trees
regr_1 = DecisionTreeRegressor(max_depth=4)

regr_2 = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4),
    n_estimators=300, random_state=rng
)

regr_1.fit(X, y)
regr_2.fit(X, y)

y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
Comparison
Stacking

• Stacked Generalization, or "Stacking" for short, is an ensemble machine learning algorithm.
• It involves combining the predictions from multiple machine learning models on the same dataset, like bagging and boosting.
• Stacking addresses the question:
  – Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?
Stacking

• The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble.
  – Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (instead of samples of the training dataset).
  – Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (instead of a sequence of models that correct the predictions of prior models).
Stacking

• The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.
  – Level-0 Models (Base-Models): Models fit on the training data and whose predictions are compiled.
  – Level-1 Model (Meta-Model): Model that learns how to best combine the predictions of the base models.
Stacking
Stacking

• The meta-model is trained on the predictions made by the base models on out-of-sample data.
• That is, data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.
• The outputs from the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.
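
A minimal StackingClassifier sketch of this architecture; the choice of base models, meta-model, and data here is illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Level-0 (base) models: fit on the training data
level0 = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Level-1 (meta) model: learns how to combine the base models'
# out-of-fold predictions
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())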
Voting

• Voting is an ensemble machine learning algorithm.
• For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models.
• In classification, a hard voting ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes.
• A soft voting ensemble involves summing the predicted probabilities for class labels and predicting the class label with the largest sum probability.
Voting

• A voting ensemble (or a "majority voting ensemble") is an ensemble machine learning model that combines the predictions from multiple other models.
• It is a technique that may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble.
• A voting ensemble works by combining the predictions from multiple models.
• It can be used for classification or regression.
Voting
Voting

• In the case of regression, this involves calculating the average of the predictions from the models.
• In the case of classification, the predictions for each label are summed and the label with the majority vote is predicted.
  – Regression Voting Ensemble: Predictions are the average of contributing models.
  – Classification Voting Ensemble: Predictions are the majority vote of contributing models.
Voting

• There are two approaches to the majority vote prediction for classification: hard voting and soft voting.
• Hard voting involves summing the predictions for each class label and predicting the class label with the most votes. Soft voting involves summing the predicted probabilities (or probability-like scores) for each class label and predicting the class label with the largest probability.
  – Hard Voting: Predict the class with the largest sum of votes from models.
  – Soft Voting: Predict the class with the largest summed probability from models.
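
A minimal VotingClassifier sketch showing both modes; the models and data are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: majority vote over predicted class labels
hard = VotingClassifier(estimators=models, voting="hard")
# Soft voting: argmax of the summed predicted class probabilities
soft = VotingClassifier(estimators=models, voting="soft")

print(cross_val_score(hard, X, y, cv=5).mean())
print(cross_val_score(soft, X, y, cv=5).mean())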
Voting

• Use voting ensembles when:
  – All models in the ensemble have generally the same good performance.
  – All models in the ensemble mostly already agree.
• Hard voting is appropriate when the models used in the voting ensemble predict crisp class labels. Soft voting is appropriate when the models used in the voting ensemble predict the probability of class membership.
Voting

• Soft voting can be used for models that do not natively predict a class membership probability, although this may require calibration of their probability-like scores prior to being used in the ensemble (e.g. support vector machines, k-nearest neighbors, and decision trees).
  – Hard voting is for models that predict class labels.
  – Soft voting is for models that predict class membership probabilities.
• The voting ensemble is not guaranteed to provide better performance than any single model used in the ensemble.
Voting

• Use a voting ensemble if:
  – It results in better performance than any model used in the ensemble.
  – It results in a lower variance than any model used in the ensemble.
• A voting ensemble is particularly useful for machine learning models that use a stochastic learning algorithm and result in a different final model each time they are trained on the same dataset.
• One example is neural networks that are fit using stochastic gradient descent.
Useful resources

• www.mitu.co.in
• www.pythonprogramminglanguage.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com
