Ensemble Learning Methods
Ensemble Learning
• An ensemble is a composite model that combines a series of low-performing classifiers with the aim of creating an improved classifier.
• Here, individual classifiers vote, and the final prediction label returned is the one that wins the majority vote.
• Ensembles offer more accuracy than an individual or base classifier.
• Ensemble methods can be parallelized by allocating each base learner to a different machine.
• Finally, you can say that ensemble learning methods are meta-algorithms that combine several machine learning methods into a single predictive model to increase performance.
• Ensemble methods can decrease variance using the bagging approach, decrease bias using the boosting approach, or improve predictions using the stacking approach.
Real-life examples
• Let's take a real example to build the intuition.
• Suppose you want to invest in a company XYZ. You are not sure about its performance, though.
• So, you look for advice on whether the stock price will increase by more than 6% per annum or not.
• You decide to approach various experts having diverse domain experience:
The survey prediction
• Employee of Company XYZ:
– In the past, he has been right 70% of the time.
• Financial Advisor of Company XYZ:
– In the past, he has been right 75% of the time.
• Stock Market Trader:
– In the past, he has been right 70% of the time.
• Employee of a competitor:
– In the past, he has been right 60% of the time.
• Market Research team in the same segment:
– In the past, they have been right 75% of the time.
• Social Media Expert:
– In the past, he has been right 65% of the time.
Conclusion
• Given the broad spectrum of access you have, you can probably combine all the information and make an informed decision.
• In a scenario where all 6 experts/teams verify that it's a good decision (assuming all the predictions are independent of each other), you will get a combined accuracy rate of 1 − (30% · 25% · 30% · 40% · 25% · 35%) = 1 − 0.0007875 = 99.92125%, i.e. the probability that at least one expert is right.
• The assumption used here, that all the predictions are completely independent, is slightly extreme, as they are expected to be correlated. However, you can see how confident we can be by combining various forecasts together.
• Well, ensemble learning is no different.
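As a quick check of that arithmetic, a minimal Python sketch (the six error rates are the ones from the survey slide above):

# Probability that at least one of the six independent experts is right
error_rates = [0.30, 0.25, 0.30, 0.40, 0.25, 0.35]   # 1 - accuracy per expert

p_all_wrong = 1.0
for e in error_rates:
    p_all_wrong *= e                                  # all six wrong at once

print(f"combined accuracy = {1 - p_all_wrong:.7%}")  # -> 99.9212500%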
Ensembling
• An ensemble is the art of combining a diverse set of learners (individual models) together to improve the stability and predictive power of the model.
• In our example, the way we combine all the predictions collectively is termed ensemble learning.
• Moreover, ensemble-based models can be used in both scenarios, i.e., when data is of large volume and when data is too little.
Ensemble Schematic
Basic Ensemble Structure
Summary
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
How do models differ?
• Difference in population
• Difference in hypothesis
• Difference in modeling technique
• Difference in initial seed
Why ensembles?
• There are two main reasons to use an ensemble over a single model, and they are related:
– Performance: An ensemble can make better
predictions and achieve better performance than
any single contributing model.
– Robustness: An ensemble reduces the spread or
dispersion of the predictions and model
performance.
Model Error
• The error emerging from any machine learning model can be broken down mathematically into three components: bias error, variance, and irreducible error (the noise inherent in the data itself, which no model can remove). The first two are described below.
• Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps on missing essential trends.
• Variance, on the other hand, quantifies how the predictions made on the same observation differ from each other. A high-variance model will over-fit your training population and perform poorly on any observation beyond training.
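For squared-error loss, this is the standard textbook bias-variance decomposition (stated here for reference; the irreducible term σ² is the noise component):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$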
Bias and Variance
Bias – Variance Tradeoff
• Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point.
• As you continue to make your model more complex, you end up over-fitting your model, and hence your model will start suffering from high variance.
• A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to manage this trade-off.
Bias – Variance Tradeoff
Ensemble Creation Approaches
• Unweighted voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting), expertise, etc.
• Stacking – learn the combination function
Boosting vs Bagging
Ensemble Learning Methods
• Bagging
• Boosting
• Stacking
Bootstrap
• The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.
• Let's assume we have a sample of 100 values (x) and we'd like to get an estimate of the mean of the sample.
• We can calculate the mean directly from the sample as:
mean(x) = 1/100 * sum(x)
Bootstrap
• We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:
• Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
• Calculate the mean of each sub-sample.
• Calculate the average of all of our collected means and use that as our estimated mean for the data.
Bootstrap
• For example, let's say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these, we could take the estimated mean of the data to be 3.367.
• This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients.
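A minimal NumPy sketch of this procedure (the sample itself is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100)   # hypothetical sample of 100 values

# Draw many sub-samples with replacement and record each sub-sample's mean
boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap estimate of its standard error:", np.std(boot_means))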
Bagging
• Bagging stands for bootstrap aggregation.
• It combines multiple learners in a way that reduces the variance of estimates.
• For example, random forest trains M decision trees: you train M different trees on different random subsets of the data and perform voting for the final prediction.
• Examples:
– Random Forest
– Extra Trees
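As a concrete sketch, scikit-learn's BaggingClassifier trains each base learner on a bootstrap sample and votes at prediction time (the dataset below is a synthetic stand-in):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each fit on a bootstrap sample of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))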
Bagging
Random Forest
• Random forest is a type of supervised machine learning algorithm based on ensemble learning.
• Ensemble learning is a type of learning where you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model.
• The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest".
• The random forest algorithm can be used for both regression and classification tasks.
How it works?
• Pick N random records from the dataset.
• Build a decision tree based on these N records.
• Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
• In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. Or, in case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.
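A minimal sketch of these steps using scikit-learn's RandomForestClassifier (the dataset is synthetic, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each built on a bootstrap sample of the records;
# the predicted class is the majority vote across the trees.
# (For regression, RandomForestRegressor averages the trees instead.)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))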
Majority Voting (figure; source: medium.com)
Regressor Output (figure; source: medium.com)
Boosting
• Boosting algorithms combine a set of low-accuracy classifiers to create a highly accurate classifier.
• A low-accuracy classifier (or weak classifier) offers accuracy just better than flipping a coin.
• This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
• A highly accurate classifier (or strong classifier) offers an error rate close to 0. The boosting algorithm can track which models failed to make accurate predictions.
• Boosting algorithms are less affected by the overfitting problem.
Boosting Models
• Models that are typically used in the boosting technique are:
– XGBoost (Extreme Gradient Boosting)
– GBM (Gradient Boosting Machine)
– AdaBoost (Adaptive Boosting)
AdaBoost Summary
• Initially, AdaBoost selects a training subset randomly.
• It iteratively trains the AdaBoost machine learning model by selecting the training set based on the accuracy of the previous training round.
• It assigns a higher weight to wrongly classified observations so that in the next iteration these observations get a higher probability of being selected for classification.
• Also, it assigns a weight to the trained classifier in each iteration according to the accuracy of the classifier: the more accurate classifier gets a higher weight.
• This process iterates until the complete training data fits without any error or until the specified maximum number of estimators is reached.
• To classify, perform a "vote" across all of the learners you built.
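A minimal classification sketch using scikit-learn's AdaBoostClassifier (the synthetic dataset and the depth-1 trees, i.e. decision stumps, are illustrative choices, not from the slides):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps reweighted round by round, then combined by weighted vote
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))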
How AdaBoost Works?
AdaBoost Classifier
Weighting Function
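For reference, the standard AdaBoost weighting (with labels $y_i \in \{-1, +1\}$ and weak learners $h_t$): a round-$t$ classifier with weighted error $\epsilon_t$ receives the weight

$$\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$

and each training example's weight is updated (then renormalized) as

$$w_i \leftarrow w_i \, e^{-\alpha_t \, y_i h_t(x_i)},$$

so misclassified examples gain weight. The final classifier is $H(x) = \mathrm{sign}\big(\sum_t \alpha_t h_t(x)\big)$.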
AdaBoost Regressor
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data: a noisy sine curve
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# A single decision tree vs. a boosted ensemble of 300 trees
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_2 = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4),
    n_estimators=300, random_state=rng
)
regr_1.fit(X, y)
regr_2.fit(X, y)
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
Comparison
Stacking
• Stacked Generalization, or "Stacking" for short, is an ensemble machine learning algorithm.
• It involves combining the predictions from multiple machine learning models on the same dataset, like bagging and boosting.
• Stacking addresses the question:
– Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?
Stacking
• The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble.
– Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).
– Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).
Stacking
• The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.
– Level-0 Models (Base-Models): Models fit on the training data and whose predictions are compiled.
– Level-1 Model (Meta-Model): Model that learns how to best combine the predictions of the base models.
Stacking
Stacking
• The meta-model is trained on the predictions made by the base models on out-of-sample data.
• That is, data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.
• The outputs from the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.
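A minimal sketch using scikit-learn's StackingClassifier, which performs this out-of-sample step internally via cross-validation (the base models and synthetic data here are illustrative placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0: diverse base models; level-1: logistic regression meta-model.
# cv=5 means each base model's out-of-fold predictions become the
# meta-model's training features (the out-of-sample data described above).
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("svm", SVC()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))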
Voting
• Voting is an ensemble machine learning algorithm.
• For regression, a voting ensemble involves making a prediction that is the average of multiple other regression models.
• In classification, a hard voting ensemble involves summing the votes for crisp class labels from other models and predicting the class with the most votes.
• A soft voting ensemble involves summing the predicted probabilities for class labels and predicting the class label with the largest sum probability.
Voting
• A voting ensemble (or a "majority voting ensemble") is an ensemble machine learning model that combines the predictions from multiple other models.
• It is a technique that may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble.
• A voting ensemble works by combining the predictions from multiple models. It can be used for classification or regression.
Voting
Voting
• In the case of regression, this involves calculating the average of the predictions from the models.
• In the case of classification, the predictions for each label are summed and the label with the majority vote is predicted.
– Regression Voting Ensemble: Predictions are the average of contributing models.
– Classification Voting Ensemble: Predictions are the majority vote of contributing models.
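A minimal regression sketch using scikit-learn's VotingRegressor, which averages the contributing models' predictions (models and data are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

# The ensemble prediction is the plain average of the three models' outputs
vote = VotingRegressor([("lr", LinearRegression()),
                        ("tree", DecisionTreeRegressor(random_state=0)),
                        ("knn", KNeighborsRegressor())])
vote.fit(X, y)
print(vote.predict(X[:3]))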
Voting
• There are two approaches to the majority vote prediction for classification; they are hard voting and soft voting.
• Hard voting involves summing the predictions for each class label and predicting the class label with the most votes. Soft voting involves summing the predicted probabilities (or probability-like scores) for each class label and predicting the class label with the largest probability.
– Hard Voting: Predict the class with the largest sum of votes from models.
– Soft Voting: Predict the class with the largest summed probability from models.
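Both variants in one minimal sketch via scikit-learn's VotingClassifier (illustrative models and synthetic data; every estimator in a soft-voting ensemble must support predict_proba):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
models = [("lr", LogisticRegression()),
          ("nb", GaussianNB()),
          ("tree", DecisionTreeClassifier(random_state=0))]

hard = VotingClassifier(models, voting="hard").fit(X, y)  # majority of class labels
soft = VotingClassifier(models, voting="soft").fit(X, y)  # argmax of summed probabilities
print(hard.predict(X[:5]), soft.predict(X[:5]))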
Voting
• Use voting ensembles when:
– All models in the ensemble have generally the same good performance.
– All models in the ensemble mostly already agree.
• Hard voting is appropriate when the models used in the voting ensemble predict crisp class labels. Soft voting is appropriate when the models used in the voting ensemble predict the probability of class membership.
Voting
• Soft voting can be used for models that do not natively predict a class membership probability, although they may require calibration of their probability-like scores prior to being used in the ensemble (e.g. support vector machines, k-nearest neighbors, and decision trees).
– Hard voting is for models that predict class labels.
– Soft voting is for models that predict class membership probabilities.
• The voting ensemble is not guaranteed to provide better performance than any single model used in the ensemble.
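A common way to obtain calibrated probabilities from such a model is scikit-learn's CalibratedClassifierCV; a minimal sketch (the choice of LinearSVC is just one illustrative example):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

# LinearSVC has no predict_proba; calibration adds probability estimates,
# making the model usable inside a soft-voting ensemble
calibrated_svm = CalibratedClassifierCV(LinearSVC(), cv=5)
soft = VotingClassifier([("svm", calibrated_svm),
                         ("lr", LogisticRegression())],
                        voting="soft").fit(X, y)
print(soft.predict_proba(X[:3]))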
Voting
• Use a voting ensemble if:
– It results in better performance than any model used in the ensemble.
– It results in a lower variance than any model used in the ensemble.
• A voting ensemble is particularly useful for machine learning models that use a stochastic learning algorithm and result in a different final model each time they are trained on the same dataset.
• One example is neural networks that are fit using stochastic gradient descent.
Useful resources
• www.mitu.co.in
• www.pythonprogramminglanguage.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com