19: Boosting


Boosting reduces Bias

(Lecture videos: Machine Learning Lecture 32 "Boosting", Cornell CS4780; Video II, Video III)

Scenario: Hypothesis class $\mathbb{H}$, whose set of classifiers has large bias and whose training error is high (e.g. CART trees with very limited depth).
Famous question: In his machine learning class project in 1988, Michael Kearns asked: can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire, 1990)
Solution: Create an ensemble classifier $H_T(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion. In iteration $t$ we add the classifier $\alpha_t h_t(\vec{x})$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent.
However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
Let $\ell$ denote a (convex and differentiable) loss function. With a little abuse of notation we write

$$\ell(H) = \frac{1}{n}\sum_{i=1}^{n} \ell(H(x_i), y_i).$$

Assume we have already finished $t$ iterations and already have an ensemble classifier $H_t(\vec{x})$. Now in iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that minimizes the loss the most,

$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \ell(H_t + \alpha h).$$

Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, the inner product can be defined as $\langle h, g \rangle = \int_x h(x)\,g(x)\,dx$. Since we only have a finite training set, we instead define $\langle h, g \rangle = \sum_{i=1}^{n} h(x_i)\,g(x_i)$.

Gradient descent in functional space


Given $H$, we want to find the step-size $\alpha$ and (weak learner) $h$ to minimize the loss $\ell(H + \alpha h)$. Use a Taylor approximation of $\ell(H + \alpha h)$:

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \langle \nabla \ell(H), h \rangle.$$

This approximation (of $\ell$ as a linear function) only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha \approx 0.1$). With the step-size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$:

$$\operatorname{argmin}_{h \in \mathbb{H}} \ell(H + \alpha h) \;\approx\; \operatorname{argmin}_{h \in \mathbb{H}} \langle \nabla \ell(H), h \rangle \;=\; \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i)$$

We can write $\ell(H) = \sum_{i=1}^{n} \ell(H(x_i)) = \ell(H(x_1), \dots, H(x_n))$ (each prediction is an input to the loss function), and therefore $\frac{\partial \ell}{\partial H}(x_i) = \frac{\partial \ell}{\partial [H(x_i)]}$.

So we can do boosting if we have an algorithm $\mathcal{A}$ to solve

$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \underbrace{\frac{\partial \ell}{\partial [H(x_i)]}}_{r_i}\, h(x_i)$$

We need a function $\mathcal{A}(\{(x_1, r_1), \dots, (x_n, r_n)\}) = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$. In order to make progress this $h$ does not have to be great. We still make progress as long as $\sum_{i=1}^{n} r_i h(x_i) < 0$.

Generic boosting (a.k.a. AnyBoost)
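The pseudocode figure is not reproduced here; as a rough stand-in, here is a minimal Python sketch of the generic loop (my own illustration, not the original pseudocode). The helper `fit_weak_learner` stands in for the algorithm $\mathcal{A}$ above and `loss_grad` returns the gradient entries $r_i = \partial \ell / \partial [H(x_i)]$; both names are assumptions made for this sketch.

```python
# Minimal sketch of generic boosting (AnyBoost); function names are illustrative.
# Assumes: loss_grad(preds, y) returns the vector r with r_i = dl/d[H(x_i)], and
# fit_weak_learner(X, r) approximately solves argmin_{h in H} sum_i r_i * h(x_i).
import numpy as np

def anyboost(X, y, loss_grad, fit_weak_learner, alpha=0.1, T=100):
    ensemble = []                       # list of (alpha_t, h_t)
    preds = np.zeros(len(y))            # H_0(x_i) = 0 for all training points

    for t in range(T):
        r = loss_grad(preds, y)         # r_i = dl / d[H(x_i)]
        h = fit_weak_learner(X, r)      # weak learner returned by the oracle A
        if np.sum(r * h.predict(X)) >= 0:
            break                       # no progress possible: sum_i r_i h(x_i) >= 0
        ensemble.append((alpha, h))
        preds += alpha * h.predict(X)   # H_{t+1} = H_t + alpha * h_{t+1}
    return ensemble
```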

Case study #1: Gradient Boosted Regression Tree (GBRT)

Classification ($y_i \in \{+1, -1\}$) or (even multi-dimensional) regression ($y_i \in \mathbb{R}^k$)
Weak learners, $h \in \mathbb{H}$, are regressors, $h(x) \in \mathbb{R}, \forall x$, typically fixed-depth (e.g. depth = 4) regression trees (hence the name).
Step size $\alpha$ is fixed to a small constant (hyper-parameter).
Loss function: any differentiable convex loss that decomposes over the samples: $L(H) = \sum_{i=1}^{n} \ell(H(x_i))$
In order to use regression trees for gradient boosting, we must be able to find a tree $h(\cdot)$ that solves

$$h = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i), \quad \text{where } r_i = \frac{\partial \ell}{\partial H(x_i)}.$$

We will make two assumptions:

1. First, we assume that $\sum_{i=1}^{n} h^2(x_i)$ = constant. This is simple to do (we normalize the predictions) and important because we could always decrease $\sum_{i=1}^{n} h(x_i) r_i$ by rescaling $h$ with a large constant. By fixing $\sum_{i=1}^{n} h^2(x_i)$ to a constant we are essentially fixing the vector $h$ to lie on a circle, and we are only concerned with its direction but not its length.
2. CART trees are negation closed, i.e. $\forall\, h \in \mathbb{H} \Rightarrow \exists\, {-h} \in \mathbb{H}$. (This is generally true.)

We also define the negative gradient as $t_i = -r_i$.

$\operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$   (This is the original AnyBoost formulation.)

$= \operatorname{argmin}_{h \in \mathbb{H}} -2\sum_{i=1}^{n} t_i h(x_i)$   (Swapping in $t_i$ for $-r_i$ and multiplying by 2, which is a constant.)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \underbrace{t_i^2}_{\text{constant}} - 2 t_i h(x_i) + \underbrace{(h(x_i))^2}_{\text{constant}}$   (Adding the constants $\sum_i t_i^2$ and $h(x_i)^2$.)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \left(h(x_i) - t_i\right)^2$

In other words, we can use good old regression trees and feed in the values $t_i$ as labels for each $x_i$. In each iteration we build a new tree for a different set of "labels" $t_1, \dots, t_n$.

If the loss function $\ell$ is the squared loss, i.e. $\ell(H) = \frac{1}{2}\sum_{i=1}^{n} (H(x_i) - y_i)^2$, then it is easy to show that

$$t_i = -\frac{\partial \ell}{\partial H(x_i)} = y_i - H(x_i),$$

which is simply the residual, i.e. $\vec{r}$ is the vector pointing from $\vec{y}$ to $\vec{H}$. However, it is important that you can use any other differentiable and convex loss function $\ell$, and the solution for your next weak learner $h(\cdot)$ will always be the regression tree minimizing the squared loss.

GBRT in Pseudo Code
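The pseudocode figure is not reproduced here, but the procedure is short enough to sketch directly. Below is a minimal Python sketch for the squared loss (my own illustration, assuming scikit-learn's `DecisionTreeRegressor` as the fixed-depth weak learner), where the "labels" fed to each tree are simply the residuals $t_i = y_i - H(x_i)$.

```python
# Minimal GBRT sketch for the squared loss; assumes scikit-learn is installed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, alpha=0.1, T=100, depth=4):
    trees = []
    H = np.zeros(len(y))                  # current ensemble predictions H_t(x_i)
    for t in range(T):
        t_i = y - H                       # negative gradient of the squared loss = residual
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, t_i)                  # regression tree trained on the "labels" t_i
        H += alpha * tree.predict(X)      # H_{t+1} = H_t + alpha * h_{t+1}
        trees.append(tree)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    return alpha * sum(tree.predict(X) for tree in trees)
```

For a different convex loss, only the line computing `t_i` would change: it becomes the negative gradient of that loss at the current predictions.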

Case Study #2: AdaBoost

Setting: Classification ($y_i \in \{+1, -1\}$)
Weak learners: $h \in \mathbb{H}$ are binary, $h(x_i) \in \{-1, +1\}, \forall x$
Step-size: We perform line-search to obtain the best step-size $\alpha$.
Loss function: Exponential loss $\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)}$

Finding the best weak learner

First we compute the gradient $r_i = \frac{\partial \ell}{\partial H(x_i)} = -y_i e^{-y_i H(x_i)}$.

For notational convenience (and for a reason that will become clear in a little bit), let us define $w_i = \frac{1}{Z} e^{-y_i H(x_i)}$, where $Z = \sum_{i=1}^{n} e^{-y_i H(x_i)}$ is a normalizing factor so that $\sum_{i=1}^{n} w_i = 1$. Note that the normalizing constant $Z$ is identical to the loss function. Each weight $w_i$ therefore has a very nice interpretation: it is the relative contribution of the training point $(x_i, y_i)$ towards the overall loss.

In order to find the best next weak learner, we need to solve the following optimization problem (in the following, we will make use of the fact that $h(x_i) \in \{+1, -1\}$):

$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$   (substitute in: $r_i = -y_i e^{-H(x_i) y_i}$)

$= \operatorname{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} y_i e^{-H(x_i) y_i} h(x_i)$   (substitute in: $w_i = \frac{1}{Z} e^{-H(x_i) y_i}$; dividing by the constant $Z > 0$ does not change the argmin)

$= \operatorname{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} w_i y_i h(x_i)$   ($y_i h(x_i) \in \{+1, -1\}$ with $h(x_i) y_i = 1 \iff h(x_i) = y_i$)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \neq y_i} w_i \;-\; \sum_{i: h(x_i) = y_i} w_i$   ($\sum_{i: h(x_i) = y_i} w_i = 1 - \sum_{i: h(x_i) \neq y_i} w_i$)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \neq y_i} w_i$   (This is the weighted classification error.)

Let us denote this weighted classification error as $\epsilon = \sum_{i: h(x_i) y_i = -1} w_i$. So for AdaBoost, we only need a classifier that can take training data and a distribution over the training set (i.e. normalized weights $w_i$ for all training samples) and which returns a classifier $h \in \mathbb{H}$ that reduces the weighted classification error on these training samples. It doesn't have to do all that well; in order for the inner product $\sum_i r_i h(x_i)$ to be negative, it just needs a weighted training error $\epsilon < 0.5$.
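In practice such a classifier is often a shallow decision tree trained with per-sample weights. A minimal sketch (my own example, assuming scikit-learn's `DecisionTreeClassifier` as a decision stump; any learner that accepts sample weights would do):

```python
# Sketch of a weighted weak learner: a decision stump trained under a distribution w.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weak_learner(X, y, w):
    """Fit a stump on (X, y) with weights w (w_i >= 0, sum_i w_i = 1)."""
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    eps = np.sum(w[stump.predict(X) != y])   # weighted classification error
    return stump, eps
```

As argued above, any stump returned with $\epsilon < 0.5$ is good enough to make progress.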

Finding the stepsize α

In the previous example, GBRT, we set the stepsize α to be a small constant. As it turns out, in the AdaBoost
setting we can find the optimal stepsize (i.e. the one that minimizes ℓ the most) in closed form every time we take
a "gradient" step.

When we are given ℓ, H , h , we would like to solve the following optimization problem:

$$\alpha = \operatorname{argmin}_{\alpha} \ell(H + \alpha h) = \operatorname{argmin}_{\alpha} \sum_{i=1}^{n} e^{-y_i [H(x_i) + \alpha h(x_i)]}$$

We differentiate w.r.t. $\alpha$ and equate with zero:

$\sum_{i=1}^{n} y_i h(x_i)\, e^{-(y_i H(x_i) + \alpha y_i h(x_i))} = 0$   ($y_i h(x_i) \in \{+1, -1\}$)

$-\sum_{i: h(x_i) y_i = 1} e^{-(y_i H(x_i) + \alpha)} \;+\; \sum_{i: h(x_i) y_i = -1} e^{-(y_i H(x_i) - \alpha)} = 0$   ($w_i = \frac{1}{Z} e^{-y_i H(x_i)}$; divide by $Z$)

$-\sum_{i: h(x_i) y_i = 1} w_i e^{-\alpha} \;+\; \sum_{i: h(x_i) y_i = -1} w_i e^{\alpha} = 0$   ($\epsilon = \sum_{i: h(x_i) y_i = -1} w_i$)

$-(1 - \epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} = 0$

$e^{2\alpha} = \frac{1 - \epsilon}{\epsilon}$

$\alpha = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}$

It is unusual that we can find the optimal step-size in such a simple closed form. One consequence is that
AdaBoost converges extremely fast.
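As a quick sanity check (my own illustration, not part of the original notes), the closed form can be compared numerically against a direct one-dimensional minimization of the exponential loss for a fixed weak learner:

```python
# Illustrative numerical check: the closed-form alpha matches a direct 1-D minimization.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 100
w = rng.random(n)
w /= w.sum()                                 # normalized weights, sum(w) = 1
correct = rng.random(n) < 0.7                # pretend h classifies ~70% of points correctly
eps = w[~correct].sum()                      # weighted error of this hypothetical h

# normalized exponential loss as a function of the step size alpha
loss = lambda a: w[correct].sum() * np.exp(-a) + w[~correct].sum() * np.exp(a)
alpha_numeric = minimize_scalar(loss).x
alpha_closed = 0.5 * np.log((1 - eps) / eps)
print(alpha_numeric, alpha_closed)           # the two values agree
```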

Re-normalization

After you take a step, i.e. $H_{t+1} = H_t + \alpha h$, you need to re-compute all the weights and then re-normalize. It is however straightforward to show that the unnormalized weight $\hat{w}_i$ is updated as

$$\hat{w}_i \leftarrow \hat{w}_i \cdot e^{-\alpha h(x_i) y_i}$$

and that the normalizer $Z$ becomes

$$Z \leftarrow Z \cdot 2\sqrt{\epsilon(1 - \epsilon)}.$$

Putting these two together we obtain the following multiplicative update rule:

$$w_i \leftarrow w_i \, \frac{e^{-\alpha h(x_i) y_i}}{2\sqrt{\epsilon(1 - \epsilon)}}.$$

AdaBoost Pseudo-code
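The pseudocode figure is not reproduced here; as a rough stand-in, here is a minimal Python sketch of the loop described above (my own illustration, with decision stumps as the weak learners), combining the weighted weak learner, the closed-form step size, and the multiplicative weight update:

```python
# Minimal AdaBoost sketch; labels y_i in {-1, +1}, decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=100):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                     # initial weights w_i = 1/n
    ensemble = []                               # list of (alpha_t, h_t)
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w[pred != y])              # weighted classification error
        if eps >= 0.5 or eps == 0.0:            # coin toss (or perfect fit): stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # closed-form step size
        ensemble.append((alpha, h))
        w *= np.exp(-alpha * y * pred)          # multiplicative weight update
        w /= 2 * np.sqrt(eps * (1 - eps))       # re-normalize: Z <- Z * 2 sqrt(eps(1-eps))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))
```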

A few remarks:

As long as $\mathbb{H}$ is negation closed (this means for every $h \in \mathbb{H}$ we must also have $-h \in \mathbb{H}$), it cannot be that the error $\epsilon > \frac{1}{2}$. The reason is simply that if $h$ has error $\epsilon$, it must be that $-h$ has error $1 - \epsilon$. So you could just flip $h$ to $-h$ and obtain a classifier with smaller error. As $h$ was found by minimizing the error, this is a contradiction.

The inner loop can terminate once the error reaches $\epsilon = \frac{1}{2}$, and in most cases it will converge to $\frac{1}{2}$ over time. In that case the latest weak learner $h$ is only as good as a coin toss and cannot benefit the ensemble (therefore boosting terminates). Also note that if $\epsilon = \frac{1}{2}$ the step-size $\alpha$ would be zero.

Further analysis

Let us examine each one of these updates.

The weight update:

$$\hat{w}_i \leftarrow \hat{w}_i \cdot e^{-\alpha h(x_i) y_i}.$$

As $h(x_i) y_i$ is either $+1$ (if classified correctly by this weak learner) or $-1$ (otherwise), this weight update multiplies the weight $w_i$ either by a factor $e^{\alpha} > 1$ if the point was classified incorrectly (i.e. increases the weight), or by a factor $e^{-\alpha} < 1$ if it was classified correctly (i.e. decreases the weight).

Normalization update:

$$Z \leftarrow Z \cdot 2\sqrt{\epsilon(1 - \epsilon)}.$$

Previously we established that the normalizer $Z$ is identical to the loss. We can therefore use it to bound the loss function after $T$ iterations:

$$\ell(H) = Z = n \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1 - \epsilon_t)},$$

(the factor $n$ comes from the fact that the initial $Z_0 = n$, when all weights are $\frac{1}{n}$.) If we define $c = \max_t \epsilon_t$, we can establish

$$\ell(H) \le n\left[2\sqrt{c(1 - c)}\right]^T.$$

The function $c(1 - c)$ is maximized at $c = \frac{1}{2}$. But we know that each $\epsilon_t < \frac{1}{2}$ (or else the algorithm would have terminated). Therefore $c(1 - c) < \frac{1}{4}$ and we can re-write it as $c(1 - c) = \frac{1}{4} - \gamma^2$, for some $\gamma$. This leaves us with

$$\ell(H) \le n\left(1 - 4\gamma^2\right)^{\frac{T}{2}}.$$

In other words, the training loss is decreasing exponentially!

In fact, we can go even further and compute after how many iterations we must have zero training error. Note that the training loss is an upper bound on the training error (defined as $\sum_{i=1}^{n} \delta_{H(x_i) \neq y_i}$), simply because $\delta_{H(x_i) \neq y_i} \le e^{-y_i H(x_i)}$ in all cases. We can then compute the number of steps required until the loss is less than 1, which would imply that not a single training input is misclassified:

$$n\left(1 - 4\gamma^2\right)^{\frac{T}{2}} < 1 \;\Rightarrow\; T > \frac{2\log(n)}{\log\left(\frac{1}{1 - 4\gamma^2}\right)}.$$

This is an amazing result. It shows that after O(log(n)) iterations your training error must be zero. In
practice it often makes sense to keep boosting even after you make no more mistakes on the training set.
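To get a feeling for the constants, one can plug numbers into this bound (an illustration with a hypothetical dataset size and edge $\gamma$, not from the original notes):

```python
# Illustrative: rounds T needed until n * (1 - 4*gamma^2)^(T/2) drops below 1.
import numpy as np

n, gamma = 10_000, 0.1                       # hypothetical dataset size and edge
T = 2 * np.log(n) / np.log(1 / (1 - 4 * gamma**2))
print(int(np.ceil(T)))                       # -> 452 rounds suffice in this example
```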

Summary

Boosting is a great way to turn a weak classifier into a strong classifier. It defines a whole family of algorithms, including Gradient Boosting, AdaBoost, LogitBoost, and many others. Gradient Boosted Regression Trees is one of the most popular algorithms for Learning to Rank, the branch of machine learning focused on learning ranking functions, for example for web search engines. A few additional things to know:
The step size α is often referred to as shrinkage.
Some people do not consider gradient boosting algorithms to be part of the boosting family, because they
have no guarantee that the training error decreases exponentially. Often these algorithms are referred to as
stage-wise regression instead.
Inspired by Breiman's Bagging, stochastic gradient boosting subsamples the training data for each weak
learner. This combines the benefits of bagging and boosting. One variant is to subsample only n/2 data
points without replacement, which speeds up the training process.
One advantage of boosted classifiers is that during test time the computation $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ can be stopped prematurely if it becomes clear which way the prediction goes. This is particularly interesting in search engines, where the exact ranking of results typically only matters for the top 10 search results. Stopping the evaluation of lower-ranked search results can lead to tremendous speed-ups. A similar approach is also used by the Viola-Jones algorithm to speed up face detection in images. Here, the algorithm scans regions of an image to detect possible faces. As almost all regions in natural images do not contain faces, there are huge savings if the evaluation can be stopped after just a few weak learners are evaluated. These classifiers are referred to as cascades, which spend very little time on the common case (no face) but more time on the rare interesting case (face). With this approach Viola and Jones were the first to achieve face detection in real time on low-performance hardware (e.g. cameras).
AdaBoost is an extremely powerful algorithm that turns any weak learner able to classify any weighted version of the training set with below 0.5 error into a strong learner whose training error decreases exponentially and which requires only $O(\log(n))$ steps until it is consistent.
