19: Boosting


Boosting reduces Bias

(Lecture videos: Machine Learning Lecture 32 "Boosting", Cornell CS4780; Video II, Video III)

Scenario: Hypothesis class $\mathbb{H}$, whose set of classifiers has large bias and whose training error is high (e.g. CART trees with very limited depth).
Famous question: In his machine learning class project in 1988, Michael Kearns asked: can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire, 1990)
Solution: Create an ensemble classifier $H_T(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion. In iteration $t$ we add the classifier $\alpha_t h_t(\vec{x})$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent.
However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
Let $\ell$ denote a (convex and differentiable) loss function. With a little abuse of notation we write

$$\ell(H) = \frac{1}{n}\sum_{i=1}^{n} \ell(H(x_i), y_i).$$

Assume we have already finished $t$ iterations and already have an ensemble classifier $H_t(\vec{x})$. Now in iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that minimizes the loss the most,

$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \ell(H_t + \alpha h).$$

Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, the inner product can be defined as $\langle h, g \rangle = \int_x h(x)\,g(x)\,dx$. Since we only have a finite training set, we instead define $\langle h, g \rangle = \sum_{i=1}^{n} h(x_i)\,g(x_i)$.

Gradient descent in functional space


Given $H$, we want to find the step-size $\alpha$ and (weak learner) $h$ to minimize the loss $\ell(H + \alpha h)$. Use a Taylor approximation of $\ell(H + \alpha h)$:

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \langle \nabla \ell(H), h \rangle.$$

This approximation (of $\ell$ as a linear function) only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha \approx 0.1$). With the step-size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$:

$$\operatorname{argmin}_{h \in \mathbb{H}} \ell(H + \alpha h) \;\approx\; \operatorname{argmin}_{h \in \mathbb{H}} \langle \nabla \ell(H), h \rangle \;=\; \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i)$$

We can write $\ell(H) = \sum_{i=1}^{n} \ell(H(x_i)) = \ell(H(x_1), \dots, H(x_n))$ (each prediction is an input to the loss function), and therefore $\frac{\partial \ell}{\partial H}(x_i) = \frac{\partial \ell}{\partial [H(x_i)]}$.

So we can do boosting if we have an algorithm $\mathcal{A}$ to solve

$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \underbrace{\frac{\partial \ell}{\partial [H(x_i)]}}_{r_i}\, h(x_i)$$

We need a function $\mathcal{A}(\{(x_1, r_1), \dots, (x_n, r_n)\}) = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$. In order to make progress this $h$ does not have to be great. We still make progress as long as $\sum_{i=1}^{n} r_i h(x_i) < 0$.

Generic boosting (a.k.a. AnyBoost)
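The pseudocode figure is not reproduced here; as a rough stand-in, here is a minimal Python sketch of the generic loop (my own illustration, not the original pseudocode). The helper `fit_weak_learner` stands in for the algorithm $\mathcal{A}$ above and `loss_grad` returns the gradient entries $r_i = \partial \ell / \partial [H(x_i)]$; both names are assumptions made for this sketch.

```python
# Minimal sketch of generic boosting (AnyBoost); function names are illustrative.
# Assumes: loss_grad(preds, y) returns the vector r with r_i = dl/d[H(x_i)], and
# fit_weak_learner(X, r) approximately solves argmin_{h in H} sum_i r_i * h(x_i).
import numpy as np

def anyboost(X, y, loss_grad, fit_weak_learner, alpha=0.1, T=100):
    ensemble = []                       # list of (alpha_t, h_t)
    preds = np.zeros(len(y))            # H_0(x_i) = 0 for all training points

    for t in range(T):
        r = loss_grad(preds, y)         # r_i = dl / d[H(x_i)]
        h = fit_weak_learner(X, r)      # weak learner returned by the oracle A
        if np.sum(r * h.predict(X)) >= 0:
            break                       # no progress possible: sum_i r_i h(x_i) >= 0
        ensemble.append((alpha, h))
        preds += alpha * h.predict(X)   # H_{t+1} = H_t + alpha * h_{t+1}
    return ensemble
```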

Case study #1: Gradient Boosted Regression Tree (GBRT)

Classification ($y_i \in \{+1, -1\}$) or (even multi-dimensional) regression ($y_i \in \mathbb{R}^k$)
Weak learners, $h \in \mathbb{H}$, are regressors, $h(x) \in \mathbb{R}, \forall x$, typically fixed-depth (e.g. depth = 4) regression trees (hence the name).
Step size $\alpha$ is fixed to a small constant (hyper-parameter).
Loss function: any differentiable convex loss that decomposes over the samples: $L(H) = \sum_{i=1}^{n} \ell(H(x_i))$
In order to use regression trees for gradient boosting, we must be able to find a tree $h(\cdot)$ that solves

$$h = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i), \quad \text{where } r_i = \frac{\partial \ell}{\partial H(x_i)}.$$

We will make two assumptions:

1. First, we assume that $\sum_{i=1}^{n} h^2(x_i)$ = constant. This is simple to do (we normalize the predictions) and important because we could always decrease $\sum_{i=1}^{n} h(x_i) r_i$ by rescaling $h$ with a large constant. By fixing $\sum_{i=1}^{n} h^2(x_i)$ to a constant we are essentially fixing the vector $h$ to lie on a circle, and we are only concerned with its direction but not its length.
2. CART trees are negation closed, i.e. $\forall\, h \in \mathbb{H} \Rightarrow \exists\, {-h} \in \mathbb{H}$. (This is generally true.)

We also define the negative gradient as $t_i = -r_i$.

$\operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$   (This is the original AnyBoost formulation.)

$= \operatorname{argmin}_{h \in \mathbb{H}} -2\sum_{i=1}^{n} t_i h(x_i)$   (Swapping in $t_i$ for $-r_i$ and multiplying by 2, which is a constant.)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \underbrace{t_i^2}_{\text{constant}} - 2 t_i h(x_i) + \underbrace{(h(x_i))^2}_{\text{constant}}$   (Adding the constants $\sum_i t_i^2$ and $h(x_i)^2$.)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \left(h(x_i) - t_i\right)^2$

In other words, we can use good old regression trees and feed in the values $t_i$ as labels for each $x_i$. In each iteration we build a new tree for a different set of "labels" $t_1, \dots, t_n$.

If the loss function $\ell$ is the squared loss, i.e. $\ell(H) = \frac{1}{2}\sum_{i=1}^{n} (H(x_i) - y_i)^2$, then it is easy to show that

$$t_i = -\frac{\partial \ell}{\partial H(x_i)} = y_i - H(x_i),$$

which is simply the residual, i.e. $\vec{r}$ is the vector pointing from $\vec{y}$ to $\vec{H}$. However, it is important that you can use any other differentiable and convex loss function $\ell$, and the solution for your next weak learner $h(\cdot)$ will always be the regression tree minimizing the squared loss.

GBRT in Pseudo Code
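The pseudocode figure is not reproduced here, but the procedure is short enough to sketch directly. Below is a minimal Python sketch for the squared loss (my own illustration, assuming scikit-learn's `DecisionTreeRegressor` as the fixed-depth weak learner), where the "labels" fed to each tree are simply the residuals $t_i = y_i - H(x_i)$.

```python
# Minimal GBRT sketch for the squared loss; assumes scikit-learn is installed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, alpha=0.1, T=100, depth=4):
    trees = []
    H = np.zeros(len(y))                  # current ensemble predictions H_t(x_i)
    for t in range(T):
        t_i = y - H                       # negative gradient of the squared loss = residual
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, t_i)                  # regression tree trained on the "labels" t_i
        H += alpha * tree.predict(X)      # H_{t+1} = H_t + alpha * h_{t+1}
        trees.append(tree)
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    return alpha * sum(tree.predict(X) for tree in trees)
```

For a different convex loss, only the line computing `t_i` would change: it becomes the negative gradient of that loss at the current predictions.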

Case Study #2: AdaBoost

Setting: Classification ($y_i \in \{+1, -1\}$)
Weak learners: $h \in \mathbb{H}$ are binary, $h(x_i) \in \{-1, +1\}, \forall x$
Step-size: We perform line-search to obtain the best step-size $\alpha$.
Loss function: Exponential loss $\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)}$

Finding the best weak learner

First we compute the gradient $r_i = \frac{\partial \ell}{\partial H(x_i)} = -y_i e^{-y_i H(x_i)}$.

For notational convenience (and for a reason that will become clear in a little bit), let us define $w_i = \frac{1}{Z} e^{-y_i H(x_i)}$, where $Z = \sum_{i=1}^{n} e^{-y_i H(x_i)}$ is a normalizing factor so that $\sum_{i=1}^{n} w_i = 1$. Note that the normalizing constant $Z$ is identical to the loss function. Each weight $w_i$ therefore has a very nice interpretation: it is the relative contribution of the training point $(x_i, y_i)$ towards the overall loss.

In order to find the best next weak learner, we need to solve the following optimization problem (in the following, we will make use of the fact that $h(x_i) \in \{+1, -1\}$):

$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$   (substitute in: $r_i = -y_i e^{-H(x_i) y_i}$)

$= \operatorname{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} y_i e^{-H(x_i) y_i} h(x_i)$   (substitute in: $w_i = \frac{1}{Z} e^{-H(x_i) y_i}$; dividing by the constant $Z > 0$ does not change the argmin)

$= \operatorname{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} w_i y_i h(x_i)$   ($y_i h(x_i) \in \{+1, -1\}$ with $h(x_i) y_i = 1 \iff h(x_i) = y_i$)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \neq y_i} w_i \;-\; \sum_{i: h(x_i) = y_i} w_i$   ($\sum_{i: h(x_i) = y_i} w_i = 1 - \sum_{i: h(x_i) \neq y_i} w_i$)

$= \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \neq y_i} w_i$   (This is the weighted classification error.)

Let us denote this weighted classification error as $\epsilon = \sum_{i: h(x_i) y_i = -1} w_i$. So for AdaBoost, we only need a classifier that can take training data and a distribution over the training set (i.e. normalized weights $w_i$ for all training samples) and which returns a classifier $h \in \mathbb{H}$ that reduces the weighted classification error on these training samples. It doesn't have to do all that well; in order for the inner product $\sum_i r_i h(x_i)$ to be negative, it just needs a weighted training error $\epsilon < 0.5$.
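In practice such a classifier is often a shallow decision tree trained with per-sample weights. A minimal sketch (my own example, assuming scikit-learn's `DecisionTreeClassifier` as a decision stump; any learner that accepts sample weights would do):

```python
# Sketch of a weighted weak learner: a decision stump trained under a distribution w.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weak_learner(X, y, w):
    """Fit a stump on (X, y) with weights w (w_i >= 0, sum_i w_i = 1)."""
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    eps = np.sum(w[stump.predict(X) != y])   # weighted classification error
    return stump, eps
```

As argued above, any stump returned with $\epsilon < 0.5$ is good enough to make progress.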

Finding the stepsize α

In the previous example, GBRT, we set the stepsize α to be a small constant. As it turns out, in the AdaBoost
setting we can find the optimal stepsize (i.e. the one that minimizes ℓ the most) in closed form every time we take
a "gradient" step.

When we are given ℓ, H , h , we would like to solve the following optimization problem:

$$\alpha = \operatorname{argmin}_{\alpha} \ell(H + \alpha h) = \operatorname{argmin}_{\alpha} \sum_{i=1}^{n} e^{-y_i [H(x_i) + \alpha h(x_i)]}$$

We differentiate w.r.t. $\alpha$ and equate with zero:

$\sum_{i=1}^{n} y_i h(x_i)\, e^{-(y_i H(x_i) + \alpha y_i h(x_i))} = 0$   ($y_i h(x_i) \in \{+1, -1\}$)

$-\sum_{i: h(x_i) y_i = 1} e^{-(y_i H(x_i) + \alpha)} \;+\; \sum_{i: h(x_i) y_i = -1} e^{-(y_i H(x_i) - \alpha)} = 0$   ($w_i = \frac{1}{Z} e^{-y_i H(x_i)}$; divide by $Z$)

$-\sum_{i: h(x_i) y_i = 1} w_i e^{-\alpha} \;+\; \sum_{i: h(x_i) y_i = -1} w_i e^{\alpha} = 0$   ($\epsilon = \sum_{i: h(x_i) y_i = -1} w_i$)

$-(1 - \epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} = 0$

$e^{2\alpha} = \frac{1 - \epsilon}{\epsilon}$

$\alpha = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}$

It is unusual that we can find the optimal step-size in such a simple closed form. One consequence is that
AdaBoost converges extremely fast.
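As a quick sanity check (my own illustration, not part of the original notes), the closed form can be compared numerically against a direct one-dimensional minimization of the exponential loss for a fixed weak learner:

```python
# Illustrative numerical check: the closed-form alpha matches a direct 1-D minimization.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 100
w = rng.random(n)
w /= w.sum()                                 # normalized weights, sum(w) = 1
correct = rng.random(n) < 0.7                # pretend h classifies ~70% of points correctly
eps = w[~correct].sum()                      # weighted error of this hypothetical h

# normalized exponential loss as a function of the step size alpha
loss = lambda a: w[correct].sum() * np.exp(-a) + w[~correct].sum() * np.exp(a)
alpha_numeric = minimize_scalar(loss).x
alpha_closed = 0.5 * np.log((1 - eps) / eps)
print(alpha_numeric, alpha_closed)           # the two values agree
```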

Re-normalization

After you take a step, i.e. $H_{t+1} = H_t + \alpha h$, you need to re-compute all the weights and then re-normalize. It is however straightforward to show that the unnormalized weight $\hat{w}_i$ is updated as

$$\hat{w}_i \leftarrow \hat{w}_i \cdot e^{-\alpha h(x_i) y_i}$$

and that the normalizer $Z$ becomes

$$Z \leftarrow Z \cdot 2\sqrt{\epsilon(1 - \epsilon)}.$$

Putting these two together we obtain the following multiplicative update rule:

$$w_i \leftarrow w_i \, \frac{e^{-\alpha h(x_i) y_i}}{2\sqrt{\epsilon(1 - \epsilon)}}.$$

AdaBoost Pseudo-code
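The pseudocode figure is not reproduced here; as a rough stand-in, here is a minimal Python sketch of the loop described above (my own illustration, with decision stumps as the weak learners), combining the weighted weak learner, the closed-form step size, and the multiplicative weight update:

```python
# Minimal AdaBoost sketch; labels y_i in {-1, +1}, decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=100):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                     # initial weights w_i = 1/n
    ensemble = []                               # list of (alpha_t, h_t)
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w[pred != y])              # weighted classification error
        if eps >= 0.5 or eps == 0.0:            # coin toss (or perfect fit): stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # closed-form step size
        ensemble.append((alpha, h))
        w *= np.exp(-alpha * y * pred)          # multiplicative weight update
        w /= 2 * np.sqrt(eps * (1 - eps))       # re-normalize: Z <- Z * 2 sqrt(eps(1-eps))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))
```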

A few remarks:

As long as $\mathbb{H}$ is negation closed (this means for every $h \in \mathbb{H}$ we must also have $-h \in \mathbb{H}$), it cannot be that the error $\epsilon > \frac{1}{2}$. The reason is simply that if $h$ has error $\epsilon$, it must be that $-h$ has error $1 - \epsilon$. So you could just flip $h$ to $-h$ and obtain a classifier with smaller error. As $h$ was found by minimizing the error, this is a contradiction.

The inner loop can terminate once the error reaches $\epsilon = \frac{1}{2}$, and in most cases it will converge to $\frac{1}{2}$ over time. In that case the latest weak learner $h$ is only as good as a coin toss and cannot benefit the ensemble (therefore boosting terminates). Also note that if $\epsilon = \frac{1}{2}$ the step-size $\alpha$ would be zero.

Further analysis

Let us examine each one of these updates.

The weight update:

$$\hat{w}_i \leftarrow \hat{w}_i \cdot e^{-\alpha h(x_i) y_i}.$$

As $h(x_i) y_i$ is either $+1$ (if classified correctly by this weak learner) or $-1$ (otherwise), this weight update multiplies the weight $w_i$ either by a factor $e^{\alpha} > 1$ if the point was classified incorrectly (i.e. increases the weight), or by a factor $e^{-\alpha} < 1$ if it was classified correctly (i.e. decreases the weight).

Normalization update:

$$Z \leftarrow Z \cdot 2\sqrt{\epsilon(1 - \epsilon)}.$$

Previously we established that the normalizer $Z$ is identical to the loss. We can therefore use it to bound the loss function after $T$ iterations:

$$\ell(H) = Z = n \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1 - \epsilon_t)},$$

(the factor $n$ comes from the fact that the initial $Z_0 = n$, when all weights are $\frac{1}{n}$.) If we define $c = \max_t \epsilon_t$, we can establish

$$\ell(H) \le n\left[2\sqrt{c(1 - c)}\right]^T.$$

The function $c(1 - c)$ is maximized at $c = \frac{1}{2}$. But we know that each $\epsilon_t < \frac{1}{2}$ (or else the algorithm would have terminated). Therefore $c(1 - c) < \frac{1}{4}$ and we can re-write it as $c(1 - c) = \frac{1}{4} - \gamma^2$, for some $\gamma$. This leaves us with

$$\ell(H) \le n\left(1 - 4\gamma^2\right)^{\frac{T}{2}}.$$

In other words, the training loss is decreasing exponentially!

In fact, we can go even further and compute after how many iterations we must have zero training error. Note that the training loss is an upper bound on the training error (defined as $\sum_{i=1}^{n} \delta_{H(x_i) \neq y_i}$), simply because $\delta_{H(x_i) \neq y_i} \le e^{-y_i H(x_i)}$ in all cases. We can then compute the number of steps required until the loss is less than 1, which would imply that not a single training input is misclassified:

$$n\left(1 - 4\gamma^2\right)^{\frac{T}{2}} < 1 \;\Rightarrow\; T > \frac{2\log(n)}{\log\left(\frac{1}{1 - 4\gamma^2}\right)}.$$

This is an amazing result. It shows that after O(log(n)) iterations your training error must be zero. In
practice it often makes sense to keep boosting even after you make no more mistakes on the training set.
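To get a feeling for the constants, one can plug numbers into this bound (an illustration with a hypothetical dataset size and edge $\gamma$, not from the original notes):

```python
# Illustrative: rounds T needed until n * (1 - 4*gamma^2)^(T/2) drops below 1.
import numpy as np

n, gamma = 10_000, 0.1                       # hypothetical dataset size and edge
T = 2 * np.log(n) / np.log(1 / (1 - 4 * gamma**2))
print(int(np.ceil(T)))                       # -> 452 rounds suffice in this example
```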

Summary

Boosting is a great way to turn a weak classifier into a strong classifier. It defines a whole family of algorithms, including Gradient Boosting, AdaBoost, LogitBoost, and many others. Gradient Boosted Regression Trees is one of the most popular algorithms for Learning to Rank, the branch of machine learning focused on learning ranking functions, for example for web search engines. A few additional things to know:
The step size α is often referred to as shrinkage.
Some people do not consider gradient boosting algorithms to be part of the boosting family, because they
have no guarantee that the training error decreases exponentially. Often these algorithms are referred to as
stage-wise regression instead.
Inspired by Breiman's Bagging, stochastic gradient boosting subsamples the training data for each weak
learner. This combines the benefits of bagging and boosting. One variant is to subsample only n/2 data
points without replacement, which speeds up the training process.
One advantage of boosted classifiers is that during test time the computation $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ can be stopped prematurely if it becomes clear which way the prediction goes. This is particularly interesting in search engines, where the exact ranking of results typically only matters for the top 10 search results. Stopping the evaluation of lower-ranked search results can lead to tremendous speed-ups. A similar approach is also used by the Viola-Jones algorithm to speed up face detection in images. Here, the algorithm scans regions of an image to detect possible faces. As almost all regions in natural images do not contain faces, there are huge savings if the evaluation can be stopped after just a few weak learners are evaluated. These classifiers are referred to as cascades, which spend very little time on the common case (no face) but more time on the rare interesting case (face). With this approach Viola and Jones were the first to achieve face detection in real time on low-performance hardware (e.g. cameras).
AdaBoost is an extremely powerful algorithm that turns any weak learner able to classify any weighted version of the training set with below 0.5 error into a strong learner whose training error decreases exponentially and which requires only $O(\log(n))$ steps until it is consistent.
