Introduction to Boosting
Cynthia Rudin
PACM, Princeton University
Advisors: Ingrid Daubechies and Robert Schapire
Say you have a database of news articles…
[Figure: example articles from the database, each shown with its label of +1 or −1.]
where articles are labeled ‘+1’ if the category is “entertainment”,
and ‘-1’ otherwise.
Your goal: given a new article, find its label.
This is not easy: datasets are noisy and high dimensional.
Examples of Statistical Learning Tasks:
• Optical Character Recognition (OCR) (post office,
banks), object recognition in images.
• Bioinformatics (analysis of gene array data for tumor
detection, protein classification, etc.)
• Webpage classification (search engines), email filtering,
document retrieval
• Semantic classification for speech, automatic .mp3
sorting
• Time-series prediction (regression)
There are a huge number of applications, but all involve high dimensional data.
Examples of classification algorithms:
• SVMs (Support Vector Machines, i.e., large margin classifiers)
• Neural Networks
• Decision Trees / Decision Stumps (CART)
• RBF Networks
• Nearest Neighbors
• Bayes Net
Which is the best?
Depends on the amount and type of data, and on the application!
It's a tie between SVMs and boosted decision trees/stumps for general applications.
One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).
Training Data: {(x_i, y_i)}_{i=1..m}, where each (x_i, y_i) is chosen iid from an unknown probability distribution on X × {-1, 1}.
Here X is the "space of all possible articles" and {-1, 1} is the set of "labels".
Huge Question: Given a new random example x, can we predict
its correct label with high probability? That is, can we generalize
from our training data?
[Figure: the space X with training points labeled '+' and '−', and a new unlabeled point marked '?'.]
Yes!!! That’s what the field of statistical learning is all about.
The goal of statistical learning is to characterize points from an
unknown probability distribution when given a representative
sample from that distribution.
How do we construct a classifier?
• Divide the space X into two sections, based on the sign of a
function f : X→R.
• The decision boundary is the zero-level set of f, i.e., {x : f(x) = 0}.
[Figure: the curve f(x) = 0 divides X into a '+' region and a '−' region; the new point '?' is labeled according to the side on which it falls.]
Classifiers divide the space into two pieces for binary classification. Multiclass classification can always be reduced to binary.
Overview of Talk
• The Statistical Learning Problem (done)
• Introduction to Boosting and AdaBoost
• AdaBoost as Coordinate Descent
• The Margin Theory and Generalization
Say we have a “weak” learning algorithm:
• A weak learning algorithm produces weak classifiers.
• (Think of a weak classifier as a “rule of thumb”)
Examples of weak classifiers for the "entertainment" application:
h1(article) = +1 if the article contains the term "movie", −1 otherwise
h2(article) = +1 if the article contains the term "actor", −1 otherwise
h3(article) = +1 if the article contains the term "drama", −1 otherwise
Wouldn’t it be nice to combine the weak classifiers?
Boosting algorithms combine weak
classifiers in a meaningful way.
Example:
f(article) = sign[ 0.4 h1(article) + 0.3 h2(article) + 0.3 h3(article) ]
So if the article contains the term "movie" and the word "drama", but not the word "actor", the value of f is sign[0.4 − 0.3 + 0.3] = sign[0.4] = +1, so we label it +1.

A boosting algorithm takes as input:
- the weak learning algorithm which produces the weak classifiers
- a large training database
and outputs:
- the coefficients of the weak classifiers to make the combined classifier
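To make the weighted vote concrete, here is a minimal Python sketch of the toy example above. The keywords and the weights 0.4, 0.3, 0.3 come from the example; the helper names are our own, and a real boosting run would learn the weights rather than fix them by hand.

```python
# Sketch: combining keyword-based weak classifiers into a weighted vote.

def make_keyword_classifier(keyword):
    """Weak classifier: +1 if the article contains the keyword, -1 otherwise."""
    return lambda article: 1 if keyword in article.lower() else -1

h = [make_keyword_classifier(w) for w in ("movie", "actor", "drama")]
weights = [0.4, 0.3, 0.3]            # illustrative coefficients from the example above

def f(article):
    """Combined classifier: sign of the weighted vote of the weak classifiers."""
    vote = sum(w * h_j(article) for w, h_j in zip(weights, h))
    return 1 if vote >= 0 else -1

# Contains "movie" and "drama" but not "actor": vote = 0.4 - 0.3 + 0.3 = 0.4 > 0.
print(f("A new movie brings the classic drama to the screen"))   # -> 1 (entertainment)
```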
Two ways to use a Boosting Algorithm:
• As a way to increase the performance of already "strong" classifiers.
• Ex. neural networks, decision trees
• “On their own” with a really basic weak classifier
• Ex. decision stumps
AdaBoost
(Freund and Schapire ’95)
- Start with a uniform distribution ("weights") over the training examples.
  (The weights tell the weak learning algorithm which examples are important.)
- Request a weak classifier from the weak learning algorithm, h_j : X → {-1, 1}.
- Increase the weights on the training examples that were misclassified.
- (Repeat)
At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations:

f_final(x) = sign( λ_1 h1(x) + … + λ_n hn(x) )
AdaBoost
Define three important things:
d_t ∈ R^m := distribution ("weights") over the examples at time t, e.g.
d_t = [ 0.25  0.30  0.20  0.25 ]   (one weight per training example, i = 1, 2, 3, 4)
λ_t ∈ R^n := coefficients of the weak classifiers in the linear combination
f_t(x) = sign( λ_{t,1} h1(x) + … + λ_{t,n} hn(x) )
M ∈ R^{m×n} := matrix of hypotheses and data.
Its columns enumerate every possible weak classifier that the weak learning algorithm can produce, h1, …, hj, …, hn (e.g. "movie", "actor", "drama"); its rows are indexed by the m data points. Entry M_ij records whether weak classifier hj gets data point x_i right:

M_ij := hj(x_i) y_i = +1 if weak classifier hj classifies point x_i correctly, −1 otherwise.
The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost:

M  →  AdaBoost  →  λ_final      (the quantities d_t and λ_t are maintained internally)
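As an illustration, here is a small Python sketch of how M could be assembled for the keyword classifiers above, using a tiny hypothetical labeled dataset (the articles and labels below are made up for illustration; in practice M is far too large to build explicitly).

```python
import numpy as np

# Hypothetical tiny dataset: (article text, label), with +1 = "entertainment".
data = [
    ("a new movie with a famous actor", +1),
    ("drama on stage this weekend",     +1),
    ("stock markets fall sharply",      -1),
    ("new actor joins the cabinet",     -1),
]
keywords = ["movie", "actor", "drama"]          # one weak classifier per keyword

def h(keyword, article):
    """Weak classifier h_j: +1 if the article contains the keyword, -1 otherwise."""
    return 1 if keyword in article else -1

# M[i, j] = h_j(x_i) * y_i : +1 if h_j classifies example i correctly, -1 otherwise.
M = np.array([[h(k, x) * y for k in keywords] for (x, y) in data])
print(M)
```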
AdaBoost (Freund and Schapire '95)

λ_1 = 0                                                     (initialize coefficients to 0)
for t = 1 .. T_final:
    d_{t,i} = e^{−(Mλ_t)_i} / Σ_{i'=1}^m e^{−(Mλ_t)_{i'}}   for all i
                                                            (calculate the normalized distribution)
    j_t ∈ argmax_j (d_t^T M)_j                              (request a weak classifier from the
                                                             weak learning algorithm)
    r_t = (d_t^T M)_{j_t}
    α_t = (1/2) ln( (1 + r_t) / (1 − r_t) )
    λ_{t+1} = λ_t + α_t e_{j_t}                             (update the linear combination of
                                                             weak classifiers)
end for
In the algorithm above, r_t = (d_t^T M)_{j_t} is the "edge" (or "correlation") of weak classifier j_t:

(d_t^T M)_{j_t} = Σ_{i=1}^m d_{t,i} y_i h_{j_t}(x_i) = E_{d_t}[ y h_{j_t} ]
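A minimal, runnable Python rendering of the loop above, assuming M is small enough to enumerate (as in the toy sketch earlier); the variable names and the numerical clipping of r_t are our own choices.

```python
import numpy as np

def adaboost(M, T_final):
    """Minimal AdaBoost in the matrix notation above.

    M : (m, n) array with M[i, j] = h_j(x_i) * y_i (+1 if h_j is right on example i).
    Returns lambda, the coefficient vector of the weak classifiers after T_final rounds.
    """
    m, n = M.shape
    lam = np.zeros(n)                            # lambda_1 = 0
    for t in range(T_final):
        d = np.exp(-(M @ lam))
        d /= d.sum()                             # d_t: normalized distribution over examples
        edges = d @ M                            # (d_t^T M)_j for every weak classifier j
        j_t = int(np.argmax(edges))              # weak classifier with the largest edge
        r_t = edges[j_t]
        if r_t >= 1.0:                           # a perfect weak classifier would give an
            r_t = 1.0 - 1e-12                    # infinite step; clip for numerical safety
        alpha_t = 0.5 * np.log((1 + r_t) / (1 - r_t))
        lam[j_t] += alpha_t                      # lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

# Usage with the toy matrix M built earlier:
#   lam = adaboost(M, T_final=20)
#   A new article x is then labeled by sign(sum_j lam[j] * h_j(x)).
```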
AdaBoost as Coordinate Descent
Breiman, Mason et al., Duffy and Helmbold, etc. noticed that
AdaBoost is a coordinate descent algorithm.
• Coordinate descent is a minimization algorithm like gradient
descent, except that we only move along coordinates.
• We cannot calculate the gradient because of the high
dimensionality of the space!
• “coordinates” = weak classifiers
“distance to move in that direction” = the update αt
AdaBoost minimizes the following function via coordinate descent:

F(λ) := Σ_{i=1}^m e^{−(Mλ)_i}

Choose a direction:
    j_t ∈ argmax_j (d_t^T M)_j
Choose a distance to move in that direction:
    r_t = (d_t^T M)_{j_t}
    α_t = (1/2) ln( (1 + r_t) / (1 − r_t) )
    λ_{t+1} = λ_t + α_t e_{j_t}
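As a sanity check of this view (our own illustration on random data, using SciPy's minimize_scalar), the AdaBoost step length α_t coincides with an exact line search that minimizes F along coordinate j_t:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical random data, for illustration only.
rng = np.random.default_rng(0)
M = rng.choice([-1.0, 1.0], size=(20, 5))        # entries M_ij = h_j(x_i) y_i
lam = 0.1 * rng.normal(size=5)                   # some current iterate lambda_t

def F(v):
    """F(lambda) = sum_i exp(-(M lambda)_i)."""
    return np.exp(-(M @ v)).sum()

# AdaBoost's choice of coordinate and step length.
d = np.exp(-(M @ lam))
d /= d.sum()                                     # distribution d_t
edges = d @ M                                    # (d_t^T M)_j for every j
j_t = int(np.argmax(edges))
r_t = edges[j_t]
alpha_adaboost = 0.5 * np.log((1 + r_t) / (1 - r_t))

# Exact line search along the same coordinate e_{j_t}.
e_j = np.eye(M.shape[1])[j_t]
alpha_linesearch = minimize_scalar(lambda a: F(lam + a * e_j)).x

print(alpha_adaboost, alpha_linesearch)          # the two step lengths agree
```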
The function F(λ) := Σ_{i=1}^m e^{−(Mλ)_i} is convex:
1) If the data is non-separable by the weak classifiers, the minimizer
of F occurs when the size of λ is finite.
(This case is ok. AdaBoost converges to something we understand.)
2) If the data is separable, the minimum of F is 0
(This case is confusing!)
The original paper suggested that AdaBoost would probably
overfit…
But it didn’t in practice!
Why not?
The margin theory!
Boosting and Margins
• We want the boosted classifier (defined via λ) to
generalize well, i.e., we want it to perform well on
data that is not in the training set.
• The margin theory: The margin of a boosted
classifier indicates whether it will generalize well.
(Schapire et al. ‘98)
• Large margin classifiers work well in practice (but there's more to this story).
Think of the margin as the confidence of a prediction.
Generalization Ability of Boosted Classifiers
Can we guess whether a boosted classifier f generalizes well?
• We cannot calculate Pr_error(f) directly.
• Instead, minimize the right-hand side of a (loose) inequality such as this one (Schapire et al.):
When there are no training errors, with probability at least 1-δ,
Pr_error(f) ≤ O( [ (1/m) ( d log²(m/d) / (μ(f))² + log(1/δ) ) ]^{1/2} ),

where Pr_error(f) is the probability that classifier f makes an error on a random point x ∈ X, m is the number of training examples, μ(f) is the margin of f, and d is the VC dimension of the hypothesis space (d ≤ m).
The margin theory (Schapire et al. '98): when there are no training errors, with high probability,

Pr_error(f) ≤ Õ( sqrt(d/m) / μ(f) )

(same notation as above: d = VC dimension of the hypothesis space, d ≤ m; m = number of training examples; μ(f) = margin of f; Pr_error(f) = probability that f errs on a random point x ∈ X).
Large margin = better generalization = smaller probability of error
For boosting, the margin of the combined classifier f_λ (where f_λ := sign(λ_1 h1 + … + λ_n hn)) is defined by

margin := μ(f_λ) := min_i (Mλ)_i / ||λ||_1 .
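In the matrix notation above the margin is a one-liner; a sketch, with our own helper name:

```python
import numpy as np

def margin(M, lam):
    """mu(f_lambda) = min_i (M lambda)_i / ||lambda||_1 (notation as above).
    A positive margin means every training example is classified correctly."""
    return (M @ lam).min() / np.abs(lam).sum()
```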
Does AdaBoost produce maximum margin classifiers?
(AdaBoost was invented before the margin theory…)
(Grove and Schuurmans ’98)
- yes, empirically.
(Schapire, et al. ’98)
- proved AdaBoost achieves at least half the maximum possible
margin.
(Rätsch and Warmuth ’03)
- yes, empirically.
- improved the bound.
(R, Daubechies, Schapire ’04)
- no, it doesn’t.
AdaBoost performs mysteriously well!
AdaBoost performs better than algorithms that are designed to maximize the margin.
Still open:
• Why does AdaBoost work so well?
• Does AdaBoost converge?
• Better / more predictable boosting algorithms!