Introduction to Boosting
Cynthia Rudin
PACM, Princeton University
Advisors: Ingrid Daubechies and Robert Schapire
Say you have a database of news articles…
[Figure: example articles from the database, each shown with its label of +1 or −1.]
where articles are labeled ‘+1’ if the category is “entertainment”,
and ‘-1’ otherwise.
Your goal: given a new article, find its label.
This is not easy: datasets are noisy and high dimensional.
Examples of Statistical Learning Tasks:
• Optical Character Recognition (OCR) (post office,
banks), object recognition in images.
• Bioinformatics (analysis of gene array data for tumor
detection, protein classification, etc.)
• Webpage classification (search engines), email filtering,
document retrieval
• Semantic classification for speech, automatic .mp3
sorting
• Time-series prediction (regression)
There are a huge number of applications, but all involve high dimensional data.
Examples of classification algorithms:
• SVMs (Support Vector Machines, i.e., large margin classifiers)
• Neural Networks
• Decision Trees / Decision Stumps (CART)
• RBF Networks
• Nearest Neighbors
• Bayes Net
Which is the best?
Depends on the amount and type of data, and on the application!
It's a tie between SVMs and boosted decision trees/stumps for general applications.
One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).
Training Data: {(x_i, y_i)}_{i=1..m}, where each (x_i, y_i) is chosen iid from an unknown probability distribution on X × {-1, 1}.
Here X is the "space of all possible articles" and {-1, 1} is the set of "labels".
Huge Question: Given a new random example x, can we predict
its correct label with high probability? That is, can we generalize
from our training data?
[Figure: the space X with training points labeled '+' and '−', and a new unlabeled point marked '?'.]
Yes!!! That’s what the field of statistical learning is all about.
The goal of statistical learning is to characterize points from an
unknown probability distribution when given a representative
sample from that distribution.
How do we construct a classifier?
• Divide the space X into two sections, based on the sign of a
function f : X→R.
• The decision boundary is the zero-level set of f, i.e., {x : f(x) = 0}.
[Figure: the curve f(x) = 0 divides X into a '+' region and a '−' region; the new point '?' is labeled according to the side on which it falls.]
Classifiers divide the space into two pieces for binary classification. Multiclass classification can always be reduced to binary.
Overview of Talk
• The Statistical Learning Problem (done)
• Introduction to Boosting and AdaBoost
• AdaBoost as Coordinate Descent
• The Margin Theory and Generalization
Say we have a “weak” learning algorithm:
• A weak learning algorithm produces weak classifiers.
• (Think of a weak classifier as a “rule of thumb”)
Examples of weak classifiers for the "entertainment" application:
h1(article) = +1 if the article contains the term "movie", −1 otherwise
h2(article) = +1 if the article contains the term "actor", −1 otherwise
h3(article) = +1 if the article contains the term "drama", −1 otherwise
Wouldn’t it be nice to combine the weak classifiers?
Boosting algorithms combine weak
classifiers in a meaningful way.
Example:
f(article) = sign[ 0.4 h1(article) + 0.3 h2(article) + 0.3 h3(article) ]
So if the article contains the term "movie" and the word "drama", but not the word "actor", the value of f is sign[0.4 − 0.3 + 0.3] = sign[0.4] = +1, so we label it +1.

A boosting algorithm takes as input:
- the weak learning algorithm which produces the weak classifiers
- a large training database
and outputs:
- the coefficients of the weak classifiers to make the combined classifier
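To make the weighted vote concrete, here is a minimal Python sketch of the toy example above. The keywords and the weights 0.4, 0.3, 0.3 come from the example; the helper names are our own, and a real boosting run would learn the weights rather than fix them by hand.

```python
# Sketch: combining keyword-based weak classifiers into a weighted vote.

def make_keyword_classifier(keyword):
    """Weak classifier: +1 if the article contains the keyword, -1 otherwise."""
    return lambda article: 1 if keyword in article.lower() else -1

h = [make_keyword_classifier(w) for w in ("movie", "actor", "drama")]
weights = [0.4, 0.3, 0.3]            # illustrative coefficients from the example above

def f(article):
    """Combined classifier: sign of the weighted vote of the weak classifiers."""
    vote = sum(w * h_j(article) for w, h_j in zip(weights, h))
    return 1 if vote >= 0 else -1

# Contains "movie" and "drama" but not "actor": vote = 0.4 - 0.3 + 0.3 = 0.4 > 0.
print(f("A new movie brings the classic drama to the screen"))   # -> 1 (entertainment)
```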
Two ways to use a Boosting Algorithm:
• As a way to increase the performance of already "strong" classifiers.
• Ex. neural networks, decision trees
• “On their own” with a really basic weak classifier
• Ex. decision stumps
AdaBoost
(Freund and Schapire ’95)
- Start with a uniform distribution ("weights") over the training examples.
  (The weights tell the weak learning algorithm which examples are important.)
- Request a weak classifier from the weak learning algorithm, h_j : X → {-1, 1}.
- Increase the weights on the training examples that were misclassified.
- (Repeat)
At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations:

f_final(x) = sign( λ_1 h1(x) + … + λ_n hn(x) )
AdaBoost
Define three important things:
d_t ∈ R^m := distribution ("weights") over the examples at time t, e.g.
d_t = [ 0.25  0.30  0.20  0.25 ]   (one weight per training example, i = 1, 2, 3, 4)
λ_t ∈ R^n := coefficients of the weak classifiers in the linear combination
f_t(x) = sign( λ_{t,1} h1(x) + … + λ_{t,n} hn(x) )
M ∈ R^{m×n} := matrix of hypotheses and data.
Its columns enumerate every possible weak classifier that the weak learning algorithm can produce, h1, …, hj, …, hn (e.g. "movie", "actor", "drama"); its rows are indexed by the m data points. Entry M_ij records whether weak classifier hj gets data point x_i right:

M_ij := hj(x_i) y_i = +1 if weak classifier hj classifies point x_i correctly, −1 otherwise.
The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost:

M  →  AdaBoost  →  λ_final      (the quantities d_t and λ_t are maintained internally)
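As an illustration, here is a small Python sketch of how M could be assembled for the keyword classifiers above, using a tiny hypothetical labeled dataset (the articles and labels below are made up for illustration; in practice M is far too large to build explicitly).

```python
import numpy as np

# Hypothetical tiny dataset: (article text, label), with +1 = "entertainment".
data = [
    ("a new movie with a famous actor", +1),
    ("drama on stage this weekend",     +1),
    ("stock markets fall sharply",      -1),
    ("new actor joins the cabinet",     -1),
]
keywords = ["movie", "actor", "drama"]          # one weak classifier per keyword

def h(keyword, article):
    """Weak classifier h_j: +1 if the article contains the keyword, -1 otherwise."""
    return 1 if keyword in article else -1

# M[i, j] = h_j(x_i) * y_i : +1 if h_j classifies example i correctly, -1 otherwise.
M = np.array([[h(k, x) * y for k in keywords] for (x, y) in data])
print(M)
```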
AdaBoost (Freund and Schapire '95)

λ_1 = 0                                                     (initialize coefficients to 0)
for t = 1 .. T_final:
    d_{t,i} = e^{−(Mλ_t)_i} / Σ_{i'=1}^m e^{−(Mλ_t)_{i'}}   for all i
                                                            (calculate the normalized distribution)
    j_t ∈ argmax_j (d_t^T M)_j                              (request a weak classifier from the
                                                             weak learning algorithm)
    r_t = (d_t^T M)_{j_t}
    α_t = (1/2) ln( (1 + r_t) / (1 − r_t) )
    λ_{t+1} = λ_t + α_t e_{j_t}                             (update the linear combination of
                                                             weak classifiers)
end for
In the algorithm above, r_t = (d_t^T M)_{j_t} is the "edge" (or "correlation") of weak classifier j_t:

(d_t^T M)_{j_t} = Σ_{i=1}^m d_{t,i} y_i h_{j_t}(x_i) = E_{d_t}[ y h_{j_t} ]
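A minimal, runnable Python rendering of the loop above, assuming M is small enough to enumerate (as in the toy sketch earlier); the variable names and the numerical clipping of r_t are our own choices.

```python
import numpy as np

def adaboost(M, T_final):
    """Minimal AdaBoost in the matrix notation above.

    M : (m, n) array with M[i, j] = h_j(x_i) * y_i (+1 if h_j is right on example i).
    Returns lambda, the coefficient vector of the weak classifiers after T_final rounds.
    """
    m, n = M.shape
    lam = np.zeros(n)                            # lambda_1 = 0
    for t in range(T_final):
        d = np.exp(-(M @ lam))
        d /= d.sum()                             # d_t: normalized distribution over examples
        edges = d @ M                            # (d_t^T M)_j for every weak classifier j
        j_t = int(np.argmax(edges))              # weak classifier with the largest edge
        r_t = edges[j_t]
        if r_t >= 1.0:                           # a perfect weak classifier would give an
            r_t = 1.0 - 1e-12                    # infinite step; clip for numerical safety
        alpha_t = 0.5 * np.log((1 + r_t) / (1 - r_t))
        lam[j_t] += alpha_t                      # lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

# Usage with the toy matrix M built earlier:
#   lam = adaboost(M, T_final=20)
#   A new article x is then labeled by sign(sum_j lam[j] * h_j(x)).
```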
AdaBoost as Coordinate Descent
Breiman, Mason et al., Duffy and Helmbold, etc. noticed that
AdaBoost is a coordinate descent algorithm.
• Coordinate descent is a minimization algorithm like gradient
descent, except that we only move along coordinates.
• We cannot calculate the gradient because of the high
dimensionality of the space!
• “coordinates” = weak classifiers
“distance to move in that direction” = the update αt
AdaBoost minimizes the following function via coordinate descent:

F(λ) := Σ_{i=1}^m e^{−(Mλ)_i}

Choose a direction:
    j_t ∈ argmax_j (d_t^T M)_j
Choose a distance to move in that direction:
    r_t = (d_t^T M)_{j_t}
    α_t = (1/2) ln( (1 + r_t) / (1 − r_t) )
    λ_{t+1} = λ_t + α_t e_{j_t}
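As a sanity check of this view (our own illustration on random data, using SciPy's minimize_scalar), the AdaBoost step length α_t coincides with an exact line search that minimizes F along coordinate j_t:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical random data, for illustration only.
rng = np.random.default_rng(0)
M = rng.choice([-1.0, 1.0], size=(20, 5))        # entries M_ij = h_j(x_i) y_i
lam = 0.1 * rng.normal(size=5)                   # some current iterate lambda_t

def F(v):
    """F(lambda) = sum_i exp(-(M lambda)_i)."""
    return np.exp(-(M @ v)).sum()

# AdaBoost's choice of coordinate and step length.
d = np.exp(-(M @ lam))
d /= d.sum()                                     # distribution d_t
edges = d @ M                                    # (d_t^T M)_j for every j
j_t = int(np.argmax(edges))
r_t = edges[j_t]
alpha_adaboost = 0.5 * np.log((1 + r_t) / (1 - r_t))

# Exact line search along the same coordinate e_{j_t}.
e_j = np.eye(M.shape[1])[j_t]
alpha_linesearch = minimize_scalar(lambda a: F(lam + a * e_j)).x

print(alpha_adaboost, alpha_linesearch)          # the two step lengths agree
```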
The function F(λ) := Σ_{i=1}^m e^{−(Mλ)_i} is convex:
1) If the data is non-separable by the weak classifiers, the minimizer
of F occurs when the size of λ is finite.
(This case is ok. AdaBoost converges to something we understand.)
2) If the data is separable, the minimum of F is 0
(This case is confusing!)
The original paper suggested that AdaBoost would probably
overfit…
But it didn’t in practice!
Why not?
The margin theory!
Boosting and Margins
• We want the boosted classifier (defined via λ) to
generalize well, i.e., we want it to perform well on
data that is not in the training set.
• The margin theory: The margin of a boosted
classifier indicates whether it will generalize well.
(Schapire et al. ‘98)
• Large margin classifiers work well in practice (but there's more to this story).
Think of the margin as the confidence of a prediction.
Generalization Ability of Boosted Classifiers
Can we guess whether a boosted classifier f generalizes well?
• We cannot calculate Pr_error(f) directly.
• Instead, minimize the right-hand side of a (loose) inequality such as this one (Schapire et al.):
When there are no training errors, with probability at least 1-δ,
Pr_error(f) ≤ O( [ (1/m) ( d log²(m/d) / (μ(f))² + log(1/δ) ) ]^{1/2} ),

where Pr_error(f) is the probability that classifier f makes an error on a random point x ∈ X, m is the number of training examples, μ(f) is the margin of f, and d is the VC dimension of the hypothesis space (d ≤ m).
The margin theory (Schapire et al. '98): when there are no training errors, with high probability,

Pr_error(f) ≤ Õ( sqrt(d/m) / μ(f) )

(same notation as above: d = VC dimension of the hypothesis space, d ≤ m; m = number of training examples; μ(f) = margin of f; Pr_error(f) = probability that f errs on a random point x ∈ X).
Large margin = better generalization = smaller probability of error
For boosting, the margin of the combined classifier f_λ (where f_λ := sign(λ_1 h1 + … + λ_n hn)) is defined by

margin := μ(f_λ) := min_i (Mλ)_i / ||λ||_1 .
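In the matrix notation above the margin is a one-liner; a sketch, with our own helper name:

```python
import numpy as np

def margin(M, lam):
    """mu(f_lambda) = min_i (M lambda)_i / ||lambda||_1 (notation as above).
    A positive margin means every training example is classified correctly."""
    return (M @ lam).min() / np.abs(lam).sum()
```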
Does AdaBoost produce maximum margin classifiers?
(AdaBoost was invented before the margin theory…)
(Grove and Schuurmans ’98)
- yes, empirically.
(Schapire, et al. ’98)
- proved AdaBoost achieves at least half the maximum possible
margin.
(Rätsch and Warmuth ’03)
- yes, empirically.
- improved the bound.
(R, Daubechies, Schapire ’04)
- no, it doesn’t.
AdaBoost performs mysteriously well!
AdaBoost performs better than algorithms that are designed to maximize the margin.
Still open:
• Why does AdaBoost work so well?
• Does AdaBoost converge?
• Better / more predictable boosting algorithms!