MACHINE LEARNING –UNIT 2
OUTLINE
Logistic regression
Exponential family
Naïve Bayes
Support vector machines
Combining classifiers: bagging, boosting (the AdaBoost algorithm)
Evaluating and debugging learning algorithms
Classification errors.
LOGISTIC REGRESSION
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model
WHY USE LOGISTIC REGRESSION?
There are many important research topics for
which the dependent variable is "limited."
For example, voting, morbidity or mortality,
and participation data are not continuous or
normally distributed.
Binary logistic regression is a type of
regression analysis where the dependent
variable is a dummy variable: coded 0 (did not
vote) or 1 (did vote)
THE LINEAR PROBABILITY
MODEL
In the OLS regression:
Y = α + βX + e, where Y ∈ {0, 1}
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0
THE LOGISTIC REGRESSION
MODEL
The "logit" model:
ln[p/(1−p)] = α + βX + e
p is the probability that the event Y occurs, p(Y=1)
p/(1−p) is the "odds"
ln[p/(1−p)] is the log odds, or "logit"
More:
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(−α − βX)]
If α + βX = 0, then p = .50
As α + βX gets very large, p approaches 1
As α + βX gets very small (very negative), p approaches 0
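A minimal sketch (assuming NumPy is available and using purely illustrative values for α and β) of how the logistic function maps α + βX to a probability between 0 and 1:

import numpy as np

def logistic_probability(alpha, beta, x):
    # Estimated probability p(Y=1) = 1 / (1 + exp(-(alpha + beta*x)))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

alpha, beta = -1.0, 0.5   # hypothetical coefficients, not estimated from data
for x in [-10, 0, 2, 10]:
    print(x, logistic_probability(alpha, beta, x))
# When alpha + beta*x = 0 the probability is 0.5; large positive values push it
# toward 1, large negative values toward 0.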
COMPARING LP AND LOGIT MODELS
[Figure: predicted probability versus X for the linear probability (LP) model and the logit model; the LP line can leave the 0–1 range, while the logit curve stays between 0 and 1.]
MAXIMUM LIKELIHOOD
ESTIMATION (MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that
occur in the sample:
L = Prob(p1 · p2 · ⋯ · pn)
The higher L is, the higher the probability of
observing the p's in the sample.
MLE involves finding the coefficients (α, β)
that make the log of the likelihood function
(LL < 0) as large as possible,
or, equivalently, finding the coefficients that
make −2 times the log of the likelihood
function (−2LL) as small as possible.
The maximum likelihood estimates solve the
following condition:
Σi {Yi − p(Yi = 1)} Xi = 0
summed over all observations, i = 1, …, n
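A minimal sketch (assuming NumPy and a small synthetic dataset) of estimating α and β by maximizing the log likelihood with gradient ascent; the gradient is exactly the condition above, Σ (y − p) x:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_alpha, true_beta = -0.5, 1.5                 # illustrative "true" values
p = 1 / (1 + np.exp(-(true_alpha + true_beta * x)))
y = rng.binomial(1, p)

alpha, beta = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(alpha + beta * x)))
    # Gradient of the (mean) log likelihood: (y - p_hat) times each regressor
    alpha += lr * np.mean(y - p_hat)
    beta += lr * np.mean((y - p_hat) * x)

print(alpha, beta)   # should land near the true values for enough data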
INTERPRETING COEFFICIENTS
Since ln[p/(1−p)] = α + βX + e,
the slope coefficient (β) is interpreted as the rate of
change in the "log odds" as X changes.
Since
p = 1/[1 + exp(−α − βX)],
an interpretation of the logit coefficient that is
usually more intuitive is the "odds ratio".
Since p/(1−p) = exp(α + βX),
exp(β) is the odds ratio: the multiplicative effect of a
one-unit increase in the independent variable on the odds.
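A small illustration (with a hypothetical coefficient) of the odds-ratio reading of exp(β):

import numpy as np

beta = 0.7                    # hypothetical logit coefficient
odds_ratio = np.exp(beta)     # ~2.01: each one-unit increase in X roughly doubles the odds
print(odds_ratio)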
LDA VS. LOGISTIC
REGRESSION
LDA (Generative model)
Assumes Gaussian class-conditional densities and a common covariance
Model parameters are estimated by maximizing the full log likelihood,
parameters for each class are estimated independently of other classes,
Kp+p(p+1)/2+(K-1) parameters
Makes use of marginal density information Pr(X)
Easier to train, low variance, more efficient if model is correct
Higher asymptotic error, but converges faster
Logistic Regression (Discriminative model)
Assumes class-conditional densities are members of the (same) exponential
family distribution
Model parameters are estimated by maximizing the conditional log likelihood,
simultaneous consideration of all other classes, (K-1)(p+1) parameters
Ignores marginal density information Pr(X)
Harder to train, robust to uncertainty about the data generation process
Lower asymptotic error, but converges more slowly
EXPONENTIAL FAMILY
EXPONENTIAL FAMILY - GAUSSIAN
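(The equations on these slides did not survive extraction. As a brief standard reminder rather than a reconstruction of the originals: a distribution is in the exponential family if it can be written as p(y; η) = b(y) exp(ηᵀ T(y) − a(η)), where η is the natural parameter, T(y) the sufficient statistic, a(η) the log-partition function, and b(y) the base measure. The Gaussian, Bernoulli, Poisson, and multinomial distributions are all members; the Bernoulli case, with η = ln(φ/(1−φ)), is what links the exponential family to logistic regression.)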
GENERATIVE MODEL - SUPERVISED
GAUSSIAN
GENERATIVE MODEL - UNSUPERVISED
DISCRIMINATIVE MODEL - SUPERVISED
GENERATIVE VS.
DISCRIMINATIVE LEARNING
Example: Generative = Linear Discriminant Analysis; Discriminative = Logistic Regression
Objective function: Generative = full log likelihood, Σi log p(xi, yi); Discriminative = conditional log likelihood, Σi log p(yi | xi)
Model assumptions: Generative = class densities p(x | y = k), e.g. Gaussian in LDA; Discriminative = discriminant functions δk(x)
Parameter estimation: Generative = "easy", a single sweep; Discriminative = "hard", iterative optimization
Advantages: Generative = more efficient if the model is correct, borrows strength from p(x); Discriminative = more flexible, robust because of fewer assumptions
Disadvantages: Generative = biased if the model is incorrect; Discriminative = may also be biased, ignores information in p(x)
NAÏVE BAYES
• Bayes classification
P(c | x) ∝ P(x | c) P(c) = P(x1, …, xn | c) P(c), for c = c1, …, cL.
Difficulty: learning the joint probability P(x1, …, xn | c) is infeasible!
• Naïve Bayes classification
– Assume all input features are class-conditionally independent! Applying the independence assumption:
P(x1, x2, …, xn | c) = P(x1 | x2, …, xn, c) P(x2, …, xn | c)
= P(x1 | c) P(x2, …, xn | c)
= P(x1 | c) P(x2 | c) ⋯ P(xn | c)
– Apply the MAP classification rule: for a test instance x' = (a1, a2, …, an), assign x' to c* if
[P(a1 | c*) ⋯ P(an | c*)] P(c*) > [P(a1 | c) ⋯ P(an | c)] P(c), for all c ≠ c*, c ∈ {c1, …, cL}
(each bracketed product is the estimate of P(a1, …, an | c*) or of P(a1, …, an | c)).
NAÏVE BAYES
Learning phase: for each target value ci (ci ∈ {c1, …, cL}),
P̂(ci) ← estimate P(ci) with examples in S;
for every feature value xjk of each feature xj (j = 1, …, F; k = 1, …, Nj),
P̂(xj = xjk | ci) ← estimate P(xjk | ci) with examples in S.
Test phase: given an instance x' = (a1, …, an), assign x' to c* if
[P̂(a1 | c*) ⋯ P̂(an | c*)] P̂(c*) > [P̂(a1 | ci) ⋯ P̂(an | ci)] P̂(ci), for all ci ≠ c*, ci ∈ {c1, …, cL}.
EXAMPLE
• Example: Play Tennis
EXAMPLE
• Learning Phase
Outlook     Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny       2/9       3/5          Hot          2/9       2/5
Overcast    4/9       0/5          Mild         4/9       2/5
Rain        3/9       2/5          Cool         3/9       1/5

Humidity    Play=Yes  Play=No      Wind         Play=Yes  Play=No
High        3/9       4/5          Strong       3/9       3/5
Normal      6/9       1/5          Weak         6/9       2/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
EXAMPLE
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– Decision making with the MAP rule
P(Yes|x') ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x') ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x') < P(No|x'), we label x' as "No".
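A minimal sketch of the same test-phase computation, with the conditional probabilities from the learning-phase tables hard-coded for this example:

# Probabilities taken from the learning-phase tables above
p_yes, p_no = 9/14, 5/14
cond_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
cond_no = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}

x_new = ["Sunny", "Cool", "High", "Strong"]
score_yes, score_no = p_yes, p_no
for a in x_new:
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(round(score_yes, 4), round(score_no, 4))   # 0.0053 0.0206
print("Yes" if score_yes > score_no else "No")   # "No"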
NAÏVE BAYES
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take infinitely many values, so a probability table cannot be built
– The conditional probability is often modeled with the normal distribution:
P̂(xj | ci) = (1 / (√(2π) σji)) exp(−(xj − μji)² / (2 σji²))
μji : mean (average) of feature values xj of examples for which c = ci
σji : standard deviation of feature values xj of examples for which c = ci
– Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL,
output n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: given an unknown instance X' = (a1, …, an)
• Instead of looking up tables, calculate the conditional probabilities with the
normal distributions obtained in the learning phase
• Apply the MAP rule to assign a label (the same as in the discrete case)
NAÏVE BAYES
• Example: Continuous-valued Features
– Temperature is naturally a continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and standard deviation for each class:
μ = (1/N) Σn xn,   σ² = (1/(N−1)) Σn (xn − μ)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09
– Learning Phase: output two Gaussian models for P(temp | C):
P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²)) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / 11.09)
P̂(x | No) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²)) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / 100.54)
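A minimal sketch (NumPy assumed) of fitting these two Gaussians and evaluating P̂(temp | C) for a new temperature reading:

import numpy as np

temp_yes = np.array([25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8])
temp_no = np.array([27.3, 30.1, 17.4, 29.5, 15.1])

def gaussian_pdf(x, mu, sigma):
    # Normal density used for a continuous-valued feature
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu_yes, sd_yes = temp_yes.mean(), temp_yes.std(ddof=1)   # ~21.64, ~2.35
mu_no, sd_no = temp_no.mean(), temp_no.std(ddof=1)       # ~23.88, ~7.09

x = 22.0   # a new temperature reading
print(gaussian_pdf(x, mu_yes, sd_yes), gaussian_pdf(x, mu_no, sd_no))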
ZERO CONDITIONAL PROBABILITY
• If no training example contains a particular feature value
– In this circumstance we face a zero-conditional-probability problem at test time:
P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0 whenever P̂(ajk | ci) = 0 for xj = ajk
– As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:
P̂(ajk | ci) = (nc + m·p) / (n + m)
nc : number of training examples for which xj = ajk and c = ci
n : number of training examples for which c = ci
p : prior estimate (usually, p = 1/t for t possible values of xj)
m : weight given to the prior (the number of "virtual" examples, m ≥ 1)
ZERO CONDITIONAL PROBABILITY
• Example: P(Outlook=Overcast | No) = 0 in the play-tennis dataset
– Add m "virtual" examples (m: up to 1% of the number of training examples)
• In this dataset, the number of training examples for the "No" class is 5.
• We therefore add only m = 1 "virtual" example in our m-estimate remedy.
– The "Outlook" feature can take only 3 values, so p = 1/3.
– Re-estimate P(Outlook=Overcast | No) with the m-estimate.
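A small sketch of the re-estimate, using the values from the example above:

def m_estimate(n_c, n, p, m):
    # m-estimate of a class-conditional probability: (n_c + m*p) / (n + m)
    return (n_c + m * p) / (n + m)

# Outlook=Overcast never occurs with Play=No: n_c = 0, n = 5, p = 1/3, m = 1
print(m_estimate(0, 5, 1/3, 1))   # ~0.056 instead of 0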
SUPPORT VECTOR MACHINES
The Scalar Product
The scalar or dot product is, in some sense, a measure of similarity between two vectors a and b:
a · b = |a| |b| cos θ
where θ is the angle between a and b.
DECISION FUNCTION FOR BINARY CLASSIFICATION
f(x) ∈ R
f(xi) ≥ 0 ⇒ yi = +1
f(xi) < 0 ⇒ yi = −1
SUPPORT VECTOR MACHINES
SVMs pick best separating hyperplane according
to some criterion
e.g. maximum margin
Training process is an optimisation
Training set is effectively reduced to a relatively
small number of support vectors
FEATURE SPACES
We may separate data by mapping to a higher-
dimensional feature space
The feature space may even have an infinite
number of dimensions!
We need not explicitly construct the new feature
space
KERNELS
We may use Kernel functions to implicitly map
to a new feature space
Kernel function: K(x1, x2) ∈ R
The kernel must be equivalent to an inner product in
some feature space
EXAMPLE KERNELS
Linear: x · z
Polynomial: (x · z)^P
Gaussian: exp(−‖x − z‖² / (2σ²))
PERCEPTRON REVISITED: LINEAR
SEPARATORS
Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0
f(x) = sign(wTx + b)
WHICH OF THE LINEAR SEPARATORS
IS OPTIMAL?
BEST LINEAR SEPARATOR?
FIND CLOSEST POINTS IN CONVEX HULLS
[Figure: the closest points c and d in the convex hulls of the two classes.]
PLANE BISECTS CLOSEST POINTS
The separating plane wᵀx + b = 0 bisects the segment between the closest points, with normal w = d − c.
CLASSIFICATION MARGIN
Distance from an example x to the separator is r = |wᵀx + b| / ‖w‖
Data closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between classes.
MAXIMUM MARGIN CLASSIFICATION
Maximizing the margin is good according to intuition and
theory.
Implies that only support vectors are important; other training
examples are ignorable.
STATISTICAL LEARNING THEORY
Misclassification error and the function complexity
bound the generalization error.
Maximizing margins minimizes complexity.
"Eliminates" overfitting.
The solution depends only on the support vectors, not
on the number of attributes.
MARGINS AND COMPLEXITY
Skinny margin
is more flexible
thus more complex.
MARGINS AND COMPLEXITY
Fat margin
is less complex.
LINEAR SVM MATHEMATICALLY
Assuming all data is at distance larger than 1 from the
hyperplane, the following two constraints follow for a
training set {(xi ,yi)}
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ -1 if yi = -1
For support vectors, the inequality becomes an
equality; then, since each example's distance from the
hyperplane is
r = y (wᵀx + b) / ‖w‖
the margin is:
ρ = 2 / ‖w‖
LINEAR SVMS MATHEMATICALLY
(CONT.)
Then we can formulate the quadratic optimization
problem:
Find w and b such that ρ = 2 / ‖w‖ is maximized and, for all {(xi, yi)}:
wᵀxi + b ≥ 1 if yi = 1;  wᵀxi + b ≤ -1 if yi = -1
An equivalent formulation:
Find w and b such that
Φ(w) = ½ wᵀw is minimized and for all {(xi, yi)}
yi (wᵀxi + b) ≥ 1
SOLVING THE OPTIMIZATION PROBLEM
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
Quadratic optimization problems are a well-known class of
mathematical programming problems, and many (rather
intricate) algorithms exist for solving them.
The solution involves constructing a dual problem where a
Lagrange multiplier αi is associated with every constraint in the
primary problem:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
THE OPTIMIZATION PROBLEM
SOLUTION
The solution has the form:
w = Σαiyixi     b = yk - wᵀxk for any xk such that αk ≠ 0
Each non-zero αi indicates that corresponding xi is a support vector.
Then the classifying function will have the form:
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x
and the support vectors xi – we will return to this later!
Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points!
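A minimal sketch (assuming scikit-learn and a tiny illustrative dataset) that fits a linear SVM and reconstructs w = Σαiyixi and b from the support vectors, just to show the form of the solution:

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative data, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_     # w = sum_i alpha_i y_i x_i
b = clf.intercept_
print(w, b, clf.support_vectors_)
print(np.sign(X @ w.ravel() + b) == y)        # f(x) = sign(w^T x + b) separates the training set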
SOFT MARGIN CLASSIFICATION
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification
of difficult or noisy examples.
SOFT MARGIN CLASSIFICATION
MATHEMATICALLY
The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
The new formulation incorporating slack variables:
Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i
Parameter C can be viewed as a way to control overfitting.
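A brief sketch (scikit-learn assumed, with a hypothetical noisy dataset) of how C trades margin width against slack: small C tolerates more violations, large C penalizes them heavily:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # margin width 2/||w||
    print(C, round(margin, 3), len(clf.support_))     # smaller C typically gives a wider
                                                      # margin and more support vectors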
SOFT MARGIN CLASSIFICATION –
SOLUTION
The dual problem for soft margin classification:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Neither slack variables ξi nor their Lagrange multipliers
appear in the dual problem!
Again, xi with non-zero αi will be support vectors.
Solution to the dual problem is:
w = Σαiyixi
b = yk(1 - ξk) - wᵀxk, where k = argmaxk αk
f(x) = Σαiyixiᵀx + b
THEORETICAL JUSTIFICATION FOR
MAXIMUM MARGINS
Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded
from above as
h ≤ min( ⌈D² / ρ²⌉, m0 ) + 1
where ρ is the margin, D is the diameter of the smallest sphere that
can enclose all of the training examples, and m0 is the
dimensionality.
Intuitively, this implies that regardless of dimensionality m0 we can
minimize the VC dimension by maximizing the margin ρ.
Thus, complexity of the classifier is kept small regardless of
dimensionality.
LINEAR SVMS: OVERVIEW
The classifier is a separating hyperplane.
Most “important” training points are support vectors; they
define the hyperplane.
Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrange
multipliers αi.
Both in the dual formulation of the problem and in the solution,
training points appear only inside inner products:
f(x) = Σαiyixiᵀx + b
EXAMPLE
NON-LINEAR SVMS
Datasets that are linearly separable with some noise
work out great.
But what are we going to do if the dataset is just too
hard to separate in the original space?
How about mapping the data to a higher-dimensional
space, e.g. x → (x, x²)?
NONLINEAR CLASSIFICATION
x = (a, b)
x · w = w1 a + w2 b
φ(x) = (a, b, ab, a², b²)
φ(x) · w = w1 a + w2 b + w3 ab + w4 a² + w5 b²
NON-LINEAR SVMS: FEATURE SPACES
General idea: the original feature space can always be
mapped to some higher-dimensional feature space where
the training set is separable:
Φ: x → φ(x)
THE “KERNEL TRICK”
The linear classifier relies on inner product between vectors
K(xi,xj)=xiTxj
If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is some function that corresponds to an inner product
into some feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiᵀxj)²
Need to show that K(xi,xj) = φ(xi)ᵀφ(xj):
K(xi,xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2xi1  √2xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2xj1  √2xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2x1  √2x2]
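A quick numerical check (NumPy assumed) that the explicit map φ really reproduces the polynomial kernel:

import numpy as np

def phi(x):
    # Explicit feature map for the kernel (1 + x^T z)^2 in two dimensions
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

lhs = (1 + xi @ xj) ** 2     # kernel evaluated directly
rhs = phi(xi) @ phi(xj)      # inner product in the mapped feature space
print(lhs, rhs)              # both equal 4.0 here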
Positive Definite Matrices
A square matrix A is positive definite if xTAx>0 for
all nonzero column vectors x.
It is negative definite if xTAx < 0 for all nonzero x.
It is positive semi-definite if xᵀAx ≥ 0 for all x,
and negative semi-definite if xᵀAx ≤ 0 for all x.
WHAT FUNCTIONS ARE KERNELS?
For some functions K(xi,xj), checking that
K(xi,xj) = φ(xi)ᵀφ(xj) can be cumbersome.
Mercer's theorem:
Every positive semi-definite symmetric function is a
kernel
Positive semi-definite symmetric functions correspond
to a positive semi-definite symmetric Gram matrix:
K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xN)
K= K(x2,x1) K(x2,x2) K(x2,x3) K(x2,xN)
… … … … …
K(xN,x1) K(xN,x2) K(xN,x3) … K(xN,xN)
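A small sketch (NumPy assumed) that builds the Gram matrix for the Gaussian kernel on random points and confirms its eigenvalues are non-negative, as Mercer's theorem predicts:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # 20 random points in 3 dimensions
sigma = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True: the Gram matrix is positive semi-definite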
EXAMPLES OF KERNEL FUNCTIONS
Linear: K(xi,xj) = xiᵀxj
Polynomial of power p: K(xi,xj) = (1 + xiᵀxj)^p
Gaussian (radial-basis function network): K(xi,xj) = exp(−‖xi − xj‖² / (2σ²))
Two-layer perceptron: K(xi,xj) = tanh(β0 xiᵀxj + β1)
NON-LINEAR SVMS MATHEMATICALLY
Dual problem formulation:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = Σ αi yi K(xi, x) + b
Optimization techniques for finding the αi's remain the same!
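A minimal sketch (scikit-learn assumed) of a kernelized SVM on data that no linear separator can handle, using the Gaussian (RBF) kernel:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print(linear.score(X, y))   # typically around chance level
print(rbf.score(X, y))      # near 1.0: the kernel implicitly maps to a separable space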
EXAMPLE
SVM APPLICATIONS
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis
[Schölkopf et al. ’99], etc.
Most popular optimization algorithms for SVMs are SMO [Platt ’99]
and SVMlight [Joachims’ 99], both use decomposition to hill-climb over
a subset of αi’s at a time.
Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
SVM EXTENSIONS
Regression
Variable Selection
Boosting
Density Estimation
Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
LEARNING ENSEMBLES
Learn multiple alternative definitions of a concept using
different training data or different learning algorithms.
Combine decisions of multiple definitions, e.g. using
weighted voting.
Training Data → Data1, Data2, …, Data m
Data1 → Learner1 → Model1;  Data2 → Learner2 → Model2;  …;  Data m → Learner m → Model m
Model1, Model2, …, Model m → Model Combiner → Final Model
VALUE OF ENSEMBLES
When combining multiple independent and diverse
decisions, each of which is at least more accurate than
random guessing, random errors cancel each other out
and correct decisions are reinforced.
Human ensembles are demonstrably better
How many jelly beans in the jar?: Individual estimates vs.
group average.
Who Wants to be a Millionaire: Expert friend vs. audience
vote.
HOMOGENEOUS ENSEMBLES
Use a single, arbitrary learning algorithm but
manipulate training data to make it learn
multiple models.
Data1 Data2 … Data m
Learner1 = Learner2 = … = Learner m
Different methods for changing training data:
Bagging: Resample training data
Boosting: Reweight training data
DECORATE: Add additional artificial training data
In WEKA, these are called meta-learners; they
take a learning algorithm as an argument (the base
learner) and create a new learning algorithm.
BAGGING
Create ensembles by repeatedly randomly resampling the
training data (Breiman, 1996).
Given a training set of size n, create m samples of size n
by drawing n examples from the original data, with
replacement.
Each bootstrap sample will on average contain 63.2% of the
unique training examples; the rest are replicates.
Combine the m resulting models using simple majority
vote.
Decreases error by decreasing the variance in the results
due to unstable learners, algorithms (like decision trees)
whose output can change dramatically when the training
data is slightly changed.
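A minimal sketch (scikit-learn assumed) of bagging: m bootstrap samples of size n, one decision tree per sample, and a simple majority vote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
n, m = len(X), 25
rng = np.random.default_rng(0)

models = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)          # draw n examples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Simple majority vote over the m models
votes = np.array([model.predict(X) for model in models])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y).mean())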
BOOSTING
Originally developed by computational learning theorists
to guarantee performance improvements on fitting
training data for a weak learner that only needs to
generate a hypothesis with a training accuracy greater
than 0.5 (Schapire, 1990).
Revised to be a practical algorithm, AdaBoost, for
building ensembles that empirically improves
generalization performance (Freund & Schapire, 1996).
Examples are given weights. At each iteration, a new
hypothesis is learned and the examples are reweighted to
focus the system on examples that the most recently
learned classifier got wrong.
BOOSTING: BASIC ALGORITHM
General Loop:
Set all examples to have equal uniform weights.
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples
Decrease the weights of examples ht classifies correctly
Base (weak) learner must focus on correctly classifying the
most highly weighted examples while strongly avoiding over-
fitting.
During testing, each of the T hypotheses get a weighted vote
proportional to their accuracy on the training data.
ADABOOST PSEUDOCODE
TrainAdaBoost(D, BaseLearn)
For each example di in D let its weight wi=1/|D|
Let H be an empty set of hypotheses
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples: ht=BaseLearn(D)
Add ht to H
Calculate the error, εt, of the hypothesis ht as the total sum weight of the
examples that it classifies incorrectly.
If εt > 0.5 then exit loop, else continue.
Let βt = εt / (1 – εt )
Multiply the weights of the examples that ht classifies correctly by βt
Rescale the weights of all of the examples so the total sum weight remains 1.
Return H
TestAdaBoost(ex, H)
Let each hypothesis, ht, in H vote for ex’s classification with weight log(1/ βt )
Return the class with the highest weighted vote total.
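A minimal Python sketch of the pseudocode above, using decision stumps from scikit-learn as an assumed base learner; the weight update and the log(1/β) voting follow the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
n, T = len(X), 20
w = np.full(n, 1.0 / n)                    # equal initial weights
hypotheses, betas = [], []

for _ in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()               # weighted training error
    if err > 0.5 or err == 0:              # also stop if error is zero (beta would be 0)
        break
    beta = err / (1.0 - err)
    w[pred == y] *= beta                   # down-weight correctly classified examples
    w /= w.sum()                           # rescale so the total weight stays 1
    hypotheses.append(h)
    betas.append(beta)

# Weighted vote: each hypothesis votes with weight log(1 / beta_t)
votes_for_1 = sum(np.log(1.0 / b) * (h.predict(X) == 1) for h, b in zip(hypotheses, betas))
total = sum(np.log(1.0 / b) for b in betas)
final_pred = (votes_for_1 > total / 2).astype(int)
print((final_pred == y).mean())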
LEARNING WITH WEIGHTED
EXAMPLES
Generic approach is to replicate examples in the
training set proportional to their weights (e.g. 10
replicates of an example with a weight of 0.01
and 100 for one with weight 0.1).
Most algorithms can be enhanced to efficiently
incorporate weights directly in the learning
algorithm so that the effect is the same (e.g.
implement the WeightedInstancesHandler
interface in WEKA).
For decision trees, for calculating information
gain, when counting example i, simply
increment the corresponding count by wi rather
than by 1.
EXAMPLE
Original training set: equal weights to all training samples
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
ADABOOST EXAMPLE
ε = error rate of classifier
α = weight of classifier
[Figures: AdaBoost rounds 1, 2, and 3, showing the reweighted examples at each round.]
HOW IS CLASSIFIER COMBINING
DONE?
At each stage we select the best classifier on the current iteration
and combine it with the set of classifiers learned so far
How are the classifiers combined?
Take the weight × prediction for each classifier, sum these up, and
compare to a threshold (very simple)
Boosting algorithm automatically provides the appropriate weight for
each classifier and the threshold
This version of boosting is known as the AdaBoost algorithm
Some nice mathematical theory shows that it is in fact a very
powerful machine learning technique
EVALUATING MACHINE LEARNING
ALGORITHMS
•You have developed a machine learning approach to a certain
task, and want to validate that it actually works well (or
determine if it doesn't work well)
• Standard approach: divide the data into training and testing sets,
train the method on the training set, and report results on the testing set
• Important: the testing set is not the same as the validation set
EVALUATING MACHINE LEARNING
ALGORITHMS
The proper way to evaluate an ML algorithm
1. Break all data into training/testing sets (e.g., 70%/30%)
2. Break training set into training/validation set (e.g., 70%/30%
again)
3. Choose hyperparameters using validation set
4. (Optional) Once we have selected hyperparameters, retrain
using all the training set
5. Evaluate performance on the testing set
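A minimal sketch (scikit-learn assumed) of the split-and-select procedure above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# 1. Break all data into training/testing sets (70%/30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 2. Break the training set into training/validation sets (70%/30% again)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

# 3. Choose a hyperparameter (here C) using the validation set
best_C = max([0.01, 0.1, 1, 10], key=lambda C: SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val))

# 4. Retrain on all the training data with the selected hyperparameter
final = SVC(C=best_C).fit(X_train, y_train)
# 5. Evaluate performance on the testing set
print(best_C, final.score(X_test, y_test))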
CLASSIFIER EVALUATION METRICS: CONFUSION
MATRIX
Confusion Matrix:
Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
Given m classes, an entry CMi,j in a confusion matrix indicates
the # of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals
CLASSIFIER EVALUATION METRICS:
ACCURACY, ERROR RATE, SENSITIVITY AND
SPECIFICITY
A \ P     C      ¬C
C         TP     FN     P
¬C        FP     TN     N
          P'     N'     All

Classifier accuracy, or recognition rate: percentage of test
set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority of the positive class
Sensitivity: True Positive recognition rate, Sensitivity = TP/P
Specificity: True Negative recognition rate, Specificity = TN/N
CLASSIFIER EVALUATION METRICS:
PRECISION AND RECALL, AND F-MEASURES
Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
Precision = TP/(TP + FP)
Recall: completeness – what % of positive tuples did the
classifier label as positive?
Recall = TP/(TP + FN)
Perfect score is 1.0
Inverse relationship between precision & recall
F measure (F1 or F-score): harmonic mean of precision and recall,
F1 = 2 × Precision × Recall / (Precision + Recall)
Fß: weighted measure of precision and recall,
Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
assigns ß times as much weight to recall as to precision
CLASSIFIER EVALUATION METRICS: EXAMPLE
Actual Class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

Precision = 90/230 = 39.13%     Recall = 90/300 = 30.00%
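A short sketch recomputing these metrics from the confusion matrix above:

TP, FN = 90, 210
FP, TN = 140, 9560
ALL = TP + FN + FP + TN

accuracy = (TP + TN) / ALL            # 0.965
sensitivity = TP / (TP + FN)          # 0.30 (recall)
specificity = TN / (FP + TN)          # 0.9856
precision = TP / (TP + FP)            # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, round(f1, 4))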
EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, …, Dk,
each of approximately equal size
At the i-th iteration, use Di as the test set and the others as the training set
Leave-one-out: k folds where k = # of tuples, for small-sized
data
*Stratified cross-validation*: folds are stratified so that the class
dist. in each fold is approx. the same as that in the initial data
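A minimal sketch (scikit-learn assumed) of 10-fold and stratified 10-fold cross-validation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

for splitter in (KFold(n_splits=10, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=10, shuffle=True, random_state=0)):
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    print(type(splitter).__name__, round(np.mean(scores), 3))   # accuracy = avg. over the k folds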
EVALUATING CLASSIFIER ACCURACY:
BOOTSTRAP
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
Several bootstrap methods; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 – 1/d)^d ≈ e⁻¹ ≈ 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is:
Acc(M) = (1/k) Σ from i=1 to k of (0.632 × Acc(Mi) on the test set + 0.368 × Acc(Mi) on the training set)
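A small sketch (NumPy assumed) showing that a bootstrap sample of size d contains roughly 63.2% of the distinct tuples, with the left-out tuples forming the test set:

import numpy as np

rng = np.random.default_rng(0)
d = 10000
data = np.arange(d)

sample = rng.integers(0, d, size=d)          # sample d tuples with replacement
in_bag = np.unique(sample)                   # tuples that made it into the training set
out_of_bag = np.setdiff1d(data, in_bag)      # the rest form the test set

print(len(in_bag) / d)                       # ~0.632
print(len(out_of_bag) / d)                   # ~0.368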
DEBUGGING