MACHINE LEARNING –UNIT 2
OUTLINE
Logistic regression
Exponential family
Naïve Bayes
Support vector machines
Combining classifiers: bagging, boosting (the AdaBoost algorithm)
Evaluating and debugging learning algorithms
Classification errors.
LOGISTIC REGRESSION
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model
WHY USE LOGISTIC REGRESSION?
There are many important research topics for
which the dependent variable is "limited."
For example, voting, morbidity or mortality,
and participation data are not continuous or
normally distributed.
Binary logistic regression is a type of
regression analysis where the dependent
variable is a dummy variable: coded 0 (did not
vote) or 1 (did vote)
THE LINEAR PROBABILITY
MODEL
In the OLS regression:
Y = α + βX + e, where Y ∈ {0, 1}
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0
THE LOGISTIC REGRESSION
MODEL
The "logit" model:
ln[p/(1−p)] = α + βX + e
p is the probability that the event Y occurs, p(Y=1)
p/(1−p) is the "odds"
ln[p/(1−p)] is the log odds, or "logit"
More:
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(−α − βX)]
If α + βX = 0, then p = .50
As α + βX gets very large, p approaches 1
As α + βX gets very small (very negative), p approaches 0
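A minimal sketch (assuming NumPy is available and using purely illustrative values for α and β) of how the logistic function maps α + βX to a probability between 0 and 1:

import numpy as np

def logistic_probability(alpha, beta, x):
    # Estimated probability p(Y=1) = 1 / (1 + exp(-(alpha + beta*x)))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

alpha, beta = -1.0, 0.5   # hypothetical coefficients, not estimated from data
for x in [-10, 0, 2, 10]:
    print(x, logistic_probability(alpha, beta, x))
# When alpha + beta*x = 0 the probability is 0.5; large positive values push it
# toward 1, large negative values toward 0.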
COMPARING LP AND LOGIT MODELS
[Figure: predicted probability versus X for the linear probability (LP) model and the logit model; the LP line can leave the 0–1 range, while the logit curve stays between 0 and 1.]
MAXIMUM LIKELIHOOD
ESTIMATION (MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that
occur in the sample:
L = Prob(p1 · p2 · ⋯ · pn)
The higher L is, the higher the probability of
observing the p's in the sample.
MLE involves finding the coefficients (α, β)
that make the log of the likelihood function
(LL < 0) as large as possible,
or, equivalently, finding the coefficients that
make −2 times the log of the likelihood
function (−2LL) as small as possible.
The maximum likelihood estimates solve the
following condition:
Σi {Yi − p(Yi = 1)} Xi = 0
summed over all observations, i = 1, …, n
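A minimal sketch (assuming NumPy and a small synthetic dataset) of estimating α and β by maximizing the log likelihood with gradient ascent; the gradient is exactly the condition above, Σ (y − p) x:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_alpha, true_beta = -0.5, 1.5                 # illustrative "true" values
p = 1 / (1 + np.exp(-(true_alpha + true_beta * x)))
y = rng.binomial(1, p)

alpha, beta = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(alpha + beta * x)))
    # Gradient of the (mean) log likelihood: (y - p_hat) times each regressor
    alpha += lr * np.mean(y - p_hat)
    beta += lr * np.mean((y - p_hat) * x)

print(alpha, beta)   # should land near the true values for enough data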
INTERPRETING COEFFICIENTS
Since ln[p/(1−p)] = α + βX + e,
the slope coefficient (β) is interpreted as the rate of
change in the "log odds" as X changes.
Since
p = 1/[1 + exp(−α − βX)],
an interpretation of the logit coefficient that is
usually more intuitive is the "odds ratio".
Since p/(1−p) = exp(α + βX),
exp(β) is the odds ratio: the multiplicative effect of a
one-unit increase in the independent variable on the odds.
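A small illustration (with a hypothetical coefficient) of the odds-ratio reading of exp(β):

import numpy as np

beta = 0.7                    # hypothetical logit coefficient
odds_ratio = np.exp(beta)     # ~2.01: each one-unit increase in X roughly doubles the odds
print(odds_ratio)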
LDA VS. LOGISTIC
REGRESSION
LDA (Generative model)
Assumes Gaussian class-conditional densities and a common covariance
Model parameters are estimated by maximizing the full log likelihood,
parameters for each class are estimated independently of other classes,
Kp+p(p+1)/2+(K-1) parameters
Makes use of marginal density information Pr(X)
Easier to train, low variance, more efficient if model is correct
Higher asymptotic error, but converges faster
Logistic Regression (Discriminative model)
Assumes class-conditional densities are members of the (same) exponential
family distribution
Model parameters are estimated by maximizing the conditional log likelihood,
simultaneous consideration of all other classes, (K-1)(p+1) parameters
Ignores marginal density information Pr(X)
Harder to train, robust to uncertainty about the data generation process
Lower asymptotic error, but converges more slowly
EXPONENTIAL FAMILY
EXPONENTIAL FAMILY - GAUSSIAN
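(The equations on these slides did not survive extraction. As a brief standard reminder rather than a reconstruction of the originals: a distribution is in the exponential family if it can be written as p(y; η) = b(y) exp(ηᵀ T(y) − a(η)), where η is the natural parameter, T(y) the sufficient statistic, a(η) the log-partition function, and b(y) the base measure. The Gaussian, Bernoulli, Poisson, and multinomial distributions are all members; the Bernoulli case, with η = ln(φ/(1−φ)), is what links the exponential family to logistic regression.)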
GENERATIVE MODEL - SUPERVISED
GAUSSIAN
GENERATIVE MODEL - UNSUPERVISED
DISCRIMINATIVE MODEL - SUPERVISED
GENERATIVE VS.
DISCRIMINATIVE LEARNING
Example: Generative = Linear Discriminant Analysis; Discriminative = Logistic Regression
Objective function: Generative = full log likelihood, Σi log p(xi, yi); Discriminative = conditional log likelihood, Σi log p(yi | xi)
Model assumptions: Generative = class densities p(x | y = k), e.g. Gaussian in LDA; Discriminative = discriminant functions δk(x)
Parameter estimation: Generative = "easy", a single sweep; Discriminative = "hard", iterative optimization
Advantages: Generative = more efficient if the model is correct, borrows strength from p(x); Discriminative = more flexible, robust because of fewer assumptions
Disadvantages: Generative = biased if the model is incorrect; Discriminative = may also be biased, ignores information in p(x)
NAÏVE BAYES
• Bayes classification
P(c | x) ∝ P(x | c) P(c) = P(x1, …, xn | c) P(c), for c = c1, …, cL.
Difficulty: learning the joint probability P(x1, …, xn | c) is infeasible!
• Naïve Bayes classification
– Assume all input features are class-conditionally independent! Applying the independence assumption:
P(x1, x2, …, xn | c) = P(x1 | x2, …, xn, c) P(x2, …, xn | c)
= P(x1 | c) P(x2, …, xn | c)
= P(x1 | c) P(x2 | c) ⋯ P(xn | c)
– Apply the MAP classification rule: for a test instance x' = (a1, a2, …, an), assign x' to c* if
[P(a1 | c*) ⋯ P(an | c*)] P(c*) > [P(a1 | c) ⋯ P(an | c)] P(c), for all c ≠ c*, c ∈ {c1, …, cL}
(each bracketed product is the estimate of P(a1, …, an | c*) or of P(a1, …, an | c)).
NAÏVE BAYES
Learning phase: for each target value ci (ci ∈ {c1, …, cL}),
P̂(ci) ← estimate P(ci) with examples in S;
for every feature value xjk of each feature xj (j = 1, …, F; k = 1, …, Nj),
P̂(xj = xjk | ci) ← estimate P(xjk | ci) with examples in S.
Test phase: given an instance x' = (a1, …, an), assign x' to c* if
[P̂(a1 | c*) ⋯ P̂(an | c*)] P̂(c*) > [P̂(a1 | ci) ⋯ P̂(an | ci)] P̂(ci), for all ci ≠ c*, ci ∈ {c1, …, cL}.
EXAMPLE
• Example: Play Tennis
EXAMPLE
• Learning Phase
Outlook     Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny       2/9       3/5          Hot          2/9       2/5
Overcast    4/9       0/5          Mild         4/9       2/5
Rain        3/9       2/5          Cool         3/9       1/5

Humidity    Play=Yes  Play=No      Wind         Play=Yes  Play=No
High        3/9       4/5          Strong       3/9       3/5
Normal      6/9       1/5          Weak         6/9       2/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
EXAMPLE
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– Decision making with the MAP rule
P(Yes|x') ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x') ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x') < P(No|x'), we label x' as "No".
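A minimal sketch of the same test-phase computation, with the conditional probabilities from the learning-phase tables hard-coded for this example:

# Probabilities taken from the learning-phase tables above
p_yes, p_no = 9/14, 5/14
cond_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
cond_no = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}

x_new = ["Sunny", "Cool", "High", "Strong"]
score_yes, score_no = p_yes, p_no
for a in x_new:
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(round(score_yes, 4), round(score_no, 4))   # 0.0053 0.0206
print("Yes" if score_yes > score_no else "No")   # "No"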
NAÏVE BAYES
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take infinitely many values, so a probability table cannot be built
– The conditional probability is often modeled with the normal distribution:
P̂(xj | ci) = (1 / (√(2π) σji)) exp(−(xj − μji)² / (2 σji²))
μji : mean (average) of feature values xj of examples for which c = ci
σji : standard deviation of feature values xj of examples for which c = ci
– Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL,
output n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: given an unknown instance X' = (a1, …, an)
• Instead of looking up tables, calculate the conditional probabilities with the
normal distributions obtained in the learning phase
• Apply the MAP rule to assign a label (the same as in the discrete case)
NAÏVE BAYES
• Example: Continuous-valued Features
– Temperature is naturally a continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and standard deviation for each class:
μ = (1/N) Σn xn,   σ² = (1/(N−1)) Σn (xn − μ)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09
– Learning Phase: output two Gaussian models for P(temp | C):
P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²)) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / 11.09)
P̂(x | No) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²)) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / 100.54)
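A minimal sketch (NumPy assumed) of fitting these two Gaussians and evaluating P̂(temp | C) for a new temperature reading:

import numpy as np

temp_yes = np.array([25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8])
temp_no = np.array([27.3, 30.1, 17.4, 29.5, 15.1])

def gaussian_pdf(x, mu, sigma):
    # Normal density used for a continuous-valued feature
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu_yes, sd_yes = temp_yes.mean(), temp_yes.std(ddof=1)   # ~21.64, ~2.35
mu_no, sd_no = temp_no.mean(), temp_no.std(ddof=1)       # ~23.88, ~7.09

x = 22.0   # a new temperature reading
print(gaussian_pdf(x, mu_yes, sd_yes), gaussian_pdf(x, mu_no, sd_no))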
ZERO CONDITIONAL PROBABILITY
• If no training example contains a particular feature value
– In this circumstance we face a zero-conditional-probability problem at test time:
P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0 whenever P̂(ajk | ci) = 0 for xj = ajk
– As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:
P̂(ajk | ci) = (nc + m·p) / (n + m)
nc : number of training examples for which xj = ajk and c = ci
n : number of training examples for which c = ci
p : prior estimate (usually, p = 1/t for t possible values of xj)
m : weight given to the prior (the number of "virtual" examples, m ≥ 1)
ZERO CONDITIONAL PROBABILITY
• Example: P(Outlook=Overcast | No) = 0 in the play-tennis dataset
– Add m "virtual" examples (m: up to 1% of the number of training examples)
• In this dataset, the number of training examples for the "No" class is 5.
• We therefore add only m = 1 "virtual" example in our m-estimate remedy.
– The "Outlook" feature can take only 3 values, so p = 1/3.
– Re-estimate P(Outlook=Overcast | No) with the m-estimate.
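A small sketch of the re-estimate, using the values from the example above:

def m_estimate(n_c, n, p, m):
    # m-estimate of a class-conditional probability: (n_c + m*p) / (n + m)
    return (n_c + m * p) / (n + m)

# Outlook=Overcast never occurs with Play=No: n_c = 0, n = 5, p = 1/3, m = 1
print(m_estimate(0, 5, 1/3, 1))   # ~0.056 instead of 0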
SUPPORT VECTOR MACHINES
The Scalar Product
The scalar or dot product is, in some sense, a measure of similarity between two vectors a and b:
a · b = |a| |b| cos θ
where θ is the angle between a and b.
DECISION FUNCTION FOR BINARY CLASSIFICATION
f(x) ∈ R
f(xi) ≥ 0 ⇒ yi = +1
f(xi) < 0 ⇒ yi = −1
SUPPORT VECTOR MACHINES
SVMs pick best separating hyperplane according
to some criterion
e.g. maximum margin
Training process is an optimisation
Training set is effectively reduced to a relatively
small number of support vectors
FEATURE SPACES
We may separate data by mapping to a higher-
dimensional feature space
The feature space may even have an infinite
number of dimensions!
We need not explicitly construct the new feature
space
KERNELS
We may use Kernel functions to implicitly map
to a new feature space
Kernel function: K(x1, x2) ∈ R
The kernel must be equivalent to an inner product in
some feature space
EXAMPLE KERNELS
Linear: x · z
Polynomial: (x · z)^P
Gaussian: exp(−‖x − z‖² / (2σ²))
PERCEPTRON REVISITED: LINEAR
SEPARATORS
Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0
f(x) = sign(wTx + b)
WHICH OF THE LINEAR SEPARATORS
IS OPTIMAL?
BEST LINEAR SEPARATOR?
FIND CLOSEST POINTS IN CONVEX HULLS
[Figure: the closest points c and d in the convex hulls of the two classes.]
PLANE BISECTS CLOSEST POINTS
The separating plane wᵀx + b = 0 bisects the segment between the closest points, with normal w = d − c.
CLASSIFICATION MARGIN
Distance from an example x to the separator is r = |wᵀx + b| / ‖w‖
Data closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between classes.
MAXIMUM MARGIN CLASSIFICATION
Maximizing the margin is good according to intuition and
theory.
Implies that only support vectors are important; other training
examples are ignorable.
STATISTICAL LEARNING THEORY
Misclassification error and the function complexity
bound the generalization error.
Maximizing margins minimizes complexity.
"Eliminates" overfitting.
The solution depends only on the support vectors, not
on the number of attributes.
MARGINS AND COMPLEXITY
Skinny margin
is more flexible
thus more complex.
MARGINS AND COMPLEXITY
Fat margin
is less complex.
LINEAR SVM MATHEMATICALLY
Assuming all data is at distance larger than 1 from the
hyperplane, the following two constraints follow for a
training set {(xi ,yi)}
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ -1 if yi = -1
For support vectors, the inequality becomes an
equality; then, since each example's distance from the
hyperplane is
r = y (wᵀx + b) / ‖w‖
the margin is:
ρ = 2 / ‖w‖
LINEAR SVMS MATHEMATICALLY
(CONT.)
Then we can formulate the quadratic optimization
problem:
Find w and b such that ρ = 2 / ‖w‖ is maximized and, for all {(xi, yi)}:
wᵀxi + b ≥ 1 if yi = 1;  wᵀxi + b ≤ -1 if yi = -1
An equivalent formulation:
Find w and b such that
Φ(w) = ½ wᵀw is minimized and for all {(xi, yi)}
yi (wᵀxi + b) ≥ 1
SOLVING THE OPTIMIZATION PROBLEM
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
Quadratic optimization problems are a well-known class of
mathematical programming problems, and many (rather
intricate) algorithms exist for solving them.
The solution involves constructing a dual problem where a
Lagrange multiplier αi is associated with every constraint in the
primary problem:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
THE OPTIMIZATION PROBLEM
SOLUTION
The solution has the form:
w = Σαiyixi     b = yk - wᵀxk for any xk such that αk ≠ 0
Each non-zero αi indicates that corresponding xi is a support vector.
Then the classifying function will have the form:
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x
and the support vectors xi – we will return to this later!
Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points!
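A minimal sketch (assuming scikit-learn and a tiny illustrative dataset) that fits a linear SVM and reconstructs w = Σαiyixi and b from the support vectors, just to show the form of the solution:

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative data, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_     # w = sum_i alpha_i y_i x_i
b = clf.intercept_
print(w, b, clf.support_vectors_)
print(np.sign(X @ w.ravel() + b) == y)        # f(x) = sign(w^T x + b) separates the training set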
SOFT MARGIN CLASSIFICATION
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification
of difficult or noisy examples.
SOFT MARGIN CLASSIFICATION
MATHEMATICALLY
The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
The new formulation incorporating slack variables:
Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i
Parameter C can be viewed as a way to control overfitting.
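A brief sketch (scikit-learn assumed, with a hypothetical noisy dataset) of how C trades margin width against slack: small C tolerates more violations, large C penalizes them heavily:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # margin width 2/||w||
    print(C, round(margin, 3), len(clf.support_))     # smaller C typically gives a wider
                                                      # margin and more support vectors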
SOFT MARGIN CLASSIFICATION –
SOLUTION
The dual problem for soft margin classification:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Neither slack variables ξi nor their Lagrange multipliers
appear in the dual problem!
Again, xi with non-zero αi will be support vectors.
Solution to the dual problem is:
w = Σαiyixi
b = yk(1 - ξk) - wᵀxk, where k = argmaxk αk
f(x) = Σαiyixiᵀx + b
THEORETICAL JUSTIFICATION FOR
MAXIMUM MARGINS
Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded
from above as
h ≤ min( ⌈D² / ρ²⌉, m0 ) + 1
where ρ is the margin, D is the diameter of the smallest sphere that
can enclose all of the training examples, and m0 is the
dimensionality.
Intuitively, this implies that regardless of dimensionality m0 we can
minimize the VC dimension by maximizing the margin ρ.
Thus, complexity of the classifier is kept small regardless of
dimensionality.
LINEAR SVMS: OVERVIEW
The classifier is a separating hyperplane.
Most “important” training points are support vectors; they
define the hyperplane.
Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrange
multipliers αi.
Both in the dual formulation of the problem and in the solution,
training points appear only inside inner products:
f(x) = Σαiyixiᵀx + b
EXAMPLE
NON-LINEAR SVMS
Datasets that are linearly separable with some noise
work out great.
But what are we going to do if the dataset is just too
hard to separate in the original space?
How about mapping the data to a higher-dimensional
space, e.g. x → (x, x²)?
NONLINEAR CLASSIFICATION
x = (a, b)
x · w = w1 a + w2 b
φ(x) = (a, b, ab, a², b²)
φ(x) · w = w1 a + w2 b + w3 ab + w4 a² + w5 b²
NON-LINEAR SVMS: FEATURE SPACES
General idea: the original feature space can always be
mapped to some higher-dimensional feature space where
the training set is separable:
Φ: x → φ(x)
THE “KERNEL TRICK”
The linear classifier relies on inner product between vectors
K(xi,xj)=xiTxj
If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is some function that corresponds to an inner product
into some feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiᵀxj)²
Need to show that K(xi,xj) = φ(xi)ᵀφ(xj):
K(xi,xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2xi1  √2xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2xj1  √2xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2x1  √2x2]
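A quick numerical check (NumPy assumed) that the explicit map φ really reproduces the polynomial kernel:

import numpy as np

def phi(x):
    # Explicit feature map for the kernel (1 + x^T z)^2 in two dimensions
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

lhs = (1 + xi @ xj) ** 2     # kernel evaluated directly
rhs = phi(xi) @ phi(xj)      # inner product in the mapped feature space
print(lhs, rhs)              # both equal 4.0 here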
Positive Definite Matrices
A square matrix A is positive definite if xTAx>0 for
all nonzero column vectors x.
It is negative definite if xTAx < 0 for all nonzero x.
It is positive semi-definite if xᵀAx ≥ 0 for all x,
and negative semi-definite if xᵀAx ≤ 0 for all x.
WHAT FUNCTIONS ARE KERNELS?
For some functions K(xi,xj), checking that
K(xi,xj) = φ(xi)ᵀφ(xj) can be cumbersome.
Mercer's theorem:
Every positive semi-definite symmetric function is a
kernel
Positive semi-definite symmetric functions correspond
to a positive semi-definite symmetric Gram matrix:
K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xN)
K= K(x2,x1) K(x2,x2) K(x2,x3) K(x2,xN)
… … … … …
K(xN,x1) K(xN,x2) K(xN,x3) … K(xN,xN)
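A small sketch (NumPy assumed) that builds the Gram matrix for the Gaussian kernel on random points and confirms its eigenvalues are non-negative, as Mercer's theorem predicts:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # 20 random points in 3 dimensions
sigma = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True: the Gram matrix is positive semi-definite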
EXAMPLES OF KERNEL FUNCTIONS
Linear: K(xi,xj) = xiᵀxj
Polynomial of power p: K(xi,xj) = (1 + xiᵀxj)^p
Gaussian (radial-basis function network): K(xi,xj) = exp(−‖xi − xj‖² / (2σ²))
Two-layer perceptron: K(xi,xj) = tanh(β0 xiᵀxj + β1)
NON-LINEAR SVMS MATHEMATICALLY
Dual problem formulation:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = Σ αi yi K(xi, x) + b
Optimization techniques for finding the αi's remain the same!
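A minimal sketch (scikit-learn assumed) of a kernelized SVM on data that no linear separator can handle, using the Gaussian (RBF) kernel:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print(linear.score(X, y))   # typically around chance level
print(rbf.score(X, y))      # near 1.0: the kernel implicitly maps to a separable space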
EXAMPLE
SVM APPLICATIONS
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis
[Schölkopf et al. ’99], etc.
Most popular optimization algorithms for SVMs are SMO [Platt ’99]
and SVMlight [Joachims’ 99], both use decomposition to hill-climb over
a subset of αi’s at a time.
Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
SVM EXTENSIONS
Regression
Variable Selection
Boosting
Density Estimation
Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
LEARNING ENSEMBLES
Learn multiple alternative definitions of a concept using
different training data or different learning algorithms.
Combine decisions of multiple definitions, e.g. using
weighted voting.
Training Data → Data1, Data2, …, Data m
Data1 → Learner1 → Model1;  Data2 → Learner2 → Model2;  …;  Data m → Learner m → Model m
Model1, Model2, …, Model m → Model Combiner → Final Model
VALUE OF ENSEMBLES
When combining multiple independent and diverse
decisions, each of which is at least more accurate than
random guessing, random errors cancel each other out
and correct decisions are reinforced.
Human ensembles are demonstrably better
How many jelly beans in the jar?: Individual estimates vs.
group average.
Who Wants to be a Millionaire: Expert friend vs. audience
vote.
HOMOGENEOUS ENSEMBLES
Use a single, arbitrary learning algorithm but
manipulate training data to make it learn
multiple models.
Data1 Data2 … Data m
Learner1 = Learner2 = … = Learner m
Different methods for changing training data:
Bagging: Resample training data
Boosting: Reweight training data
DECORATE: Add additional artificial training data
In WEKA, these are called meta-learners; they
take a learning algorithm as an argument (the base
learner) and create a new learning algorithm.
BAGGING
Create ensembles by repeatedly randomly resampling the
training data (Breiman, 1996).
Given a training set of size n, create m samples of size n
by drawing n examples from the original data, with
replacement.
Each bootstrap sample will on average contain 63.2% of the
unique training examples; the rest are replicates.
Combine the m resulting models using simple majority
vote.
Decreases error by decreasing the variance in the results
due to unstable learners, algorithms (like decision trees)
whose output can change dramatically when the training
data is slightly changed.
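A minimal sketch (scikit-learn assumed) of bagging: m bootstrap samples of size n, one decision tree per sample, and a simple majority vote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
n, m = len(X), 25
rng = np.random.default_rng(0)

models = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)          # draw n examples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Simple majority vote over the m models
votes = np.array([model.predict(X) for model in models])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y).mean())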
BOOSTING
Originally developed by computational learning theorists
to guarantee performance improvements on fitting
training data for a weak learner that only needs to
generate a hypothesis with a training accuracy greater
than 0.5 (Schapire, 1990).
Revised to be a practical algorithm, AdaBoost, for
building ensembles that empirically improves
generalization performance (Freund & Schapire, 1996).
Examples are given weights. At each iteration, a new
hypothesis is learned and the examples are reweighted to
focus the system on examples that the most recently
learned classifier got wrong.
BOOSTING: BASIC ALGORITHM
General Loop:
Set all examples to have equal uniform weights.
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples
Decrease the weights of examples ht classifies correctly
Base (weak) learner must focus on correctly classifying the
most highly weighted examples while strongly avoiding over-
fitting.
During testing, each of the T hypotheses get a weighted vote
proportional to their accuracy on the training data.
ADABOOST PSEUDOCODE
TrainAdaBoost(D, BaseLearn)
For each example di in D let its weight wi=1/|D|
Let H be an empty set of hypotheses
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples: ht=BaseLearn(D)
Add ht to H
Calculate the error, εt, of the hypothesis ht as the total sum weight of the
examples that it classifies incorrectly.
If εt > 0.5 then exit loop, else continue.
Let βt = εt / (1 – εt )
Multiply the weights of the examples that ht classifies correctly by βt
Rescale the weights of all of the examples so the total sum weight remains 1.
Return H
TestAdaBoost(ex, H)
Let each hypothesis, ht, in H vote for ex’s classification with weight log(1/ βt )
Return the class with the highest weighted vote total.
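A minimal Python sketch of the pseudocode above, using decision stumps from scikit-learn as an assumed base learner; the weight update and the log(1/β) voting follow the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
n, T = len(X), 20
w = np.full(n, 1.0 / n)                    # equal initial weights
hypotheses, betas = [], []

for _ in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()               # weighted training error
    if err > 0.5 or err == 0:              # also stop if error is zero (beta would be 0)
        break
    beta = err / (1.0 - err)
    w[pred == y] *= beta                   # down-weight correctly classified examples
    w /= w.sum()                           # rescale so the total weight stays 1
    hypotheses.append(h)
    betas.append(beta)

# Weighted vote: each hypothesis votes with weight log(1 / beta_t)
votes_for_1 = sum(np.log(1.0 / b) * (h.predict(X) == 1) for h, b in zip(hypotheses, betas))
total = sum(np.log(1.0 / b) for b in betas)
final_pred = (votes_for_1 > total / 2).astype(int)
print((final_pred == y).mean())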
LEARNING WITH WEIGHTED
EXAMPLES
Generic approach is to replicate examples in the
training set proportional to their weights (e.g. 10
replicates of an example with a weight of 0.01
and 100 for one with weight 0.1).
Most algorithms can be enhanced to efficiently
incorporate weights directly in the learning
algorithm so that the effect is the same (e.g.
implement the WeightedInstancesHandler
interface in WEKA).
For decision trees, for calculating information
gain, when counting example i, simply
increment the corresponding count by wi rather
than by 1.
EXAMPLE
Original training set: equal weights to all training samples
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
ADABOOST EXAMPLE
ε = error rate of classifier
α = weight of classifier
[Figures: AdaBoost rounds 1, 2, and 3, showing the reweighted examples at each round.]
HOW IS CLASSIFIER COMBINING
DONE?
At each stage we select the best classifier on the current iteration
and combine it with the set of classifiers learned so far
How are the classifiers combined?
Take the weight × prediction for each classifier, sum these up, and
compare to a threshold (very simple)
Boosting algorithm automatically provides the appropriate weight for
each classifier and the threshold
This version of boosting is known as the AdaBoost algorithm
Some nice mathematical theory shows that it is in fact a very
powerful machine learning technique
EVALUATING MACHINE LEARNING
ALGORITHMS
•You have developed a machine learning approach to a certain
task, and want to validate that it actually works well (or
determine if it doesn't work well)
• Standard approach: divide the data into training and testing sets,
train the method on the training set, and report results on the testing set
• Important: the testing set is not the same as the validation set
EVALUATING MACHINE LEARNING
ALGORITHMS
The proper way to evaluate an ML algorithm
1. Break all data into training/testing sets (e.g., 70%/30%)
2. Break training set into training/validation set (e.g., 70%/30%
again)
3. Choose hyperparameters using validation set
4. (Optional) Once we have selected hyperparameters, retrain
using all the training set
5. Evaluate performance on the testing set
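A minimal sketch (scikit-learn assumed) of the split-and-select procedure above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# 1. Break all data into training/testing sets (70%/30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# 2. Break the training set into training/validation sets (70%/30% again)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

# 3. Choose a hyperparameter (here C) using the validation set
best_C = max([0.01, 0.1, 1, 10], key=lambda C: SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val))

# 4. Retrain on all the training data with the selected hyperparameter
final = SVC(C=best_C).fit(X_train, y_train)
# 5. Evaluate performance on the testing set
print(best_C, final.score(X_test, y_test))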
CLASSIFIER EVALUATION METRICS: CONFUSION
MATRIX
Confusion Matrix:
Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
Given m classes, an entry CMi,j in a confusion matrix indicates
the # of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals
CLASSIFIER EVALUATION METRICS:
ACCURACY, ERROR RATE, SENSITIVITY AND
SPECIFICITY
A \ P     C      ¬C
C         TP     FN     P
¬C        FP     TN     N
          P'     N'     All

Classifier accuracy, or recognition rate: percentage of test
set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority of the positive class
Sensitivity: True Positive recognition rate, Sensitivity = TP/P
Specificity: True Negative recognition rate, Specificity = TN/N
CLASSIFIER EVALUATION METRICS:
PRECISION AND RECALL, AND F-MEASURES
Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
Precision = TP/(TP + FP)
Recall: completeness – what % of positive tuples did the
classifier label as positive?
Recall = TP/(TP + FN)
Perfect score is 1.0
Inverse relationship between precision & recall
F measure (F1 or F-score): harmonic mean of precision and recall,
F1 = 2 × Precision × Recall / (Precision + Recall)
Fß: weighted measure of precision and recall,
Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
assigns ß times as much weight to recall as to precision
CLASSIFIER EVALUATION METRICS: EXAMPLE
Actual Class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

Precision = 90/230 = 39.13%     Recall = 90/300 = 30.00%
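A short sketch recomputing these metrics from the confusion matrix above:

TP, FN = 90, 210
FP, TN = 140, 9560
ALL = TP + FN + FP + TN

accuracy = (TP + TN) / ALL            # 0.965
sensitivity = TP / (TP + FN)          # 0.30 (recall)
specificity = TN / (FP + TN)          # 0.9856
precision = TP / (TP + FP)            # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, round(f1, 4))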
EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, …, Dk,
each of approximately equal size
At the i-th iteration, use Di as the test set and the others as the training set
Leave-one-out: k folds where k = # of tuples, for small-sized
data
*Stratified cross-validation*: folds are stratified so that the class
dist. in each fold is approx. the same as that in the initial data
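A minimal sketch (scikit-learn assumed) of 10-fold and stratified 10-fold cross-validation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

for splitter in (KFold(n_splits=10, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=10, shuffle=True, random_state=0)):
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    print(type(splitter).__name__, round(np.mean(scores), 3))   # accuracy = avg. over the k folds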
EVALUATING CLASSIFIER ACCURACY:
BOOTSTRAP
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
Several bootstrap methods; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 – 1/d)^d ≈ e⁻¹ ≈ 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is:
Acc(M) = (1/k) Σ from i=1 to k of (0.632 × Acc(Mi) on the test set + 0.368 × Acc(Mi) on the training set)
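A small sketch (NumPy assumed) showing that a bootstrap sample of size d contains roughly 63.2% of the distinct tuples, with the left-out tuples forming the test set:

import numpy as np

rng = np.random.default_rng(0)
d = 10000
data = np.arange(d)

sample = rng.integers(0, d, size=d)          # sample d tuples with replacement
in_bag = np.unique(sample)                   # tuples that made it into the training set
out_of_bag = np.setdiff1d(data, in_bag)      # the rest form the test set

print(len(in_bag) / d)                       # ~0.632
print(len(out_of_bag) / d)                   # ~0.368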
DEBUGGING