Module 3 - Classification

This module provides an overview of classification in data mining, detailing the process of building a model from a training set to classify unseen records accurately. It discusses various classification techniques, including Bayesian classifiers and decision trees, and explains the use of Bayes' theorem in predicting class membership probabilities. It also highlights the advantages and disadvantages of the Naïve Bayesian classifier and introduces decision trees for classification tasks.
CLASSIFICATION

CLASSIFICATION: DEFINITION
 Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
CLASSIFICATION

 Maps data into predefined groups or classes.
 Two-step process:
 Training: a model is built describing a predetermined set of data classes (supervised learning).
 Classification: the accuracy of the model is first estimated; the model is then used to classify/predict new data.
ILLUSTRATING CLASSIFICATION TASK

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The learning algorithm builds a model from the training set (induction); the model is then applied to the test set to assign a class to each unseen record (deduction).
Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured? Since rank = 'professor', the model predicts tenured = 'yes'.
CLASSIFICATION TECHNIQUES

 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
 Foundation: based on Bayes' theorem.
 Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
 Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Bayesian Theorem: Basics

 Let X be a data sample ("evidence"); its class label is unknown
 Let H be the hypothesis that X belongs to class C
 Classification is to determine P(H|X): the probability that the hypothesis holds given the observed data sample X, i.e., the probability that tuple X belongs to class C given the attribute description of X
 P(H) (prior probability): the initial probability
 E.g., the probability that X will buy a computer, regardless of age, income, ...
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
   P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be written as:
   posteriori = likelihood x prior / evidence
 Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn)
 Suppose there are m classes C1, C2, ..., Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes' theorem:
   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only
   P(X|Ci) P(Ci)
 needs to be maximized
Naïve Bayesian Classifier: Training Dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31...40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31...40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31...40   medium  no       excellent      yes
31...40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example

Step 1: Compute P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

Step 2: Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Step 3: Compute P(X|Ci):
P(X|buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

Step 4: Compute P(X|Ci) * P(Ci):
P(X|buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
P(X|buys_computer = "no") * P(buys_computer = "no") = 0.019 x 0.357 = 0.007

Therefore, X belongs to class buys_computer = "yes".
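These four steps can be reproduced with a short, self-contained Python sketch (a minimal illustration, with the training table above hard-coded; attribute order is age, income, student, credit_rating):

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def classify(x):
    class_counts = Counter(row[-1] for row in data)        # Step 1: class priors P(Ci)
    n = len(data)
    scores = {}
    for c, count in class_counts.items():
        rows = [row for row in data if row[-1] == c]
        likelihood = 1.0
        for i, value in enumerate(x):                      # Steps 2-3: P(X|Ci) = product of P(xk|Ci)
            likelihood *= sum(1 for row in rows if row[i] == value) / count
        scores[c] = likelihood * count / n                 # Step 4: P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

label, scores = classify(("<=30", "medium", "yes", "fair"))
print(label, scores)   # 'yes', with scores approx {'yes': 0.028, 'no': 0.007}
```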


Avoiding the 0-Probability Problem

 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
   P(X|Ci) = Πk P(xk|Ci), k = 1..n
 Laplacian correction (or Laplacian estimator): add 1 to each case
 Ex.: suppose a class buys_computer = yes in some training dataset has 1000 tuples,
 with 0 tuples for income = low,
 990 for income = medium,
 and 10 for income = high
 With the Laplacian correction:
   Prob(income = low) = 1/1003
   Prob(income = medium) = 991/1003
   Prob(income = high) = 11/1003
 The "corrected" probability estimates are close to their "uncorrected" counterparts, but none is zero
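A one-function sketch of the correction (assuming q distinct values of the attribute, so each count gets +1 and the class total gets +q; here q = 3):

```python
def laplace_prob(count, class_total, num_values):
    # Add 1 to each value's count; add the number of distinct values to the denominator
    return (count + 1) / (class_total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples of class 'yes'
probs = {v: laplace_prob(c, 1000, len(counts)) for v, c in counts.items()}
print(probs)  # low = 1/1003, medium = 991/1003, high = 11/1003 -- none is zero
```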


Naïve Bayesian Classifier: Comments

 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption of class conditional independence, which causes a loss of accuracy
 In practice, dependencies exist among variables
 E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve Bayesian classifier
 How to deal with these dependencies? Bayesian Belief Networks
Example - Naïve Bayes

• Example: Play Tennis. Classify the new instance:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Example

• Learning Phase

Outlook   Play=Yes  Play=No
Sunny     2/9       3/5
Overcast  4/9       0/5
Rain      3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity  Play=Yes  Play=No
High      3/9       4/5
Normal    6/9       1/5

Wind    Play=Yes  Play=No
Strong  3/9       3/5
Weak    6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example

• Test Phase
– Given a new instance,
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9       P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9    P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9       P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9         P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                    P(Play=No) = 5/14
– MAP rule:
P(Yes|x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x'): [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given that P(Yes|x') < P(No|x'), we label x' as "No".

Decision tree

 A decision tree is a flowchart-like tree structure, where:
 Each internal node denotes a test on an attribute
 Each branch denotes an outcome of the test
 Each leaf node represents a class
 To classify an unknown sample, the attribute values of the sample are tested against the decision tree.
EXAMPLE OF A DECISION TREE

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
ANOTHER EXAMPLE OF DECISION TREE

For the same training data:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK

The same induction/deduction flow as in the earlier illustration: a tree induction algorithm learns a decision tree from the training set (Tid 1-10), and the tree is then applied to assign class labels to the test set (Tid 11-15).
APPLY MODEL TO TEST DATA

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the record at each node:
 Refund = No -> take the "No" branch to MarSt
 MarSt = Married -> reach the leaf NO

Assign Cheat to "No".
DECISION TREE INDUCTION

Many Algorithms:
 Hunt's Algorithm (one of the earliest)
 CART (Classification And Regression Trees)
 ID3 (Iterative Dichotomiser), C4.5
 SLIQ, SPRINT
TREE INDUCTION

 Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
 Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting
HOW TO SPECIFY TEST CONDITION?

 Depends on attribute types
 Nominal
 Ordinal
 Continuous
 Depends on number of ways to split
 2-way split
 Multi-way split
SPLITTING BASED ON NOMINAL ATTRIBUTES

 Multi-way split: use as many partitions as distinct values.
   CarType? -> Family | Sports | Luxury
 Binary split: divides values into two subsets; need to find the optimal partitioning.
   CarType? -> {Sports, Luxury} | {Family}   OR   CarType? -> {Family, Luxury} | {Sports}
SPLITTING BASED ON ORDINAL ATTRIBUTES

 Multi-way split: use as many partitions as distinct values.
   Size? -> Small | Medium | Large
 Binary split: divides values into two subsets; need to find the optimal partitioning.
   Size? -> {Small, Medium} | {Large}   OR   Size? -> {Small} | {Medium, Large}
 What about this split?
   Size? -> {Small, Large} | {Medium}
   It violates the order of the ordinal values, so it is generally not allowed.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static: discretize once at the beginning
 Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
 Binary decision: (A < v) or (A >= v)
 Consider all possible splits and find the best cut
 Can be more compute intensive
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

(i) Binary split:  Taxable Income > 80K?  -> Yes | No
(ii) Multi-way split:  Taxable Income?  -> < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
HOW TO DETERMINE THE BEST SPLIT

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions:
 Own Car?  (Yes: C0=6, C1=4 | No: C0=4, C1=6)
 Car Type?  (Family: C0=1, C1=3 | Sports: C0=8, C1=0 | Luxury: C0=1, C1=7)
 Student ID?  (c1: C0=1, C1=0 | ... | c10: C0=1, C1=0 | c11: C0=0, C1=1 | ... | c20: C0=0, C1=1)

Which test condition is the best? The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. Impurity measures include Entropy(t), Gini(t), Classification error(t), etc.
HOW TO DETERMINE THE BEST SPLIT

 Greedy approach: nodes with a homogeneous class distribution are preferred.
 Need a measure of node impurity:
   C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
   C0: 9, C1: 1  -> homogeneous, low degree of impurity
CHOOSING ATTRIBUTES

• The order in which attributes are chosen determines how complicated the tree is.
• ID3 uses information theory to determine the most informative attribute.
• A measure of the information content of a message is the inverse of the probability of receiving the message:
   information(M) = 1/probability(M)
• Taking logs (base 2) makes information correspond to the number of bits required to encode a message:
   information(M) = -log2(probability(M))
ENTROPY

• Different messages have different probabilities of arrival.
• The overall level of uncertainty (termed entropy) is:
   -Σi Pi log2 Pi
• Frequency can be used as a probability estimate.
• E.g., if there are 5 positive examples and 3 negative examples in a node, the estimated probability of positive is 5/8 = 0.625.
SPLITTING CRITERION

 Work out the entropy based on the distribution of classes.
 Try splitting on each attribute.
 Work out the expected information gain for each attribute.
 Choose the best attribute.
ID3 ALGORITHM

• Constructs a decision tree using a top-down recursive approach.
• The main aim is to choose the splitting attribute with the highest information gain.
• The tree starts as a single node representing all the training samples.
• If all samples are of the same class, the node becomes a leaf and is labeled with that class.
• Otherwise, an entropy-based measure known as information gain is used to select the attribute that will best separate the samples into individual classes. This attribute becomes the test attribute at the node.
• A branch is constructed for each value of the test attribute, and the samples are partitioned accordingly.
• The algorithm recursively applies the same process to form a decision tree for the samples at each branch.
• The recursive partitioning stops when either:
  • the node is a leaf (all samples belong to the same class), or
  • there are no remaining splitting attributes.
CALCULATE INFORMATION GAIN

Entropy: given a collection S of examples from c classes,
   Entropy(S) = Σ -p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I, and the sum Σ runs over the c classes.

Gain(S, A), the information gain of example set S on attribute A, is defined as
   Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)
where:
 the sum Σ runs over each value v of all possible values of attribute A
 Sv = subset of S for which attribute A has value v
 |Sv| = number of elements in Sv
 |S| = number of elements in S
Formula for ID3 algorithm

 The expected information needed to classify a tuple in D is given by
   Info(D) = -Σ (i=1..m) pi log2(pi)
 where pi is the probability that a tuple in D belongs to class Ci; this is the Entropy(S) defined above.
TRAINING SET
RID Age Income Student Credit Buys
1 <30 High No Fair No
2 <30 High No Excellent No
3 31-40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31-40 Low Yes Excellent Yes
8 <30 Medium No Fair No
9 <30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <30 Medium Yes Excellent Yes
12 31-40 Medium No Excellent Yes
13 31-40 High Yes Fair Yes
14 >40 Medium No Excellent No
S is a collection of 14 examples with 9 YES and 5 NO examples, so
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Let us consider Age as the splitting attribute:
Gain(S, Age) = Entropy(S) - (5/14)*Entropy(S<30) - (4/14)*Entropy(S31-40) - (5/14)*Entropy(S>40)
             = 0.940 - (5/14)*0.971 - (4/14)*0 - (5/14)*0.971
             = 0.246
Similarly, consider Student as the splitting attribute:
Gain(S, Student) = Entropy(S) - (7/14)*Entropy(SYES) - (7/14)*Entropy(SNO)
                 = 0.151
Similarly, consider Credit as the splitting attribute:
Gain(S, Credit) = Entropy(S) - (8/14)*Entropy(SFAIR) - (6/14)*Entropy(SEXCELLENT)
                = 0.048
Similarly, consider Income as the splitting attribute:
Gain(S, Income) = Entropy(S) - (4/14)*Entropy(SHIGH) - (6/14)*Entropy(SMED) - (4/14)*Entropy(SLOW)
                = 0.029
We see that Gain(S, Age) is maximal at 0.246. Hence Age is chosen as the splitting attribute.

Age?
  <30   -> 5 samples
  31-40 -> YES
  >40   -> 5 samples
Recursively, we find the splitting attributes at the next level.
Let us consider Student as the next splitting attribute for the Age<30 branch:
Gain(Age<30, Student) = Entropy(Age<30) - (2/5)*Entropy(SYES) - (3/5)*Entropy(SNO)
                      = 0.971 - (2/5)*0 - (3/5)*0
                      = 0.971
Similarly, Credit as the next splitting attribute:
Gain(Age<30, Credit) = Entropy(Age<30) - (3/5)*Entropy(SFAIR) - (2/5)*Entropy(SEXCELLENT)
                     = 0.0202
Similarly, Income as the next splitting attribute:
Gain(Age<30, Income) = Entropy(Age<30) - (2/5)*Entropy(SHIGH) - (2/5)*Entropy(SMED) - (1/5)*Entropy(SLOW)
                     = 0.571
We see that Gain(Age<30, Student) is maximal at 0.971, so Student is chosen as the next splitting attribute. Similarly, we find that Gain(Age>40, Credit) is maximal. Hence the tree:

Age?
  <30   -> Student?
             yes -> YES
             no  -> NO
  31-40 -> YES
  >40   -> Credit?
             fair      -> YES
             excellent -> NO
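The entropy and gain numbers above can be checked with a small Python sketch (a minimal illustration with the 14-tuple TRAINING SET hard-coded; results match the slides up to rounding):

```python
from math import log2
from collections import Counter

# (Age, Income, Student, Credit, Buys) - the TRAINING SET table above
rows = [
    ("<30","High","No","Fair","No"), ("<30","High","No","Excellent","No"),
    ("31-40","High","No","Fair","Yes"), (">40","Medium","No","Fair","Yes"),
    (">40","Low","Yes","Fair","Yes"), (">40","Low","Yes","Excellent","No"),
    ("31-40","Low","Yes","Excellent","Yes"), ("<30","Medium","No","Fair","No"),
    ("<30","Low","Yes","Fair","Yes"), (">40","Medium","Yes","Fair","Yes"),
    ("<30","Medium","Yes","Excellent","Yes"), ("31-40","Medium","No","Excellent","Yes"),
    ("31-40","High","Yes","Fair","Yes"), (">40","Medium","No","Excellent","No"),
]

def entropy(subset):
    # Entropy(S) = sum over classes of -p log2(p)
    counts = Counter(r[-1] for r in subset)
    total = len(subset)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(subset, attr):
    # Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv)
    total = len(subset)
    remainder = 0.0
    for v in set(r[attr] for r in subset):
        sv = [r for r in subset if r[attr] == v]
        remainder += (len(sv) / total) * entropy(sv)
    return entropy(subset) - remainder

print(round(entropy(rows), 3))              # 0.94
for i, name in enumerate(["Age", "Income", "Student", "Credit"]):
    print(name, round(gain(rows, i), 3))
    # Age 0.247, Income 0.029, Student 0.152, Credit 0.048
    # (the slides quote 0.246 and 0.151 after rounding intermediate values)
```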
EXTRACTING CLASSIFICATION RULES FROM TREES

 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
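A small sketch of this path-to-rule traversal, assuming a hypothetical nested-dict representation of the tree (internal node = {attribute: {value: subtree}}, leaf = class label); the tree below encodes the buys_computer tree derived earlier:

```python
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, path=()):
    if not isinstance(node, dict):              # leaf: one rule per root-to-leaf path
        conds = " AND ".join(f'{attr} = "{val}"' for attr, val in path)
        print(f'IF {conds} THEN buys_computer = "{node}"')
        return
    (attr, branches), = node.items()            # single test attribute at this node
    for value, subtree in branches.items():     # one branch per attribute value
        extract_rules(subtree, path + ((attr, value),))

extract_rules(tree)   # prints the five rules listed above
```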
CONSTRUCT A DECISION TREE FOR THE BELOW EXAMPLE:

 Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2 weeks, data is collected to help ID3 build a decision tree. The target classification is "should we play baseball?", which can be yes or no.
 The weather attributes can have the following values:
 outlook = {sunny, overcast, rain}
 temperature = {hot, mild, cool}
 humidity = {high, normal}
 wind = {weak, strong}
Table 1

Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
We need to find which attribute will be the root node of our decision tree. The gain is calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048

For example, for Wind:
Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Sweak) - (6/14)*Entropy(Sstrong)
              = 0.940 - (8/14)*0.811 - (6/14)*1.00
              = 0.048
Entropy(Sweak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00

The Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node.
Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is: "what attribute should be tested at the Sunny branch node?" Since we have used Outlook at the root, we only decide among the remaining three attributes: Humidity, Temperature, or Wind.
Ssunny = {D1, D2, D8, D9, D11} = the 5 examples from Table 1 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.
Tree Pruning

 Some branches of the decision tree may reflect anomalies due to noise/outliers.
 Overfitting results in decision trees that are more complex than necessary.
 Training error then no longer provides a good estimate of how well the tree will perform on previously unseen records.
 Tree pruning helps in faster classification with more accurate results.
 Two methods:
 Pre-pruning: halt tree construction during formation.
 Post-pruning: remove branches from a fully grown tree.
Tree Pruning

 Pre-pruning (early stopping rule)
 Stop the algorithm before it grows a full tree
 Typical stopping conditions for a node:
 Stop if all instances belong to the same class
 Stop if all the attribute values are the same
 More restrictive conditions:
 Stop if the number of instances is less than some user-specified threshold
 Stop if the class distribution of instances is independent of the available features (e.g., using a chi-squared test)
 Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

 Post-pruning
 Grow the decision tree to its entirety
 Trim the nodes of the decision tree in a bottom-up fashion
 If the generalization error improves after trimming, replace the sub-tree by a leaf node
 The class label of the leaf node is determined from the majority class of instances in the sub-tree
 A combined approach may be used, as no single pruning method has been found to be superior to the others
 Decision trees suffer from repetition and replication, which can be solved by multivariate splits
WHAT IS PREDICTION?

 (Numerical) prediction is similar to classification:
 construct a model
 use the model to predict a continuous or ordered value for a given input
 Prediction is different from classification:
 Classification refers to predicting a categorical class label
 Prediction models continuous-valued functions
 The major method for prediction is regression:
 model the relationship between one or more independent (predictor) variables and a dependent (response) variable
 Regression analysis:
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
LINEAR REGRESSION

 Linear regression: involves a response variable y and a single predictor variable x:
   y = w0 + w1 x
 where w0 (y-intercept) and w1 (slope) are the regression coefficients.
 Method of least squares: estimates the best-fitting straight line:
   w1 = [Σ (i=1..|D|) (xi - x̄)(yi - ȳ)] / [Σ (i=1..|D|) (xi - x̄)²]
   w0 = ȳ - w1 x̄
 Multiple linear regression: involves more than one predictor variable
 Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
 Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
 Solvable by an extension of the least squares method or using SAS, S-Plus
 Many nonlinear functions can be transformed into the above
Example: Tips for service

Meal (#)  Tip Amount (Rs.)
1         5.00
2         17.00
3         11.00
4         8.00
5         14.00
6         5.00

(Figure: scatter plot of tip amount vs. meal number, with the best-fit line at ȳ = 10.)

With only one variable, and no other information, the best prediction for the next measurement is the mean of the sample itself (ȳ = 10). The variability in tip amounts must be explained by the tips themselves.
"GOODNESS OF FIT" FOR THE TIPS

Meal (#)  Tip Amount (Rs.)
1         5.00
2         17.00
3         11.00
4         8.00
5         14.00
6         5.00

With ȳ = 10, the RESIDUALS are the differences between each observed tip and the mean: -5, 7, 1, -2, 4, -5.
SQUARING THE RESIDUALS (ERRORS)

Why square the residuals?
1. They become positive.
2. Squaring emphasizes larger deviations.
Sum of Squared Errors (SSE) = 25 + 49 + 1 + 4 + 16 + 25 = 120
SQUARING THE RESIDUALS (ERRORS)

The goal of simple linear regression is to create a linear model that minimizes the sum of squared residuals/errors (SSE). If our regression model is significant, it will "eat up" much of the raw SSE. Our aim should be to develop a regression line that fits with minimum SSE.
QUICK REVIEW

 If there is only one variable, the best prediction is the mean.
 The difference between an actual value and a predicted value is the RESIDUAL, or ERROR.
 The residuals are squared to obtain the SSE.
 Simple linear regression is designed to find the best-fitting line through the data, the one that minimizes the SSE.
BIVARIATE STATISTICS

Common bivariate techniques: CORRELATION, ANOVA, and LINEAR REGRESSION.

LINEAR REGRESSION
y = m.X + c
 The value of one variable is a function of the other variable.
 The value of y is a function of x, i.e., y = f(x)
 X is the independent variable, Y is the dependent variable
SOME CONCEPTS FROM ALGEBRA

 y = m.X + b
 X: random variable
 m: slope of the line = rise / run
 b: y-intercept, the value of y when x = 0
SIMPLE LINEAR REGRESSION MODEL

y = ß0 + ß1 X + ɛ
where
ß0 = y-intercept (population parameter)
ß1 = slope (population parameter)
ɛ = error term capturing variation in y from unknown reasons

The simple regression model is E(y) = ß0 + ß1 X
GENERAL REGRESSION LINES

 The accuracy of the E(y) values depends upon the mean value of y and also on the distribution of y.

REGRESSION EQUATION WITH ESTIMATES

The estimated regression line is ŷ = b0 + b1 x, where b0 and b1 are sample estimates of ß0 and ß1.

WHEN THE SLOPE B1 = 0

When ß1 = 0, E(y) does not depend on x: the regression line is horizontal at the mean of y.
GETTING READY FOR LEAST SQUARES

Meal (#)  Bill Amount (Rs)  Tip Amount (Rs.)
1         34.00             5.00
2         108.00            17.00
3         64.00             11.00
4         88.00             8.00
5         99.00             14.00
6         51.00             5.00

We want to know to what degree the tip amount can be predicted by the bill. TIP is the dependent variable; Bill is the independent variable.
LEAST SQUARE CRITERIA

Choose the estimates b0 and b1 so as to minimize the sum of squared residuals, Σ (yi - ŷi)², where ŷi = b0 + b1 xi.
STEP I: DRAW SCATTER PLOT

(Figure: scatter plot of tip amount in Rs. against bill amount in Rs. for the six meals above.)
STEP II: LOOK FOR A VISUAL LINE

(Figure: the same scatter plot, with a straight line eyeballed through the points.)
STEP III: CORRELATION (OPTIONAL)

What is the correlation coefficient r? In our example, r = 0.866.
Is the relationship strong? Yes, in our case!
PEARSON CORRELATION

 Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson correlation. The full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter "r" for a sample.
 The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.
PEARSON CORRELATION

The results will be between -1 and 1. You will very rarely see 0, -1 or 1; you'll get a number somewhere in between those values. The closer the value of r gets to zero, the greater the variation of the data points around the line of best fit.
High correlation: .5 to 1.0 or -0.5 to -1.0.
Medium correlation: .3 to .5 or -0.3 to -0.5.
Low correlation: .1 to .3 or -0.1 to -0.3.
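A short plain-Python sketch (using the bill/tip data from the tips example) that reproduces the r = 0.866 quoted above:

```python
from math import sqrt

bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of deviation products
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r(bills, tips), 3))   # 0.866
```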
STEP IV: DESCRIPTIVE STATISTICS / CENTROID

Meal (#)  Bill Amount (Rs)  Tip Amount (Rs.)
1         34.00             5.00
2         108.00            17.00
3         64.00             11.00
4         88.00             8.00
5         99.00             14.00
6         51.00             5.00

x̄ (mean bill) = 74    ȳ (mean tip) = 10
The point (x̄, ȳ) is the centroid; the least squares line always passes through it.
STEP V: CALCULATIONS

MEAL  Total Bill  Tip Amt.  Bill Deviation  Tip Deviation  Deviation Products  Bill Deviation Squared
1     34          5         -40             -5             200                 1600
2     108         17        34              7              238                 1156
3     64          11        -10             1              -10                 100
4     88          8         14              -2             -28                 196
5     99          14        25              4              100                 625
6     51          5         -23             -5             115                 529
Sum                                                        615                 4206

B1 CALCULATION

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 615 / 4206 ≈ 0.146

B0 CALCULATION

b0 = ȳ - b1·x̄ = 10 - 0.146 × 74 ≈ -0.82

YOUR REGRESSION LINE

ŷ = -0.82 + 0.146x, i.e., predicted Tip ≈ -0.82 + 0.146 × Bill
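The same calculation as a compact Python sketch (a minimal illustration over the six meals; a library routine such as numpy.polyfit would give the same coefficients):

```python
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

n = len(bills)
x_bar = sum(bills) / n            # 74
y_bar = sum(tips) / n             # 10

# Least squares estimates of slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips)) \
     / sum((x - x_bar) ** 2 for x in bills)       # 615 / 4206 ≈ 0.146
b0 = y_bar - b1 * x_bar                           # ≈ -0.82

def predict(bill):
    return b0 + b1 * bill

print(round(b1, 3), round(b0, 2), round(predict(100), 2))  # 0.146 -0.82 13.8
```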
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

Actual class \ Predicted class    C1                    ¬C1
C1                                True Positives (TP)   False Negatives (FN)
¬C1                               False Positives (FP)  True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                6954                  46                   7000
buy_computer = no                 412                   2588                 3000
Total                             7366                  2634                 10000

 Given m classes, an entry CMi,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   C    ¬C
C       TP   FN    P
¬C      FP   TN    N
        P'   N'    All

 Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified:
   Accuracy = (TP + TN)/All
 Error rate: 1 - accuracy, or
   Error rate = (FP + FN)/All
 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate; Sensitivity = TP/P
 Specificity: true negative recognition rate; Specificity = TN/N
Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
   Precision = TP/(TP + FP)
 Recall (completeness): what % of positive tuples did the classifier label as positive?
   Recall = TP/(TP + FN)
 A perfect score is 1.0
 There is an inverse relationship between precision and recall
 F measure (F1 or F-score): harmonic mean of precision and recall:
   F1 = (2 x Precision x Recall)/(Precision + Recall)
 Fß: weighted measure of precision and recall that assigns ß times as much weight to recall as to precision:
   Fß = ((1 + ß²) x Precision x Recall)/(ß² x Precision + Recall)
Classifier Evaluation Metrics: Example

Actual Class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
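To make the arithmetic concrete, a small Python sketch computing every metric above from this confusion matrix (the F1 line is an extra, using the harmonic-mean formula):

```python
# Confusion-matrix metrics for the cancer example above
TP, FN = 90, 210      # actual yes: predicted yes / predicted no
FP, TN = 140, 9560    # actual no:  predicted yes / predicted no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL        # 0.9650
error_rate  = (FP + FN) / ALL        # 0.0350
sensitivity = TP / P                 # 0.3000 (= recall)
specificity = TN / N                 # 0.9856
precision   = TP / (TP + FP)         # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ≈ 0.3396

print(f"acc={accuracy:.4f} sens={sensitivity:.4f} spec={specificity:.4f} "
      f"prec={precision:.4f} F1={f1:.4f}")
```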
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods

 Holdout method
 The given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sampling: a variation of holdout
 Repeat holdout k times; accuracy = average of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular; see the sketch after this list)
 Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
 At the i-th iteration, use Di as the test set and the others as the training set
 Leave-one-out: k folds where k = number of tuples, for small-sized data
 Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
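A sketch of k-fold cross-validation under these definitions (train_fn and the model's .predict method are a hypothetical classifier interface, not a specific library):

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, train_fn, k=10):
    """At the i-th iteration, fold i is the test set; the rest is the training set."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for i in range(k):
        test = set(folds[i])
        train_X = [data[j] for j in range(len(data)) if j not in test]
        train_y = [labels[j] for j in range(len(data)) if j not in test]
        model = train_fn(train_X, train_y)
        correct = sum(model.predict(data[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k          # average accuracy over the k folds
```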
Evaluating Classifier Accuracy: Bootstrap

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 There are several bootstrap methods; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data ends up in the bootstrap, and the remaining 36.8% forms the test set (since (1 - 1/d)^d ≈ e^-1 = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is the average over the repetitions
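A minimal sketch of one .632-style bootstrap split (plain Python; indices are sampled with replacement):

```python
import random

def bootstrap_split(data, seed=None):
    """Draw d tuples with replacement as the training set;
    tuples never drawn form the test set (~36.8% on average)."""
    rng = random.Random(seed)
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]   # sample d times, with replacement
    test_idx = set(range(d)) - set(train_idx)
    return train_idx, sorted(test_idx)

train, test = bootstrap_split(list(range(1000)), seed=0)
print(len(set(train)) / 1000, len(test) / 1000)   # ≈ 0.632 distinct in training, ≈ 0.368 in test
```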
Estimating Confidence Intervals: Classifier Models M1 vs. M2

 Suppose we have two classifiers, M1 and M2. Which one is better?
 Use 10-fold cross-validation to obtain the mean error rate of each model
 These mean error rates are just estimates of the error on the true population of future data cases
 What if the difference between the two error rates is just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates

Estimating Confidence Intervals: Null Hypothesis

 Perform 10-fold cross-validation
 Assume the samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
 Use the t-test (Student's t-test)
 Null hypothesis: M1 and M2 are the same
 If we can reject the null hypothesis, then we conclude that the difference between M1 and M2 is statistically significant, and choose the model with the lower error rate
Ensemble Methods: Increasing the Accuracy

 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of classifiers
 Boosting: weighted vote with a collection of classifiers
 Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation

 Analogy: diagnosis based on multiple doctors' majority vote
 Training
 Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X (see the sketch after this list)
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proven improved accuracy in prediction
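A sketch of the bagging loop (plain Python; `learn` mapping a list of (x, y) tuples to a model with a .predict method is a hypothetical interface):

```python
import random
from collections import Counter

def bagging_train(D, learn, k=11, seed=0):
    """Learn k models, each on a bootstrap sample Di of D with |Di| = |D|."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        Di = [D[rng.randrange(len(D))] for _ in range(len(D))]  # sample with replacement
        models.append(learn(Di))
    return models

def bagging_classify(models, x):
    """Each model Mi votes; the class with the most votes is assigned to x."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```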
Boosting

 Analogy: consult several doctors, based on a combination of weighted diagnoses, where each weight is assigned based on previous diagnosis accuracy
 How boosting works:
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned
 After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
 The boosting algorithm can be extended for numeric prediction
 Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)

 Given a set of d class-labeled tuples (X1, y1), ..., (Xd, yd)
 Initially, all tuple weights are set to the same value (1/d)
 Generate k classifiers in k rounds. At round i:
 Tuples from D are sampled (with replacement) to form a training set Di of the same size
 Each tuple's chance of being selected is based on its weight
 A classification model Mi is derived from Di
 Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased; otherwise it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
   error(Mi) = Σj wj × err(Xj)
 The weight of classifier Mi's vote is
   log( (1 - error(Mi)) / error(Mi) )
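A sketch of this loop in plain Python (`learn` returning a model with .predict is a hypothetical interface; for simplicity the weighted error here is evaluated over all of D, matching the error(Mi) formula above):

```python
import math
import random

def adaboost(D, learn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(D)
    w = [1.0 / d] * d                              # initially all weights are 1/d
    models, alphas = [], []
    for _ in range(k):
        Di = rng.choices(D, weights=w, k=d)        # selection chance follows tuple weight
        Mi = learn(Di)
        miss = [1 if Mi.predict(x) != y else 0 for x, y in D]   # err(Xj) per tuple
        error = sum(wj * mj for wj, mj in zip(w, miss))         # error(Mi)
        if error == 0 or error >= 0.5:             # degenerate round: stop early
            break
        alpha = math.log((1 - error) / error)      # weight of Mi's vote
        # Increase weights of misclassified tuples, then renormalize
        # (correctly classified tuples shrink relatively)
        w = [wj * math.exp(alpha * mj) for wj, mj in zip(w, miss)]
        total = sum(w)
        w = [wj / total for wj in w]
        models.append(Mi)
        alphas.append(alpha)
    return models, alphas
```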
Random Forest (Breiman 2001)

 Random forest:
 Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split
 During classification, each tree votes and the most popular class is returned
 Two methods to construct a random forest:
 Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
 Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (this reduces the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets

 Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
 Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
 Typical methods for imbalanced data in 2-class classification:
 Oversampling: re-sampling of data from the positive class
 Under-sampling: randomly eliminate tuples from the negative class
 Threshold-moving: move the decision threshold t so that the rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
 Ensemble techniques: ensembles of multiple classifiers, as introduced above
 The class-imbalance problem remains difficult for multiclass tasks
