Module 3 - Classification

This module provides an overview of classification in data mining, detailing the process of building a model from a training set to classify unseen records accurately. It discusses various classification techniques, including Bayesian classifiers and decision trees, and explains the use of Bayes' theorem in predicting class membership probabilities. It also highlights the advantages and disadvantages of the Naïve Bayesian classifier and introduces decision trees for classification tasks.
CLASSIFICATION

CLASSIFICATION: DEFINITION
 Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
CLASSIFICATION

 Maps data into predefined groups or classes.
 Two-step process:
 Training: a model is built describing a predetermined set of data classes (supervised learning).
 Classification: the accuracy of the model is first estimated; the model is then used to classify/predict new data.
ILLUSTRATING CLASSIFICATION TASK

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The learning algorithm builds a model from the training set (induction); the model is then applied to the test set to assign a class to each unseen record (deduction).
Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured? Since rank = 'professor', the model predicts tenured = 'yes'.
CLASSIFICATION TECHNIQUES

 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
 Foundation: based on Bayes' theorem.
 Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
 Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Bayesian Theorem: Basics

 Let X be a data sample ("evidence"); its class label is unknown
 Let H be the hypothesis that X belongs to class C
 Classification is to determine P(H|X): the probability that the hypothesis holds given the observed data sample X, i.e., the probability that tuple X belongs to class C given the attribute description of X
 P(H) (prior probability): the initial probability
 E.g., the probability that X will buy a computer, regardless of age, income, ...
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
   P(H|X) = P(X|H) P(H) / P(X)
 Informally, this can be written as:
   posteriori = likelihood x prior / evidence
 Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn)
 Suppose there are m classes C1, C2, ..., Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes' theorem:
   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only
   P(X|Ci) P(Ci)
 needs to be maximized
Naïve Bayesian Classifier: Training Dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31...40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31...40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31...40   medium  no       excellent      yes
31...40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example

Step 1: Compute P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

Step 2: Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Step 3: Compute P(X|Ci):
P(X|buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

Step 4: Compute P(X|Ci) * P(Ci):
P(X|buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
P(X|buys_computer = "no") * P(buys_computer = "no") = 0.019 x 0.357 = 0.007

Therefore, X belongs to class buys_computer = "yes".
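These four steps can be reproduced with a short, self-contained Python sketch (a minimal illustration, with the training table above hard-coded; attribute order is age, income, student, credit_rating):

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def classify(x):
    class_counts = Counter(row[-1] for row in data)        # Step 1: class priors P(Ci)
    n = len(data)
    scores = {}
    for c, count in class_counts.items():
        rows = [row for row in data if row[-1] == c]
        likelihood = 1.0
        for i, value in enumerate(x):                      # Steps 2-3: P(X|Ci) = product of P(xk|Ci)
            likelihood *= sum(1 for row in rows if row[i] == value) / count
        scores[c] = likelihood * count / n                 # Step 4: P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

label, scores = classify(("<=30", "medium", "yes", "fair"))
print(label, scores)   # 'yes', with scores approx {'yes': 0.028, 'no': 0.007}
```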


Avoiding the 0-Probability Problem

 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
   P(X|Ci) = Πk P(xk|Ci), k = 1..n
 Laplacian correction (or Laplacian estimator): add 1 to each case
 Ex.: suppose a class buys_computer = yes in some training dataset has 1000 tuples,
 with 0 tuples for income = low,
 990 for income = medium,
 and 10 for income = high
 With the Laplacian correction:
   Prob(income = low) = 1/1003
   Prob(income = medium) = 991/1003
   Prob(income = high) = 11/1003
 The "corrected" probability estimates are close to their "uncorrected" counterparts, but none is zero
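A one-function sketch of the correction (assuming q distinct values of the attribute, so each count gets +1 and the class total gets +q; here q = 3):

```python
def laplace_prob(count, class_total, num_values):
    # Add 1 to each value's count; add the number of distinct values to the denominator
    return (count + 1) / (class_total + num_values)

counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples of class 'yes'
probs = {v: laplace_prob(c, 1000, len(counts)) for v, c in counts.items()}
print(probs)  # low = 1/1003, medium = 991/1003, high = 11/1003 -- none is zero
```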


Naïve Bayesian Classifier: Comments

 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption of class conditional independence, which causes a loss of accuracy
 In practice, dependencies exist among variables
 E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve Bayesian classifier
 How to deal with these dependencies? Bayesian Belief Networks
Example - Naïve Bayes

• Example: Play Tennis. Classify the new instance:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Example

• Learning Phase

Outlook   Play=Yes  Play=No
Sunny     2/9       3/5
Overcast  4/9       0/5
Rain      3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity  Play=Yes  Play=No
High      3/9       4/5
Normal    6/9       1/5

Wind    Play=Yes  Play=No
Strong  3/9       3/5
Weak    6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example

• Test Phase
– Given a new instance,
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9       P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9    P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9       P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9         P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                    P(Play=No) = 5/14
– MAP rule:
P(Yes|x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x'): [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given that P(Yes|x') < P(No|x'), we label x' as "No".

Decision tree

 A decision tree is a flowchart-like tree structure, where:
 Each internal node denotes a test on an attribute
 Each branch denotes an outcome of the test
 Each leaf node represents a class
 To classify an unknown sample, the attribute values of the sample are tested against the decision tree.
EXAMPLE OF A DECISION TREE

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
ANOTHER EXAMPLE OF DECISION TREE

For the same training data:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK

The same induction/deduction flow as in the earlier illustration: a tree induction algorithm learns a decision tree from the training set (Tid 1-10), and the tree is then applied to assign class labels to the test set (Tid 11-15).
APPLY MODEL TO TEST DATA

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the record at each node:
 Refund = No -> take the "No" branch to MarSt
 MarSt = Married -> reach the leaf NO

Assign Cheat to "No".
DECISION TREE INDUCTION

Many Algorithms:
 Hunt's Algorithm (one of the earliest)
 CART (Classification And Regression Trees)
 ID3 (Iterative Dichotomiser), C4.5
 SLIQ, SPRINT
TREE INDUCTION

 Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
 Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting
HOW TO SPECIFY TEST CONDITION?

 Depends on attribute types
 Nominal
 Ordinal
 Continuous
 Depends on number of ways to split
 2-way split
 Multi-way split
SPLITTING BASED ON NOMINAL ATTRIBUTES

 Multi-way split: use as many partitions as distinct values.
   CarType? -> Family | Sports | Luxury
 Binary split: divides values into two subsets; need to find the optimal partitioning.
   CarType? -> {Sports, Luxury} | {Family}   OR   CarType? -> {Family, Luxury} | {Sports}
SPLITTING BASED ON ORDINAL ATTRIBUTES

 Multi-way split: use as many partitions as distinct values.
   Size? -> Small | Medium | Large
 Binary split: divides values into two subsets; need to find the optimal partitioning.
   Size? -> {Small, Medium} | {Large}   OR   Size? -> {Small} | {Medium, Large}
 What about this split?
   Size? -> {Small, Large} | {Medium}
   It violates the order of the ordinal values, so it is generally not allowed.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static: discretize once at the beginning
 Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
 Binary decision: (A < v) or (A >= v)
 Consider all possible splits and find the best cut
 Can be more compute intensive
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

(i) Binary split:  Taxable Income > 80K?  -> Yes | No
(ii) Multi-way split:  Taxable Income?  -> < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
HOW TO DETERMINE THE BEST SPLIT

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions:
 Own Car?  (Yes: C0=6, C1=4 | No: C0=4, C1=6)
 Car Type?  (Family: C0=1, C1=3 | Sports: C0=8, C1=0 | Luxury: C0=1, C1=7)
 Student ID?  (c1: C0=1, C1=0 | ... | c10: C0=1, C1=0 | c11: C0=0, C1=1 | ... | c20: C0=0, C1=1)

Which test condition is the best? The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. Impurity measures include Entropy(t), Gini(t), Classification error(t), etc.
HOW TO DETERMINE THE BEST SPLIT

 Greedy approach: nodes with a homogeneous class distribution are preferred.
 Need a measure of node impurity:
   C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
   C0: 9, C1: 1  -> homogeneous, low degree of impurity
CHOOSING ATTRIBUTES

• The order in which attributes are chosen determines how complicated the tree is.
• ID3 uses information theory to determine the most informative attribute.
• A measure of the information content of a message is the inverse of the probability of receiving the message:
   information(M) = 1/probability(M)
• Taking logs (base 2) makes information correspond to the number of bits required to encode a message:
   information(M) = -log2(probability(M))
ENTROPY

• Different messages have different probabilities of arrival.
• The overall level of uncertainty (termed entropy) is:
   -Σi Pi log2 Pi
• Frequency can be used as a probability estimate.
• E.g., if there are 5 positive examples and 3 negative examples in a node, the estimated probability of positive is 5/8 = 0.625.
SPLITTING CRITERION

 Work out the entropy based on the distribution of classes.
 Try splitting on each attribute.
 Work out the expected information gain for each attribute.
 Choose the best attribute.
ID3 ALGORITHM

• Constructs a decision tree using a top-down recursive approach.
• The main aim is to choose the splitting attribute with the highest information gain.
• The tree starts as a single node representing all the training samples.
• If all samples are of the same class, the node becomes a leaf and is labeled with that class.
• Otherwise, an entropy-based measure known as information gain is used to select the attribute that will best separate the samples into individual classes. This attribute becomes the test attribute at the node.
• A branch is constructed for each value of the test attribute, and the samples are partitioned accordingly.
• The algorithm recursively applies the same process to form a decision tree for the samples at each branch.
• The recursive partitioning stops when either:
  • the node is a leaf (all samples belong to the same class), or
  • there are no remaining splitting attributes.
CALCULATE INFORMATION GAIN

Entropy: given a collection S of examples from c classes,
   Entropy(S) = Σ -p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I, and the sum Σ runs over the c classes.

Gain(S, A), the information gain of example set S on attribute A, is defined as
   Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)
where:
 the sum Σ runs over each value v of all possible values of attribute A
 Sv = subset of S for which attribute A has value v
 |Sv| = number of elements in Sv
 |S| = number of elements in S
Formula for ID3 algorithm

 The expected information needed to classify a tuple in D is given by
   Info(D) = -Σ (i=1..m) pi log2(pi)
 where pi is the probability that a tuple in D belongs to class Ci; this is the Entropy(S) defined above.
TRAINING SET
RID Age Income Student Credit Buys
1 <30 High No Fair No
2 <30 High No Excellent No
3 31-40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31-40 Low Yes Excellent Yes
8 <30 Medium No Fair No
9 <30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <30 Medium Yes Excellent Yes
12 31-40 Medium No Excellent Yes
13 31-40 High Yes Fair Yes
14 >40 Medium No Excellent No
S is a collection of 14 examples with 9 YES and 5 NO examples, so
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Let us consider Age as the splitting attribute:
Gain(S, Age) = Entropy(S) - (5/14)*Entropy(S<30) - (4/14)*Entropy(S31-40) - (5/14)*Entropy(S>40)
             = 0.940 - (5/14)*0.971 - (4/14)*0 - (5/14)*0.971
             = 0.246
Similarly, consider Student as the splitting attribute:
Gain(S, Student) = Entropy(S) - (7/14)*Entropy(SYES) - (7/14)*Entropy(SNO)
                 = 0.151
Similarly, consider Credit as the splitting attribute:
Gain(S, Credit) = Entropy(S) - (8/14)*Entropy(SFAIR) - (6/14)*Entropy(SEXCELLENT)
                = 0.048
Similarly, consider Income as the splitting attribute:
Gain(S, Income) = Entropy(S) - (4/14)*Entropy(SHIGH) - (6/14)*Entropy(SMED) - (4/14)*Entropy(SLOW)
                = 0.029
We see that Gain(S, Age) is maximal at 0.246. Hence Age is chosen as the splitting attribute.

Age?
  <30   -> 5 samples
  31-40 -> YES
  >40   -> 5 samples
Recursively, we find the splitting attributes at the next level.
Let us consider Student as the next splitting attribute for the Age<30 branch:
Gain(Age<30, Student) = Entropy(Age<30) - (2/5)*Entropy(SYES) - (3/5)*Entropy(SNO)
                      = 0.971 - (2/5)*0 - (3/5)*0
                      = 0.971
Similarly, Credit as the next splitting attribute:
Gain(Age<30, Credit) = Entropy(Age<30) - (3/5)*Entropy(SFAIR) - (2/5)*Entropy(SEXCELLENT)
                     = 0.0202
Similarly, Income as the next splitting attribute:
Gain(Age<30, Income) = Entropy(Age<30) - (2/5)*Entropy(SHIGH) - (2/5)*Entropy(SMED) - (1/5)*Entropy(SLOW)
                     = 0.571
We see that Gain(Age<30, Student) is maximal at 0.971, so Student is chosen as the next splitting attribute. Similarly, we find that Gain(Age>40, Credit) is maximal. Hence the tree:

Age?
  <30   -> Student?
             yes -> YES
             no  -> NO
  31-40 -> YES
  >40   -> Credit?
             fair      -> YES
             excellent -> NO
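The entropy and gain numbers above can be checked with a small Python sketch (a minimal illustration with the 14-tuple TRAINING SET hard-coded; results match the slides up to rounding):

```python
from math import log2
from collections import Counter

# (Age, Income, Student, Credit, Buys) - the TRAINING SET table above
rows = [
    ("<30","High","No","Fair","No"), ("<30","High","No","Excellent","No"),
    ("31-40","High","No","Fair","Yes"), (">40","Medium","No","Fair","Yes"),
    (">40","Low","Yes","Fair","Yes"), (">40","Low","Yes","Excellent","No"),
    ("31-40","Low","Yes","Excellent","Yes"), ("<30","Medium","No","Fair","No"),
    ("<30","Low","Yes","Fair","Yes"), (">40","Medium","Yes","Fair","Yes"),
    ("<30","Medium","Yes","Excellent","Yes"), ("31-40","Medium","No","Excellent","Yes"),
    ("31-40","High","Yes","Fair","Yes"), (">40","Medium","No","Excellent","No"),
]

def entropy(subset):
    # Entropy(S) = sum over classes of -p log2(p)
    counts = Counter(r[-1] for r in subset)
    total = len(subset)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(subset, attr):
    # Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv)
    total = len(subset)
    remainder = 0.0
    for v in set(r[attr] for r in subset):
        sv = [r for r in subset if r[attr] == v]
        remainder += (len(sv) / total) * entropy(sv)
    return entropy(subset) - remainder

print(round(entropy(rows), 3))              # 0.94
for i, name in enumerate(["Age", "Income", "Student", "Credit"]):
    print(name, round(gain(rows, i), 3))
    # Age 0.247, Income 0.029, Student 0.152, Credit 0.048
    # (the slides quote 0.246 and 0.151 after rounding intermediate values)
```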
EXTRACTING CLASSIFICATION RULES FROM TREES

 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
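A small sketch of this path-to-rule traversal, assuming a hypothetical nested-dict representation of the tree (internal node = {attribute: {value: subtree}}, leaf = class label); the tree below encodes the buys_computer tree derived earlier:

```python
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, path=()):
    if not isinstance(node, dict):              # leaf: one rule per root-to-leaf path
        conds = " AND ".join(f'{attr} = "{val}"' for attr, val in path)
        print(f'IF {conds} THEN buys_computer = "{node}"')
        return
    (attr, branches), = node.items()            # single test attribute at this node
    for value, subtree in branches.items():     # one branch per attribute value
        extract_rules(subtree, path + ((attr, value),))

extract_rules(tree)   # prints the five rules listed above
```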
CONSTRUCT A DECISION TREE FOR THE BELOW EXAMPLE:

 Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2 weeks, data is collected to help ID3 build a decision tree. The target classification is "should we play baseball?", which can be yes or no.
 The weather attributes can have the following values:
 outlook = {sunny, overcast, rain}
 temperature = {hot, mild, cool}
 humidity = {high, normal}
 wind = {weak, strong}
Table 1

Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
We need to find which attribute will be the root node of our decision tree. The gain is calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048

For example, for Wind:
Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Sweak) - (6/14)*Entropy(Sstrong)
              = 0.940 - (8/14)*0.811 - (6/14)*1.00
              = 0.048
Entropy(Sweak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00

The Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node.
Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is: "what attribute should be tested at the Sunny branch node?" Since we have used Outlook at the root, we only decide among the remaining three attributes: Humidity, Temperature, or Wind.
Ssunny = {D1, D2, D8, D9, D11} = the 5 examples from Table 1 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.
Tree Pruning

 Some branches of the decision tree may reflect anomalies due to noise/outliers.
 Overfitting results in decision trees that are more complex than necessary.
 Training error then no longer provides a good estimate of how well the tree will perform on previously unseen records.
 Tree pruning helps in faster classification with more accurate results.
 Two methods:
 Pre-pruning: halt tree construction during formation.
 Post-pruning: remove branches from a fully grown tree.
Tree Pruning

 Pre-pruning (early stopping rule)
 Stop the algorithm before it grows a full tree
 Typical stopping conditions for a node:
 Stop if all instances belong to the same class
 Stop if all the attribute values are the same
 More restrictive conditions:
 Stop if the number of instances is less than some user-specified threshold
 Stop if the class distribution of instances is independent of the available features (e.g., using a chi-squared test)
 Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

 Post-pruning
 Grow the decision tree to its entirety
 Trim the nodes of the decision tree in a bottom-up fashion
 If the generalization error improves after trimming, replace the sub-tree by a leaf node
 The class label of the leaf node is determined from the majority class of instances in the sub-tree
 A combined approach may be used, as no single pruning method has been found to be superior to the others
 Decision trees suffer from repetition and replication, which can be solved by multivariate splits
WHAT IS PREDICTION?

 (Numerical) prediction is similar to classification:
 construct a model
 use the model to predict a continuous or ordered value for a given input
 Prediction is different from classification:
 Classification refers to predicting a categorical class label
 Prediction models continuous-valued functions
 The major method for prediction is regression:
 model the relationship between one or more independent (predictor) variables and a dependent (response) variable
 Regression analysis:
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
LINEAR REGRESSION

 Linear regression: involves a response variable y and a single predictor variable x:
   y = w0 + w1 x
 where w0 (y-intercept) and w1 (slope) are the regression coefficients.
 Method of least squares: estimates the best-fitting straight line:
   w1 = [Σ (i=1..|D|) (xi - x̄)(yi - ȳ)] / [Σ (i=1..|D|) (xi - x̄)²]
   w0 = ȳ - w1 x̄
 Multiple linear regression: involves more than one predictor variable
 Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
 Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
 Solvable by an extension of the least squares method or using SAS, S-Plus
 Many nonlinear functions can be transformed into the above
Example: Tips for service

Meal (#)  Tip Amount (Rs.)
1         5.00
2         17.00
3         11.00
4         8.00
5         14.00
6         5.00

(Figure: scatter plot of tip amount vs. meal number, with the best-fit line at ȳ = 10.)

With only one variable, and no other information, the best prediction for the next measurement is the mean of the sample itself (ȳ = 10). The variability in tip amounts must be explained by the tips themselves.
"GOODNESS OF FIT" FOR THE TIPS

Meal (#)  Tip Amount (Rs.)
1         5.00
2         17.00
3         11.00
4         8.00
5         14.00
6         5.00

With ȳ = 10, the RESIDUALS are the differences between each observed tip and the mean: -5, 7, 1, -2, 4, -5.
SQUARING THE RESIDUALS (ERRORS)

Why square the residuals?
1. They become positive.
2. Squaring emphasizes larger deviations.
Sum of Squared Errors (SSE) = 25 + 49 + 1 + 4 + 16 + 25 = 120
SQUARING THE RESIDUALS (ERRORS)

The goal of simple linear regression is to create a linear model that minimizes the sum of squared residuals/errors (SSE). If our regression model is significant, it will "eat up" much of the raw SSE. Our aim should be to develop a regression line that fits with minimum SSE.
QUICK REVIEW

 If there is only one variable, the best prediction is the mean.
 The difference between an actual value and a predicted value is the RESIDUAL, or ERROR.
 The residuals are squared to obtain the SSE.
 Simple linear regression is designed to find the best-fitting line through the data, the one that minimizes the SSE.
BIVARIATE STATISTICS

Common bivariate techniques: CORRELATION, ANOVA, and LINEAR REGRESSION.

LINEAR REGRESSION
y = m.X + c
 The value of one variable is a function of the other variable.
 The value of y is a function of x, i.e., y = f(x)
 X is the independent variable, Y is the dependent variable
SOME CONCEPTS FROM ALGEBRA

 y = m.X + b
 X: random variable
 m: slope of the line = rise / run
 b: y-intercept, the value of y when x = 0
SIMPLE LINEAR REGRESSION MODEL

y = ß0 + ß1 X + ɛ
where
ß0 = y-intercept (population parameter)
ß1 = slope (population parameter)
ɛ = error term capturing variation in y from unknown reasons

The simple regression model is E(y) = ß0 + ß1 X
GENERAL REGRESSION LINES

 The accuracy of the E(y) values depends upon the mean value of y and also on the distribution of y.

REGRESSION EQUATION WITH ESTIMATES

The estimated regression line is ŷ = b0 + b1 x, where b0 and b1 are sample estimates of ß0 and ß1.

WHEN THE SLOPE B1 = 0

When ß1 = 0, E(y) does not depend on x: the regression line is horizontal at the mean of y.
GETTING READY FOR LEAST SQUARES

Meal (#)  Bill Amount (Rs)  Tip Amount (Rs.)
1         34.00             5.00
2         108.00            17.00
3         64.00             11.00
4         88.00             8.00
5         99.00             14.00
6         51.00             5.00

We want to know to what degree the tip amount can be predicted by the bill. TIP is the dependent variable; Bill is the independent variable.
LEAST SQUARE CRITERIA

Choose the estimates b0 and b1 so as to minimize the sum of squared residuals, Σ (yi - ŷi)², where ŷi = b0 + b1 xi.
STEP I: DRAW SCATTER PLOT

(Figure: scatter plot of tip amount in Rs. against bill amount in Rs. for the six meals above.)
STEP II: LOOK FOR A VISUAL LINE

(Figure: the same scatter plot, with a straight line eyeballed through the points.)
STEP III: CORRELATION (OPTIONAL)

What is the correlation coefficient r? In our example, r = 0.866.
Is the relationship strong? Yes, in our case!
PEARSON CORRELATION

 Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson correlation. The full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter "r" for a sample.
 The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.
PEARSON CORRELATION

The results will be between -1 and 1. You will very rarely see 0, -1 or 1; you'll get a number somewhere in between those values. The closer the value of r gets to zero, the greater the variation of the data points around the line of best fit.
High correlation: .5 to 1.0 or -0.5 to -1.0.
Medium correlation: .3 to .5 or -0.3 to -0.5.
Low correlation: .1 to .3 or -0.1 to -0.3.
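A short plain-Python sketch (using the bill/tip data from the tips example) that reproduces the r = 0.866 quoted above:

```python
from math import sqrt

bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of deviation products
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r(bills, tips), 3))   # 0.866
```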
STEP IV: DESCRIPTIVE STATISTICS / CENTROID

Meal (#)  Bill Amount (Rs)  Tip Amount (Rs.)
1         34.00             5.00
2         108.00            17.00
3         64.00             11.00
4         88.00             8.00
5         99.00             14.00
6         51.00             5.00

x̄ (mean bill) = 74    ȳ (mean tip) = 10
The point (x̄, ȳ) is the centroid; the least squares line always passes through it.
STEP V: CALCULATIONS

MEAL  Total Bill  Tip Amt.  Bill Deviation  Tip Deviation  Deviation Products  Bill Deviation Squared
1     34          5         -40             -5             200                 1600
2     108         17        34              7              238                 1156
3     64          11        -10             1              -10                 100
4     88          8         14              -2             -28                 196
5     99          14        25              4              100                 625
6     51          5         -23             -5             115                 529
Sum                                                        615                 4206

B1 CALCULATION

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 615 / 4206 ≈ 0.146

B0 CALCULATION

b0 = ȳ - b1·x̄ = 10 - 0.146 × 74 ≈ -0.82

YOUR REGRESSION LINE

ŷ = -0.82 + 0.146x, i.e., predicted Tip ≈ -0.82 + 0.146 × Bill
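The same calculation as a compact Python sketch (a minimal illustration over the six meals; a library routine such as numpy.polyfit would give the same coefficients):

```python
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

n = len(bills)
x_bar = sum(bills) / n            # 74
y_bar = sum(tips) / n             # 10

# Least squares estimates of slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips)) \
     / sum((x - x_bar) ** 2 for x in bills)       # 615 / 4206 ≈ 0.146
b0 = y_bar - b1 * x_bar                           # ≈ -0.82

def predict(bill):
    return b0 + b1 * bill

print(round(b1, 3), round(b0, 2), round(predict(100), 2))  # 0.146 -0.82 13.8
```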
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

Actual class \ Predicted class    C1                    ¬C1
C1                                True Positives (TP)   False Negatives (FN)
¬C1                               False Positives (FP)  True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                6954                  46                   7000
buy_computer = no                 412                   2588                 3000
Total                             7366                  2634                 10000

 Given m classes, an entry CMi,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   C    ¬C
C       TP   FN    P
¬C      FP   TN    N
        P'   N'    All

 Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified:
   Accuracy = (TP + TN)/All
 Error rate: 1 - accuracy, or
   Error rate = (FP + FN)/All
 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate; Sensitivity = TP/P
 Specificity: true negative recognition rate; Specificity = TN/N
Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
   Precision = TP/(TP + FP)
 Recall (completeness): what % of positive tuples did the classifier label as positive?
   Recall = TP/(TP + FN)
 A perfect score is 1.0
 There is an inverse relationship between precision and recall
 F measure (F1 or F-score): harmonic mean of precision and recall:
   F1 = (2 x Precision x Recall)/(Precision + Recall)
 Fß: weighted measure of precision and recall that assigns ß times as much weight to recall as to precision:
   Fß = ((1 + ß²) x Precision x Recall)/(ß² x Precision + Recall)
Classifier Evaluation Metrics: Example

Actual Class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
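To make the arithmetic concrete, a small Python sketch computing every metric above from this confusion matrix (the F1 line is an extra, using the harmonic-mean formula):

```python
# Confusion-matrix metrics for the cancer example above
TP, FN = 90, 210      # actual yes: predicted yes / predicted no
FP, TN = 140, 9560    # actual no:  predicted yes / predicted no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL        # 0.9650
error_rate  = (FP + FN) / ALL        # 0.0350
sensitivity = TP / P                 # 0.3000 (= recall)
specificity = TN / N                 # 0.9856
precision   = TP / (TP + FP)         # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ≈ 0.3396

print(f"acc={accuracy:.4f} sens={sensitivity:.4f} spec={specificity:.4f} "
      f"prec={precision:.4f} F1={f1:.4f}")
```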
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods

 Holdout method
 The given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sampling: a variation of holdout
 Repeat holdout k times; accuracy = average of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular; see the sketch after this list)
 Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
 At the i-th iteration, use Di as the test set and the others as the training set
 Leave-one-out: k folds where k = number of tuples, for small-sized data
 Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
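A sketch of k-fold cross-validation under these definitions (train_fn and the model's .predict method are a hypothetical classifier interface, not a specific library):

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, train_fn, k=10):
    """At the i-th iteration, fold i is the test set; the rest is the training set."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for i in range(k):
        test = set(folds[i])
        train_X = [data[j] for j in range(len(data)) if j not in test]
        train_y = [labels[j] for j in range(len(data)) if j not in test]
        model = train_fn(train_X, train_y)
        correct = sum(model.predict(data[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k          # average accuracy over the k folds
```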
Evaluating Classifier Accuracy: Bootstrap

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 There are several bootstrap methods; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data ends up in the bootstrap, and the remaining 36.8% forms the test set (since (1 - 1/d)^d ≈ e^-1 = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is the average over the repetitions
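A minimal sketch of one .632-style bootstrap split (plain Python; indices are sampled with replacement):

```python
import random

def bootstrap_split(data, seed=None):
    """Draw d tuples with replacement as the training set;
    tuples never drawn form the test set (~36.8% on average)."""
    rng = random.Random(seed)
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]   # sample d times, with replacement
    test_idx = set(range(d)) - set(train_idx)
    return train_idx, sorted(test_idx)

train, test = bootstrap_split(list(range(1000)), seed=0)
print(len(set(train)) / 1000, len(test) / 1000)   # ≈ 0.632 distinct in training, ≈ 0.368 in test
```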
Estimating Confidence Intervals: Classifier Models M1 vs. M2

 Suppose we have two classifiers, M1 and M2. Which one is better?
 Use 10-fold cross-validation to obtain the mean error rate of each model
 These mean error rates are just estimates of the error on the true population of future data cases
 What if the difference between the two error rates is just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates

Estimating Confidence Intervals: Null Hypothesis

 Perform 10-fold cross-validation
 Assume the samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
 Use the t-test (Student's t-test)
 Null hypothesis: M1 and M2 are the same
 If we can reject the null hypothesis, then we conclude that the difference between M1 and M2 is statistically significant, and choose the model with the lower error rate
Ensemble Methods: Increasing the Accuracy

 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of classifiers
 Boosting: weighted vote with a collection of classifiers
 Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation

 Analogy: diagnosis based on multiple doctors' majority vote
 Training
 Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X (see the sketch after this list)
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proven improved accuracy in prediction
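A sketch of the bagging loop (plain Python; `learn` mapping a list of (x, y) tuples to a model with a .predict method is a hypothetical interface):

```python
import random
from collections import Counter

def bagging_train(D, learn, k=11, seed=0):
    """Learn k models, each on a bootstrap sample Di of D with |Di| = |D|."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        Di = [D[rng.randrange(len(D))] for _ in range(len(D))]  # sample with replacement
        models.append(learn(Di))
    return models

def bagging_classify(models, x):
    """Each model Mi votes; the class with the most votes is assigned to x."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```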
Boosting

 Analogy: consult several doctors, based on a combination of weighted diagnoses, where each weight is assigned based on previous diagnosis accuracy
 How boosting works:
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned
 After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
 The boosting algorithm can be extended for numeric prediction
 Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)

 Given a set of d class-labeled tuples (X1, y1), ..., (Xd, yd)
 Initially, all tuple weights are set to the same value (1/d)
 Generate k classifiers in k rounds. At round i:
 Tuples from D are sampled (with replacement) to form a training set Di of the same size
 Each tuple's chance of being selected is based on its weight
 A classification model Mi is derived from Di
 Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased; otherwise it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
   error(Mi) = Σj wj × err(Xj)
 The weight of classifier Mi's vote is
   log( (1 - error(Mi)) / error(Mi) )
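A sketch of this loop in plain Python (`learn` returning a model with .predict is a hypothetical interface; for simplicity the weighted error here is evaluated over all of D, matching the error(Mi) formula above):

```python
import math
import random

def adaboost(D, learn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(D)
    w = [1.0 / d] * d                              # initially all weights are 1/d
    models, alphas = [], []
    for _ in range(k):
        Di = rng.choices(D, weights=w, k=d)        # selection chance follows tuple weight
        Mi = learn(Di)
        miss = [1 if Mi.predict(x) != y else 0 for x, y in D]   # err(Xj) per tuple
        error = sum(wj * mj for wj, mj in zip(w, miss))         # error(Mi)
        if error == 0 or error >= 0.5:             # degenerate round: stop early
            break
        alpha = math.log((1 - error) / error)      # weight of Mi's vote
        # Increase weights of misclassified tuples, then renormalize
        # (correctly classified tuples shrink relatively)
        w = [wj * math.exp(alpha * mj) for wj, mj in zip(w, miss)]
        total = sum(w)
        w = [wj / total for wj in w]
        models.append(Mi)
        alphas.append(alpha)
    return models, alphas
```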
Random Forest (Breiman 2001)

 Random forest:
 Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split
 During classification, each tree votes and the most popular class is returned
 Two methods to construct a random forest:
 Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
 Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (this reduces the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
Classification of Class-Imbalanced Data Sets

 Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
 Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
 Typical methods for imbalanced data in 2-class classification:
 Oversampling: re-sampling of data from the positive class
 Under-sampling: randomly eliminate tuples from the negative class
 Threshold-moving: move the decision threshold t so that the rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
 Ensemble techniques: ensembles of multiple classifiers, as introduced above
 The class-imbalance problem remains difficult for multiclass tasks
