Classification: Basic Concepts and Decision Trees
A programming task
Classification: Definition
Given a collection of records (training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
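To make this train/test workflow concrete, here is a minimal sketch; the record layout of (attributes, class_label) pairs and the predict callback are assumptions for illustration, not something prescribed by the slides.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly partition labeled records into a training set and a test set."""
    shuffled = list(records)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

def accuracy(predict, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if predict(attributes) == label)
    return correct / len(test_set)
```

Any of the classifiers discussed later (KNN, decision trees, ...) can play the role of predict once it has been learned from the training set.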
Illustrating Classification Task

Training set (used to learn the model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (the learned model is applied to assign the unknown class labels):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Using Distance
Place items in the class to which they are "closest".
Must determine the distance between an item and a class.
Classes can be represented by:
Centroid: central value.
Medoid: representative point.
Individual points.
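As one concrete reading of the centroid option, here is a minimal sketch assuming purely numeric attribute vectors; the helper names are illustrative, not from the slides.

```python
import math
from collections import defaultdict

def class_centroids(training_set):
    """Represent each class by its centroid: the mean of its attribute vectors."""
    sums, counts = defaultdict(list), defaultdict(int)
    for attributes, label in training_set:
        if not sums[label]:
            sums[label] = list(attributes)
        else:
            sums[label] = [s + a for s, a in zip(sums[label], attributes)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}

def nearest_centroid(item, centroids):
    """Place the item in the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(item, centroids[label]))
```

The medoid and individual-point representations differ only in which stored vectors the distance is measured against.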
Algorithm: K Nearest Neighbors (KNN)
Advantages:
Nonparametric architecture
Simple
Powerful
Requires no training time
Disadvantages:
Memory intensive
Classification/estimation is slow
K Nearest Neighbors
The key issues involved in training this model include setting:
the variable K
validation techniques (e.g., cross-validation)
the type of distance metric, e.g. the Euclidean measure:

$$\mathrm{Dist}(X, Y) = \sum_{i=1}^{D} (X_i - Y_i)^2$$
[Figure: K Nearest Neighbors example — stored training-set patterns, an input pattern X to be classified, and dashed lines showing the Euclidean distances to the three nearest patterns.]
Store all input data in the training set.
For each pattern in the test set:
Search for the K nearest patterns to the input pattern using a Euclidean distance measure.
For classification, compute the confidence for each class as Ci / K, where Ci is the number of patterns among the K nearest patterns belonging to class i.
The classification for the input pattern is the class with the highest confidence.
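A minimal sketch of this procedure; numeric attribute vectors and the (attribute_vector, class_label) data layout are assumptions for illustration.

```python
import math
from collections import Counter

def knn_classify(x, training_set, k=3):
    """Classify input pattern x by its K nearest stored patterns (Euclidean distance).

    training_set: list of (attribute_vector, class_label) pairs.
    Returns (predicted_class, confidence), where confidence = Ci / K.
    """
    # Find the K stored patterns closest to x.
    neighbors = sorted(training_set, key=lambda record: math.dist(x, record[0]))[:k]
    # Confidence of class i is Ci / K, with Ci the neighbors belonging to class i.
    votes = Counter(label for _, label in neighbors)
    best_class, count = votes.most_common(1)[0]
    return best_class, count / k
```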
Training parameters and typical settings
Number of nearest neighbors
The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings.
k = 1 is a good baseline model to benchmark against.
A good rule of thumb is that K should be less than the square root of the total number of training patterns.
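Here is one sketch of that cross-validation idea, using leave-one-out as one possible scheme; the candidate K values and helper names are assumptions.

```python
import math
from collections import Counter

def loo_accuracy(training_set, k):
    """Leave-one-out accuracy of a K-nearest-neighbor vote over the training set."""
    correct = 0
    for i, (x, label) in enumerate(training_set):
        others = training_set[:i] + training_set[i + 1:]          # hold out pattern i
        neighbors = sorted(others, key=lambda rec: math.dist(x, rec[0]))[:k]
        predicted = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        correct += predicted == label
    return correct / len(training_set)

def choose_k(training_set, candidates=(1, 3, 5, 7, 9)):
    """Pick the K with the best cross-validated accuracy; k = 1 is the baseline."""
    # Rule of thumb: keep candidates below sqrt(len(training_set)).
    return max(candidates, key=lambda k: loo_accuracy(training_set, k))
```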
Training parameters and typical settings
Input compression
Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification.
Using input compression usually results in slightly worse performance.
Normalization is performed to equalize the effect of each attribute.
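For the normalization step, here is a sketch of one common choice, min-max scaling (other schemes such as z-scores work equally well); in practice the same lows/highs computed on the training patterns should be reused for the test patterns.

```python
def min_max_normalize(patterns):
    """Rescale every attribute to [0, 1] so no single attribute dominates the distance."""
    n_attrs = len(patterns[0])
    lows  = [min(p[i] for p in patterns) for i in range(n_attrs)]
    highs = [max(p[i] for p in patterns) for i in range(n_attrs)]
    return [[(value - lo) / (hi - lo) if hi > lo else 0.0
             for value, lo, hi in zip(pattern, lows, highs)]
            for pattern in patterns]
```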
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes: NO
  No: MarSt?
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
    Married: NO
Another Example of Decision Tree

Same training data as above; a different tree that also fits it:

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Same workflow as the classification task illustrated earlier: learn a model from the training set (Tid 1–10), then apply the model to assign the unknown class labels of the test set (Tid 11–15). Here the model that is learned and applied is a decision tree.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
Refund = No, so follow the "No" branch to the MarSt node.
Marital Status = Married, so follow the "Married" branch, which is a leaf labeled NO.
Assign Cheat to "No".
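To make the traversal concrete, here is a hard-coded sketch of this particular tree; the dictionary keys simply mirror the column names, and in practice the tree would be induced from data rather than written by hand.

```python
def classify_cheat(record):
    """Walk the example tree: Refund -> Marital Status -> Taxable Income."""
    if record["Refund"] == "Yes":
        return "No"                                   # Refund = Yes leaf: NO
    if record["Marital Status"] == "Married":
        return "No"                                   # MarSt = Married leaf: NO
    # Single or Divorced: test Taxable Income against 80K
    return "No" if record["Taxable Income"] < 80_000 else "Yes"

# The test record from the walkthrough above: Refund = No, Married, Taxable Income = 80K
test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80_000}
print(classify_cheat(test_record))                    # -> "No"
```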
Decision Tree Classification Task (recap)

Learn the decision tree from the training set (Tid 1–10), then apply the model to assign class labels to the test records (Tid 11–15), as just illustrated.
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

General Procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data shown earlier.)
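A schematic sketch of this procedure follows; the tree representation and the best_attribute_test callback are assumptions, since choosing the attribute test condition is the topic of the following slides.

```python
from collections import Counter

def hunts_algorithm(records, default_class, best_attribute_test):
    """Grow a tree recursively from (attributes, class_label) records.

    best_attribute_test(records) is expected to return None when no useful test
    exists, or (test, partitions), where `test` maps a record to a branch key and
    `partitions` maps each branch key to the records that take that branch.
    """
    if not records:                               # Dt is empty -> leaf with default class yd
        return {"leaf": default_class}
    classes = {label for _, label in records}
    if len(classes) == 1:                         # all records in one class yt -> leaf labeled yt
        return {"leaf": classes.pop()}
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    split = best_attribute_test(records)
    if split is None:                             # no attribute test left -> majority-class leaf
        return {"leaf": majority}
    test, partitions = split
    return {"test": test,
            "children": {branch: hunts_algorithm(subset, majority, best_attribute_test)
                         for branch, subset in partitions.items()}}
```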
Hunt's Algorithm (applied to the Refund / Marital Status / Taxable Income training data)

Step 1: A single node predicting the default class, Don't Cheat.
Step 2: Split on Refund. Refund = Yes: Don't Cheat. Refund = No: Don't Cheat (still impure).
Step 3: Under Refund = No, split on Marital Status. Married: Don't Cheat. Single, Divorced: Cheat (still impure).
Step 4: Under Single, Divorced, split on Taxable Income. < 80K: Don't Cheat. >= 80K: Cheat.
Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as there are distinct values.
  CarType -> {Family} | {Sports} | {Luxury}

Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType -> {Sports, Luxury} | {Family}    or    CarType -> {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as there are distinct values.
  Size -> {Small} | {Medium} | {Large}

Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size -> {Small, Medium} | {Large}    or    Size -> {Small} | {Medium, Large}

What about this split?  Size -> {Small, Large} | {Medium}
(It groups non-adjacent values and so violates the ordering of the attribute.)
Splitting Based on Continuous Attributes

Different ways of handling:
Discretization to form an ordinal categorical attribute
  Static – discretize once at the beginning
  Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
Binary decision: (A < v) or (A ≥ v)
  considers all possible splits and finds the best cut
  can be more computationally intensive
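A small sketch of the two bucketing options mentioned above (clustering-based discretization is omitted; the function names are mine, not from the slides):

```python
def equal_width_cuts(values, k):
    """Static discretization: k - 1 cut points giving k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Percentile-style discretization: roughly the same number of values per bucket."""
    ordered = sorted(values)
    return [ordered[int(i * len(ordered) / k)] for i in range(1, k)]
```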
Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
Which test condition is the best?
How to determine the Best Split
Greedy approach:
Nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
A non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
How to Find the Best Split

Before splitting, the node has class counts C0 = N00 and C1 = N01, with impurity M0.
Candidate split A? produces Node N1 (counts N10, N11) and Node N2 (counts N20, N21), with impurities M1 and M2 and weighted impurity M12.
Candidate split B? produces Node N3 (counts N30, N31) and Node N4 (counts N40, N41), with impurities M3 and M4 and weighted impurity M34.
Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.
Measure of Impurity: GINI

Gini Index for a given node t:

$$GINI(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)

Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.

C1: 0, C2: 6  ->  Gini = 0.000
C1: 1, C2: 5  ->  Gini = 0.278
C1: 2, C2: 4  ->  Gini = 0.444
C1: 3, C2: 3  ->  Gini = 0.500
Examples for computing GINI

$$GINI(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444
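The same computation as a minimal sketch, reproducing the node examples above:

```python
def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in class_counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(gini([3, 3]))            # 0.5
```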
Splitting Based on GINI

Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as

$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$$

where n_i = number of records at child i, and n = number of records at node p.

Gain = GINI(parent) − GINI(children)
Choose the attribute that has the maximum splitting gain.
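A sketch of the weighted-split quality and the resulting gain, checked against the binary-split example worked out on the next slide:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children_counts):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

def gini_gain(parent_counts, children_counts):
    """Gain = GINI(parent) - GINI(children); choose the split that maximizes this."""
    return gini(parent_counts) - gini_split(children_counts)

# Parent [6, 6] split into children [5, 2] and [1, 4]:
print(round(gini_split([[5, 2], [1, 4]]), 3))         # 0.371
print(round(gini_gain([6, 6], [[5, 2], [1, 4]]), 3))  # 0.129
```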
Binary Attributes: Computing GINI Index

Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split B? into Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4):
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.

Multi-way split:
        Family  Sports  Luxury
  C1       1       2       1
  C2       4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
        {Sports, Luxury}  {Family}
  C1           3              1
  C2           2              4
  Gini = 0.400

        {Family, Luxury}  {Sports}
  C1           2              2
  C2           1              5
  Gini = 0.419
Continuous Attributes: Computing Gini Index

Use binary decisions based on one value.
Several choices for the splitting value:
  Number of possible splitting values = number of distinct values.
Each splitting value v has a count matrix associated with it:
  Class counts in each of the partitions, A < v and A ≥ v.
Simple method to choose the best v:
  For each v, scan the database to gather the count matrix and compute its Gini index.
  Computationally inefficient! Repetition of work.

(Illustrated on the Taxable Income attribute of the Refund / Marital Status / Taxable Income / Cheat training data.)
Continuous Attributes: Computing Gini Index...

For efficient computation, for each attribute:
  Sort the attribute on its values.
  Linearly scan these values, each time updating the count matrix and computing the Gini index.
  Choose the split position that has the least Gini index.

Sorted Taxable Income values (with Cheat label):
  60(No) 70(No) 75(No) 85(Yes) 90(Yes) 95(Yes) 100(No) 120(No) 125(No) 220(No)

Split position v and class counts (<= v | > v):

  v          55     65     72     80     87     92     97     110    122    172    230
  Yes        0|3    0|3    0|3    0|3    1|2    2|1    3|0    3|0    3|0    3|0    3|0
  No         0|7    1|6    2|5    3|4    3|4    3|4    3|4    4|3    5|2    6|1    7|0
  Gini       0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

Best split: Taxable Income ≤ 97, with Gini = 0.300.
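A sketch of this sort-then-scan search, reproducing the Taxable Income example; it skips the end-of-range candidates (55 and 230), which keep all records on one side and can never be best.

```python
def best_continuous_split(values, labels):
    """Sort once, then scan candidate cut points, updating class counts incrementally.

    Returns (threshold, gini) for the best binary test  value <= threshold.
    """
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = set(labels)
    left = {c: 0 for c in classes}                        # class counts with value <= v
    right = {c: labels.count(c) for c in classes}         # class counts with value >  v
    n = len(values)
    best_threshold, best_gini = None, float("inf")
    for pos in range(1, n):                               # candidate cuts between consecutive values
        moved = labels[order[pos - 1]]                    # record that crosses to the left side
        left[moved] += 1
        right[moved] -= 1
        lo, hi = values[order[pos - 1]], values[order[pos]]
        if lo == hi:                                      # no threshold fits between equal values
            continue
        weighted = pos / n * gini(left) + (n - pos) / n * gini(right)
        if weighted < best_gini:
            best_threshold, best_gini = (lo + hi) / 2, weighted
    return best_threshold, best_gini

# Taxable Income example from the table above (values in thousands):
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_continuous_split(income, cheat))   # (97.5, 0.3) -- the table truncates the midpoint to 97
```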
Alternative Splitting Criteria based on INFO

Entropy at a given node t:

$$Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)

Measures the homogeneity of a node.
Maximum (log nc) when records are equally distributed among all classes, implying the least information.
Minimum (0.0) when all records belong to one class, implying the most information.
Entropy-based computations are similar to the GINI index computations.
Examples for computing Entropy

$$Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = −0 log2 0 − 1 log2 1 = −0 − 0 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Entropy = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Entropy = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
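The same computation as a minimal sketch, reproducing the node examples above:

```python
import math

def entropy(class_counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log(0) taken as 0."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    probabilities = [count / n for count in class_counts if count > 0]
    return sum(-p * math.log2(p) for p in probabilities)

print(entropy([0, 6]))             # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```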
Splitting Based on INFO...

Information Gain:

$$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$$

Parent node p is split into k partitions; n_i is the number of records in partition i.

Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...

Gain Ratio:

$$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \qquad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$$

Parent node p is split into k partitions; n_i is the number of records in partition i.

Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the disadvantage of Information Gain.
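A sketch of both information-based criteria; the example split reuses the parent/children counts from the comparison example further below.

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c) if n else 0.0

def info_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(sum(child) for child in children_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    """GainRATIO_split = GAIN_split / SplitINFO; penalizes many small partitions."""
    n = sum(sum(child) for child in children_counts)
    split_info = sum(-sum(child) / n * math.log2(sum(child) / n)
                     for child in children_counts if sum(child))
    gain = info_gain(parent_counts, children_counts)
    return gain / split_info if split_info else 0.0

# Parent [7, 3] split into children [3, 0] and [4, 3]:
print(round(info_gain([7, 3], [[3, 0], [4, 3]]), 3))   # 0.192
print(round(gain_ratio([7, 3], [[3, 0], [4, 3]]), 3))  # 0.217
```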
Splitting Criteria based on Classification Error

Classification error at a node t:

$$Error(t) = 1 - \max_i P(i \mid t)$$

Measures the misclassification error made by a node.
Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
Examples for Computing Error

$$Error(t) = 1 - \max_i P(i \mid t)$$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 − max(0, 1) = 1 − 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison among Splitting Criteria

For a 2-class problem: Misclassification Error vs. Gini.

Parent: C1 = 7, C2 = 3, Gini = 0.42 (misclassification error = 0.3).
Split A? into Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3):
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
The weighted misclassification error of the children is 3/10 × 0 + 7/10 × (3/7) = 0.3, unchanged from the parent, so Gini rewards this split while classification error does not.
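A sketch that checks this contrast numerically for the same parent and children:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def error(counts):
    """Misclassification error: 1 - max_i P(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n if n else 0.0

def weighted(measure, children_counts):
    """Weighted impurity of the children: sum_i (n_i / n) * measure(child i)."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * measure(child) for child in children_counts)

parent, children = [7, 3], [[3, 0], [4, 3]]
print(round(gini(parent), 2),  round(weighted(gini, children), 3))
# 0.42 -> 0.343 (the slide rounds to 0.342): Gini decreases, so it rewards the split
print(round(error(parent), 2), round(weighted(error, children), 3))
# 0.3 -> 0.3: misclassification error is unchanged, so it sees no benefit
```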
Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Stopping Criteria for Tree Induction
Stop expanding a node when all the
records belong to the same class
Stop expanding a node when all the
records have similar attribute values
Early termination (to be discussed later)
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets