Classification and Prediction
What is classification? What is regression?
Issues regarding classification and prediction
Classification by decision tree induction
Scalable decision tree induction
Classification vs. Prediction
Classification:
predicts categorical class labels
constructs a model from the training set and the class labels of a classifying attribute, then uses the model to classify new data
Regression (numeric prediction):
models continuous-valued functions, i.e., predicts unknown or missing numeric values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Why Classification? A motivating application
Credit approval
A bank wants to classify its customers according to whether they are expected to pay back their approved loans
The history of past customers is used to train the classifier
The classifier provides rules that identify potentially reliable future customers
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating
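The credit rule can be written as a small function, as a minimal sketch. The function name `credit_rating` is hypothetical, and because the slide gives only one rule, returning "fair" for every other customer is an assumption made for illustration.

```python
def credit_rating(age, income):
    """Apply the single classification rule given in the credit approval example."""
    # Rule from the slide: age in 31...40 and income = high -> excellent.
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    # Assumption: no other rule is given, so "fair" stands in as the default.
    return "fair"


paul = credit_rating(35, "high")
john = credit_rating(20, "medium")
print("Paul:", paul)
print("John:", john)
```

A real classifier would carry one rule per class, or an explicit default learned from the training set.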
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the model's classification
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
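The reason for an independent test set can be seen with a deliberately over-fitted learner. The sketch below is illustrative only: `train_memorizer` is a hypothetical learner that memorizes its training samples, and the toy data are invented for the example, not taken from the slides.

```python
from collections import Counter


def accuracy(predict, samples):
    """Fraction of (features, label) samples that predict() labels correctly."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def train_memorizer(training_set):
    """Hypothetical over-fitted learner: a lookup table of the training samples."""
    table = dict(training_set)
    # Unseen feature combinations fall back to the most common training label.
    fallback = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: table.get(features, fallback)


# Hypothetical toy data: (age, income) -> credit rating.
training_set = [
    ((35, "high"), "excellent"),
    ((20, "medium"), "fair"),
    ((45, "low"), "fair"),
]
test_set = [
    ((33, "high"), "excellent"),
    ((25, "medium"), "fair"),
]

model = train_memorizer(training_set)
print("accuracy on training set:", accuracy(model, training_set))
print("accuracy on test set:", accuracy(model, test_set))
```

Measured on its own training set the memorizer looks perfect; only the independent test set reveals that it has not generalized.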
Classification Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)
Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no
Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
Testing Data → Classifier → Accuracy = ?
Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data:
(Jeff, Professor, 4) → Tenured?
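The two questions on this slide (what is the accuracy, and is Jeff tenured?) can be answered by applying the learned rule to the testing data. This is a minimal sketch: the function name `predict_tenured` is hypothetical, and the rule and data rows are copied from the slides.

```python
def predict_tenured(rank, years):
    """Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Mellisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's prediction.
correct = sum(
    predict_tenured(rank, years) == actual
    for _, rank, years, actual in testing_data
)
accuracy = correct / len(testing_data)
jeff = predict_tenured("Professor", 4)

print("accuracy:", accuracy)
print("Jeff tenured?", jeff)
```

The rule misclassifies only Mellisa (7 years, not tenured), so it scores 3 out of 4 on the testing data and predicts that Jeff is tenured.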
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
generalize: numerical attribute income → categorical {low, medium, high}
normalize: scale all numerical attributes to [0,1)
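Both transformations can be sketched in a few lines. The cutoffs in `discretize_income` are hypothetical values chosen for illustration, and min-max scaling is only one common choice of normalization, assumed here. Note that min-max scaling maps the largest value to exactly 1, so a strictly half-open range such as [0,1) would need a slightly larger divisor.

```python
def discretize_income(income, low_cutoff=30_000, high_cutoff=70_000):
    """Generalize a numerical income to {low, medium, high}.

    The cutoffs are hypothetical; the slides do not specify them.
    """
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20_000, 50_000, 90_000]
print([discretize_income(x) for x in incomes])
print(min_max_normalize([2, 4, 6]))
```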
Issues regarding classification and prediction (2): Evaluating Classification Methods
Predictive accuracy
Speed
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Goodness of rules (quality)
decision tree size
compactness of classification rules
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training Dataset
This follows an example from Quinlan’s ID3
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
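Recursive partitioning on this training set can be sketched as follows. The slides do not state which attribute selection measure is used, so the sketch assumes information gain, the measure associated with ID3; the function names are hypothetical, pruning is omitted, and the age range is written in ASCII as "31...40".

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Rows copied from the training dataset; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(rows):
    """Entropy (in bits) of the class label distribution in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def partition(rows, index):
    """Group rows by the value of the attribute at position index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups


def information_gain(rows, index):
    """Reduction in entropy obtained by splitting rows on one attribute."""
    groups = partition(rows, index).values()
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups)
    return entropy(rows) - remainder


def build_tree(rows, candidates):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = Counter(row[-1] for row in rows)
    if len(labels) == 1 or not candidates:
        return labels.most_common(1)[0][0]
    best = max(candidates, key=lambda i: information_gain(rows, i))
    remaining = [i for i in candidates if i != best]
    branches = {
        value: build_tree(group, remaining)
        for value, group in partition(rows, best).items()
    }
    return {ATTRIBUTES[best]: branches}


tree = build_tree(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```

Under these assumptions the root test is age, the "<=30" branch splits on student, the ">40" branch splits on credit_rating, and the "31...40" branch is a pure "yes" leaf.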
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
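Classifying an unknown sample means testing its attribute values against the tree from the root down to a leaf. The sketch below encodes the "buys_computer" tree as nested dictionaries, which is an assumed representation, and the sample customers are hypothetical.

```python
# The buys_computer tree: {attribute: {value: subtree or class label}}.
TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}


def classify(tree, sample):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        # Each internal node holds exactly one attribute test.
        attribute, branches = next(iter(node.items()))
        node = branches[sample[attribute]]
    return node


# Hypothetical unknown samples.
young_student = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
senior_excellent = {"age": ">40", "income": "low", "student": "no", "credit_rating": "excellent"}

print(classify(TREE, young_student))
print(classify(TREE, senior_excellent))
```

Only the attributes on the path from root to leaf are examined; income never appears in this tree, so its value does not affect the result.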
Scalable Decision Tree Induction Methods
SLIQ (EDBT’96, Mehta et al.)
Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96, J. Shafer et al.)
Constructs an attribute list data structure
PUBLIC (VLDB’98, Rastogi & Shim)
Integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99, Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
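The AVC-list idea from RainForest can be illustrated as a table of counts keyed by (attribute, value, class label). The sketch below is only a rough in-memory illustration, not the RainForest implementation; the function name `avc_sets` is hypothetical, and the rows are the first four tuples of the buys_computer training set.

```python
from collections import Counter


def avc_sets(rows, attributes, class_index=-1):
    """Return {attribute: Counter({(value, class_label): count})}."""
    sets = {name: Counter() for name in attributes}
    for row in rows:
        for position, name in enumerate(attributes):
            sets[name][(row[position], row[class_index])] += 1
    return sets


attributes = ["age", "income", "student", "credit_rating"]
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
]

avc = avc_sets(rows, attributes)
for name, counts in avc.items():
    print(name, dict(counts))
```

Counts of this kind are enough to evaluate a candidate split on an attribute, and they are typically much smaller than the tuples they summarize.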