BIG DATA ANALYTICS
Lecture 9 --- Week 10
Content
Classification versus Regression
Supervised vs. Unsupervised Learning
Evaluating Predictive Models
Supervised Learning Algorithms
Model Evaluation
Classification vs Regression
Classification:
predicts categorical class labels
constructs a model from the training set and the class labels of a
classifying attribute, and uses that model to classify new data.
Regression:
models continuous-valued functions, i.e., predicts unknown or missing
values.
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification – A Motivating
Application
Credit approval
A bank wants to classify its customers based on whether they
are expected to pay back their approved loans
The history of past customers is used to train the classifier
The classifier provides rules, which identify potentially reliable
future customers
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test samples is compared with the classified result
from the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
Test set is independent of training set, otherwise over-fitting will occur
Classification Process (1): Model
Construction
Training Data → Classification Algorithm → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
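The learned rule can be sketched as a simple classifier in Python. This is a minimal illustration; the function name and data layout are assumptions, while the table and rule come from the slide.

```python
# A sketch of the learned model as a Python function; the rule is
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
# The rule reproduces every label in the training set
predictions = [predict_tenured(rank, years) for _, rank, years, _ in training]
```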
Classification Process (2): Use the
Model in Prediction
Classifier → Accuracy = ?

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
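Estimating accuracy on the test set can be sketched as follows, assuming the rule learned in the model-construction step (the function name and data layout are assumptions):

```python
# Accuracy estimate on the test set, assuming the learned rule
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Mellisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing)
accuracy_rate = correct / len(testing)  # Mellisa is misclassified -> 0.75
```

Note that the rule misclassifies Mellisa (years > 6 yet not tenured), which is exactly why accuracy must be measured on a test set independent of the training data.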
Classification (Training Phase)
In the first step, a classifier is built describing a predetermined set of data
classes or concepts.
This is the learning step (or training phase), where a classification
algorithm builds the classifier by analyzing or “learning from” a training set
made up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1,
x2, …, xn), depicting n measurements made on the tuple from n database
attributes, respectively, A1, A2, …, An.
Each tuple, X, is assumed to belong to a predefined class as determined by
another database attribute called the class label attribute.
The class label attribute is discrete-valued and unordered.
It is categorical (or nominal) in that each value serves as a category or class.
The individual tuples making up the training set are referred to as training
tuples and are randomly sampled from the database under analysis. In the
context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects.
This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f (X), that can predict the associated class label y of a
given tuple X.
In this view, we wish to learn a mapping or function that separates the data
classes.
This mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
For example, the mapping may be represented as classification rules that
identify loan applications as being either safe or risky.
The rules can be used to categorize future data tuples, as well as
provide deeper insight into the data contents.
They also provide a compressed data representation.
Classification (Testing Phase)
In the second step, the model is used for classification.
First, the predictive accuracy of the classifier is estimated.
If we were to use the training set to measure the classifier’s accuracy,
this estimate would likely be optimistic, because the classifier tends
to overfit the data (i.e., during learning it may incorporate some
particular anomalies of the training data that are not present in the
general data set overall).
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Evaluating Predictive Models
Predictive Accuracy
Speed
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules (quality)
True Positives, True Negatives, False Negatives, False Positives
compactness of classification rules
Supervised Learning Algorithms
Artificial Neural Network
Linear Regression
Support Vector Machine
Artificial Neural Networks
Perceptron
Developed by Frank Rosenblatt using the McCulloch and Pitts model, the
perceptron is the basic operational unit of artificial neural networks. It
employs a supervised learning rule and is able to classify the data into
two classes.
Operational characteristics of the perceptron: it consists of a single
neuron with an arbitrary number of inputs and adjustable
weights, and the output of the neuron is 1 or 0 depending on a
threshold. It also has a bias input whose value is always 1.
Following figure gives a schematic representation of the perceptron.
Perceptron thus has the following three basic elements −
Links − A set of connection links, each carrying a weight; this includes
a bias link whose input value is always 1.
Adder − It adds the input after they are multiplied with their
respective weights.
Activation function − It limits the output of neuron. The most basic
activation function is a Heaviside step function that has two possible
outputs. This function returns 1, if the input is positive, and 0 for any
negative input.
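The three elements above can be sketched in pure Python. This is a minimal illustration; the AND-gate data, learning rate, and epoch count are assumptions, not part of the lecture:

```python
# A minimal perceptron sketch: a single neuron with adjustable weights,
# a bias input fixed at 1, and a Heaviside step activation, trained with
# the perceptron learning rule on the (linearly separable) AND function.
def step(x):
    # Heaviside step activation: 1 for non-negative input, 0 otherwise
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]  # adjustable weights on the input links
    b = 0.0         # weight on the bias link (the bias input itself is 1)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            y = step(w[0] * x1 + w[1] * x2 + b * 1)  # adder + activation
            err = target - y
            w[0] += lr * err * x1  # perceptron learning rule
            w[1] += lr * err * x2
            b += lr * err
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
```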
Linear Regression
Linear regression may be defined as the statistical model that
analyzes the linear relationship between a dependent variable and a
given set of independent variables. A linear relationship between
variables means that when the value of one or more independent
variables changes (increases or decreases), the value of the dependent
variable changes accordingly (increases or decreases).
Mathematically the relationship can be represented with the help of
following equation −
Y = mX + b
Here, Y is the dependent variable we are trying to predict,
X is the independent variable we are using to make predictions,
m is the slope of the regression line, which represents the effect X
has on Y,
b is a constant, known as the Y-intercept. If X = 0, Y would be
equal to b.
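Fitting m and b by ordinary least squares can be sketched as follows (the data points are hypothetical; the closed-form slope is m = cov(X, Y) / var(X) and the intercept is b = mean(Y) − m · mean(X)):

```python
# A minimal least-squares sketch for Y = mX + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of X and Y divided by variance of X
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x  # intercept
    return m, b

# points lying exactly on Y = 2X + 1
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```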
Positive Linear Relationship
A linear relationship is called positive if the dependent variable
increases as the independent variable increases. It can be understood
with the help of the following graph −
Negative Linear relationship
A linear relationship is called negative if the dependent variable
decreases as the independent variable increases. It can be understood
with the help of the following graph −
Support Vector Machines
An SVM model is basically a representation of different classes in a
hyper-plane in multidimensional space. The hyper-plane will be
generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes to
find a maximum marginal hyper-plane (MMH).
The followings are important concepts in SVM −
Support Vectors − Data points that are closest to the hyper-plane are
called support vectors. The separating line is defined with the help of
these data points.
Hyper-plane − A decision plane or space that divides a set of objects
belonging to different classes.
Margin − The gap between the two lines through the
closest data points of different classes. It can be calculated as the
perpendicular distance from the line to the support vectors. A large
margin is considered a good margin and a small margin is
considered a bad margin.
The main goal of SVM is to divide the datasets into classes to find a
maximum marginal hyper-plane (MMH) and it can be done in the
following two steps −
First, SVM will generate hyper-planes iteratively that segregate
the classes in the best way.
Then, it will choose the hyper-plane that separates the classes
with the maximum margin.
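The two steps above can be sketched as a rough linear SVM in pure Python. This is not a production implementation: it trains by stochastic sub-gradient descent on the hinge loss with L2 regularization (a Pegasos-style scheme), and the 2-D dataset, learning rate, and epoch count are all assumptions.

```python
# A rough linear-SVM sketch: iteratively adjust the hyper-plane w.x + b = 0
# so that points are pushed outside a margin of 1 on the correct side.
def train_linear_svm(points, labels, epochs=500, lr=0.01, lam=0.01):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # inside the margin: hinge loss is active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # outside the margin: only regularization shrinks w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

pts = [(0, 0), (1, 0), (3, 3), (4, 3)]  # two linearly separable groups
ys = [-1, -1, 1, 1]                     # labels in {-1, +1}
w, b = train_linear_svm(pts, ys)
```

The points closest to the resulting hyper-plane play the role of the support vectors; the regularization term is what drives the margin toward its maximum.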
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Metrics for Performance Evaluation…

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
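Computing accuracy from confusion-matrix counts is a one-liner; the counts below are hypothetical:

```python
# Accuracy from confusion-matrix counts: correct predictions (TP + TN)
# over all predictions (TP + TN + FP + FN).
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=50, fn=10, fp=5, tn=35)  # (50 + 35) / 100 = 0.85
```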
Methods of Estimation
Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
One sample may be biased -- Repeated holdout
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the
remaining one
Leave-one-out: k=n
Guarantees that each record is used the same number of times for training
and testing
Bootstrap
Sampling with replacement
~63% of records used for training, ~37% for testing
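The k-fold scheme above can be sketched as follows (a minimal illustration; the index-based partitioning is an assumption about how one might implement it):

```python
# k-fold cross-validation sketch: partition record indices into k disjoint
# folds; each split trains on k-1 folds and tests on the remaining one.
def k_fold_splits(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint index sets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n=9, k=3))
```

Each record appears in exactly one test fold and in k−1 training sets, which is the guarantee stated above; setting k = n gives leave-one-out.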
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to
analyze noisy signals
Characterize the trade-off between positive hits and false
alarms
ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)

TPR = TP / (TP + FN)
Fraction of positive instances predicted as positive

FPR = FP / (FP + TN)
Fraction of negative instances predicted as positive
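A single ROC point can be computed directly from confusion-matrix counts; the counts below are hypothetical:

```python
# One point on the ROC curve from confusion-matrix counts.
def roc_point(tp, fn, fp, tn):
    tpr = tp / (tp + fn)  # fraction of positive instances predicted positive
    fpr = fp / (fp + tn)  # fraction of negative instances predicted positive
    return fpr, tpr

fpr, tpr = roc_point(tp=40, fn=10, fp=5, tn=45)
```

Sweeping the classifier's decision threshold and plotting one such (FPR, TPR) point per threshold traces out the full ROC curve.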