FEM 2063 - Data Analytics
CHAPTER 4: Classifications
4.1 Logistic Regression
4.2 Naïve Bayes
4.3 Discriminant Analysis
Overview
➢Logistic Regression
➢Naïve Bayes
➢Discriminant Analysis
➢Linear Discriminant Analysis
➢Quadratic Discriminant Analysis
Classification
Supervised learning or classification: attribution of a class or label to an
observation by exploiting the availability of a training set (labeled data)
Unsupervised learning or clustering: representation of input data in
clusters/classes based on some inherent similarity measures (no training set)
Classification Performance

TP : True Positive    FP : False Positive
TN : True Negative    FN : False Negative

Specificity = TN / (TN + FP)
Sensitivity = TP / (TP + FN) = recall = r
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Confusion Matrix
                           Actual class
                           P      N
Predicted class      P     TP     FP
                     N     FN     TN
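A minimal sketch (not from the slides) of computing these metrics in Python with scikit-learn's confusion_matrix; the labels below are made up for illustration.

# Sketch: sensitivity, specificity and accuracy from actual vs predicted labels
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # hypothetical actual classes (1 = positive)
y_pred   = np.array([1, 0, 0, 0, 1, 1, 1, 0])   # hypothetical predicted classes

# for binary labels [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()

sensitivity = tp / (tp + fn)                    # = recall
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)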
Overview
➢Logistic Regression
➢Naïve Bayes
➢Discriminant Analysis
➢Linear Discriminant Analysis
➢Quadratic Discriminant Analysis
G. James, D. Witten, T. Hastie, R. Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer,
ISBN 978-1-4614-7137-0, ISBN 978-1-4614-7138-7 (eBook)
Logistic Regression
Example classifications
• Qualitative variables take values in an unordered set C, such as:
eye color {brown, blue, green}
email {spam, ham}.
• Given a feature vector X and a qualitative response Y taking values in the set C, the
classification task is to build a function C(X) that takes as input the feature vector X and
predicts its value for Y.
• Often we are more interested in estimating the probabilities that X belongs to each category in C.
• For example, the probability that an insurance claim is fraudulent is more useful than a hard classification of fraudulent or not.
Logistic Regression
Example: Credit Card Default
Simulated Default data set: Individuals with annual income, monthly credit card balance, default on
payment (Yes or No)
Objective: predict whether an individual will default on his or her credit card payment based on
annual income and monthly credit card balance
Logistic Regression
Example: Credit Card Default
Can we use Linear Regression?
• Suppose for the Default classification task that we code Y = 0 if No and Y = 1 if Yes.
• Can we simply perform a linear regression of Y on X and classify as Yes if Ŷ > 0.5?
Logistic Regression
Can we use Linear Regression?
For example, Default in terms of Balance:

Default   Income   Balance
1         25000    3000
0         35000    1000
1         23000    2700
0         28000    1500
0         30000    1200
0         26000    1400
1         43000    4200
0         34000    580
1         42000    7390
0         26000    245
0         23000    1970
1         29000    2845
1         31000    4656
1         42000    5800
1         30000    900
Logistic Regression
Can we use Linear Regression?
Linear regression does not estimate Pr(Y = 1|X) well: it can produce negative values, or values greater than 1, as probabilities!
Logistic Regression
What could be used?
Needed: a function that takes values between 0 and 1.
For example, the logistic function → logistic regression.
Logistic Regression
• Let's write p(X) = Pr(Y = 1|X) for short and consider using balance to predict default in the previous example.
• Logistic regression uses the form

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

After rearrangement:

log( p(X) / (1 − p(X)) ) = β0 + β1X

• This monotone transformation is called the log odds or logit transformation of p(X).
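A minimal sketch (with illustrative, not fitted, coefficients) showing that the logistic function always returns values in (0, 1) and that the logit transform recovers the linear form β0 + β1X.

# Sketch: logistic function and its logit (log-odds) inverse,
# using illustrative coefficients (not fitted values)
import numpy as np

beta0, beta1 = -4.0, 0.002          # hypothetical coefficients
x = np.array([0, 1000, 2000, 3000])

p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))   # p(X), always in (0, 1)
log_odds = np.log(p / (1 - p))                                    # equals beta0 + beta1 * x
print(p)
print(log_odds)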
Logistic Regression
Maximum Likelihood
• We use maximum likelihood to estimate the parameters β0 and β1:

ℓ(β0, β1) = Π{i: yi = 1} p(xi) · Π{i: yi = 0} (1 − p(xi))

• This likelihood gives the probability of the observed zeros and ones in the data. We choose β0 and β1 to maximize the likelihood of the observed data.
Logistic Regression
Maximum Likelihood
Example: Flipping a coin. The probability to get a head (H) is p
(- - > probability to get a tail (T) will be (1-p)).
After 3 flips the following result is obtained : HTH
Use maximum likelihood to estimate the parameter p, that best fits the data.
L(p) = p(1 − p)p = p² − p³
L′(p) = 2p − 3p² = p(2 − 3p);  L′(p) = 0 for p = 0 or p = 2/3
p = 2/3 is the value of p that best fits the data
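A quick numerical check of this example: evaluate L(p) = p²(1 − p) on a grid and confirm the maximum sits near p = 2/3.

# Sketch: numerical check that L(p) = p^2 (1 - p) is maximized near p = 2/3
import numpy as np

p = np.linspace(0, 1, 1001)
L = p**2 * (1 - p)
print(p[np.argmax(L)])   # ≈ 0.667 ≈ 2/3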
Logistic Regression
Maximum Likelihood
Using software (e.g. Python or R) to fit the logistic regression model.
Example of output:
Logistic Regression
Making Predictions
• What is our estimated probability of default for someone with a balance of $1000?
• With a balance of $2000?
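A sketch of the prediction step, using the coefficient estimates reported for the full Default data set in the referenced textbook (β0 ≈ −10.6513, β1 ≈ 0.0055); these are not the coefficients fitted on the small 15-row table used later.

# Sketch: predicted probability of default, using the coefficient estimates
# reported in James et al. for the Default data (beta0 ≈ -10.6513, beta1 ≈ 0.0055)
import numpy as np

beta0, beta1 = -10.6513, 0.0055

def p_default(balance):
    z = beta0 + beta1 * balance
    return np.exp(z) / (1 + np.exp(z))

print(p_default(1000))   # ≈ 0.006  (about 0.6%)
print(p_default(2000))   # ≈ 0.586  (about 59%)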
Logistic Regression
Example 1: Default in terms of Balance
Using the Default/Income/Balance table shown earlier:
• Use the first 10 rows to build the model.
• Check the performance of the model using the last 5 rows.
Logistic Regression
Example 1: Default in terms of Balance
# input the training (first 10) and testing (last 5) data
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([1970, 2845, 4656, 5800, 900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])

# fit the logistic regression model
model = LogisticRegression()
model.fit(x, y)
beta0 = model.intercept_
beta1 = model.coef_

# predict the test labels and get the accuracy
ypred = model.predict(xtest)
model.score(xtest, ytest)
Logistic Regression
Example 1: Default in terms of Balance
xtest = np.array([1970, 2845, 4656, 5800, 900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])
ypred=model.predict(xtest): [0 1 1 1 0]
Accuracy=4/5=80%
Logistic Regression
Example 2: Default in terms of Income
#input the training (first 10) and testing (last 5) data
x = np.array([25000, 35000, 23000, 28000, 30000, 26000, 43000, 34000, 42000, 26000]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([23000, 29000, 31000, 42000, 30000]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])
model = LogisticRegression()
model.fit(x, y)
beta0=model.intercept_
beta1=model.coef_
ypred=model.predict(xtest)
#get the accuracy
model.score(xtest, ytest)
Logistic Regression
Example 2: Default in terms of Income
ytest = np.array([0, 1, 1, 1, 1])
ypred=model.predict(xtest): [0 0 0 0 0]
Accuracy=1/5=20%
Logistic Regression
More than 2 independent variables
The model extends directly to several predictors X1, …, Xp:

log( p(X) / (1 − p(X)) ) = β0 + β1X1 + … + βpXp

Example of output:
Logistic Regression
Example 3: Default in terms of Income and Balance
x = np.array([[25000,3000],[35000,1000],[23000,2700], [28000,1500], [30000,1200], [26000,1400],
[43000,4200],[34000,580],[42000,7390], [26000,245]])
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([[23000,1970], [29000,2845],[31000,4656],[42000,5800],[30000,900]])
ytest = np.array([0, 1, 1, 1, 1])
model = LogisticRegression()
model.fit(x, y)
beta0=model.intercept_
beta1=model.coef_
ypred=model.predict(xtest)
#get the accuracy
model.score(xtest, ytest)
Logistic Regression
Example 3: Default in terms of Income and Balance
ytest = np.array([0, 1, 1, 1, 1])
ypred=model.predict(xtest): [1 1 1 1 0]
Accuracy=3/5=60%
Logistic Regression
Example: South African Heart Disease
• 160 cases of MI (myocardial infarction) and 302 controls (all male, in the age range 15–64), from the Western Cape, South Africa, in the early 1980s.
• Overall prevalence very high in this region: 5.1%.
• The goal is to identify the relative strengths and directions of the risk factors.
Logistic Regression
Example: South African Heart Disease
Logistic Regression
Example: South African Heart Disease
Scatterplot matrix of the South African Heart Disease data.
The cases (MI) are red, the controls turquoise.
Logistic Regression
Example: South African Heart Disease
Example of output
Logistic Regression
More than two classes
• A patient presents at the emergency room, and we must classify them according to their symptoms as one of stroke, drug overdose, or epileptic seizure.
• Coding these as Y = 1 (stroke), 2 (drug overdose), 3 (epileptic seizure) suggests an ordering; it implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.
• Use multiclass logistic regression instead.
Logistic Regression
More than two classes
• Logistic regression is easily generalized to more than two classes.
• Multiclass logistic regression is also referred to as multinomial regression.
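A minimal sketch of multiclass logistic regression with scikit-learn on made-up data with three classes; with more than two labels and the default solver, LogisticRegression fits a multinomial model.

# Sketch: multinomial (multiclass) logistic regression on made-up data
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [1.5], [2.0], [5.0], [5.5], [6.0], [9.0], [9.5], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])      # three class labels

model = LogisticRegression()                    # multinomial model for >2 classes (default solver)
model.fit(X, y)
print(model.predict([[1.2], [5.2], [9.8]]))     # expected: [0 1 2]
print(model.predict_proba([[5.2]]))             # one probability per class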
Logistic Regression
More than two classes
Other option:
• Apply two-class logistic regression twice, as sketched after this list.
• For example, first (stroke or drug overdose) vs epileptic seizure, followed by stroke vs drug overdose.
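A hedged sketch of this two-stage idea, with a hypothetical coding (0 = stroke, 1 = drug overdose, 2 = epileptic seizure) and made-up data.

# Sketch: two-stage approach with two binary logistic regressions
# (hypothetical coding: 0 = stroke, 1 = drug overdose, 2 = epileptic seizure)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [0.7], [0.9], [2.0], [2.2], [2.4], [4.0], [4.2], [4.4]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Stage 1: (stroke or drug overdose) vs epileptic seizure
stage1 = LogisticRegression().fit(X, (y == 2).astype(int))

# Stage 2: stroke vs drug overdose, trained only on those two classes
mask = y != 2
stage2 = LogisticRegression().fit(X[mask], y[mask])

def predict(x_new):
    x_new = np.asarray(x_new).reshape(1, -1)
    if stage1.predict(x_new)[0] == 1:            # predicted epileptic seizure
        return 2
    return stage2.predict(x_new)[0]              # 0 = stroke, 1 = drug overdose

print([predict([v]) for v in (0.6, 2.1, 4.3)])   # expected: [0, 1, 2]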
FEM 2063 - Data Analytics
CHAPTER 4: Classifications
4.1 Logistic Regression
4.2 Naïve Bayes
4.3 Discriminant Analysis
Naïve Bayes
Learning objectives:
Understand the Naïve Bayes classifier
Some references:
http://www3.cs.stonybrook.edu/~cse634/ch6book.pdf
https://www3.cs.stonybrook.edu/~cse634/T14.pdf
Naïve Bayes
Example “Antenna Length” of insects
Naïve Bayes
Example
Histogram of “Antenna Length”
Naïve Bayes
Example
The histograms and the corresponding normal distributions
Naïve Bayes
Example
• If an antenna is 3 units long, to which kind of insect does it belong?
• Is it more probable to be a grasshopper or a katydid?
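A hedged sketch of the comparison, assuming illustrative class means and standard deviations (not the values behind the slide's histograms) and equal priors; classify to the class whose normal density is larger at x = 3.

# Sketch: compare the two class densities at antenna length 3
# (means/standard deviations below are hypothetical, priors assumed equal)
from scipy.stats import norm

x = 3.0
p_x_given_grasshopper = norm.pdf(x, loc=4.0, scale=1.5)   # hypothetical parameters
p_x_given_katydid = norm.pdf(x, loc=7.0, scale=1.5)       # hypothetical parameters

print(p_x_given_grasshopper, p_x_given_katydid)
print('grasshopper' if p_x_given_grasshopper > p_x_given_katydid else 'katydid')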
Naïve Bayes
The Naïve Bayes (also called Bayes, Idiot's, or Simple Bayes) classifier is a statistical classifier.
It performs probabilistic prediction.
Idea: find the probability that a previously unseen instance belongs to each class, then assign it to the class with the highest probability.
Bayes Classifiers
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• The classification task is to determine P(H|X), the probability that the hypothesis
holds given the observed data sample X
• P(H) : prior probability (initial probability)
• P(X): probability that sample data is observed
• P(X|H) : likelihood (the probability of observing the sample X, given that the hypothesis holds)
Bayes Classifiers
Given data X, the posterior probability of a hypothesis H, noted P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

Informally: posterior = likelihood × prior / evidence
Predict that X belongs to class Ci iff P(Ci|X) is the highest among the P(Ck|X) over all K classes
Bayes Classifiers
Example
P(H|X) = P(X|H) P(H) / P(X)
Bayes Classifiers
Example
Given a small database with names and sex, we can apply Bayes' theorem.
One attribute: Name
Bayes Classifiers
Example
P(H|X) = P(X|H) P(H) / P(X)
Bayes Classifiers
Example
Officer Drew is a female!
Bayes Classifiers
Suppose there are m classes Ci, i = 1…m, and let Hi be the hypothesis that X belongs to class Ci.

P(Hi|X) = P(X|Hi) P(Hi) / P(X)

Since P(X) is constant for all classes, only P(X|Hi) P(Hi) needs to be compared across classes.
Bayes Classifiers
More than one attribute
Example: Height, Eye Color, Hair Length
P(male | Height, Eye, Hair Length) ∝ P(Height, Eye, Hair Length | male) P(male)
The challenge is computing the joint probability P(Height, Eye, Hair Length | male)!
Bayes Classifiers
More than one attribute
Assumption: the attributes (Height, Eye Color, Hair Length) are conditionally independent given the class.
In that case:
P(Height, Eye, Hair Length | male) = P(Height | male) · P(Eye | male) · P(Hair Length | male)
Using the training set, all of these probabilities can be calculated and stored in a table.
Bayes Classifiers
Example

Sex       Over 170cm   Prob   Eye   Prob   Hair length   Prob
Male      Yes          1/3    Yes   …      Yes           …
          No           2/3    No    …      No            …
Female    Yes          2/5    Yes   …      Yes           …
          No           3/5    No    …      No            …
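A hedged sketch of classifying from such a table. Only the "over 170cm" probabilities (1/3 and 2/5 for yes) come from the table above; the eye, hair-length, and prior probabilities are hypothetical placeholders.

# Sketch: Naive Bayes from a stored conditional-probability table
# (only the p_over170 values come from the table above; the rest are hypothetical)
p_over170 = {'male': 1/3, 'female': 2/5}
p_blue_eye = {'male': 0.5, 'female': 0.4}       # hypothetical
p_long_hair = {'male': 0.1, 'female': 0.7}      # hypothetical
prior = {'male': 0.5, 'female': 0.5}            # hypothetical

# new person: over 170cm, blue eyes, long hair
# score(sex) = P(sex) * P(over170 | sex) * P(blue eye | sex) * P(long hair | sex)
scores = {s: prior[s] * p_over170[s] * p_blue_eye[s] * p_long_hair[s]
          for s in ('male', 'female')}
print(scores)
print(max(scores, key=scores.get))               # class with the largest score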
Bayes Classifiers
• Advantages:
  – Fast to train, fast to classify
  – Not sensitive to irrelevant features
  – Handles real and discrete data
  – Handles streaming data well
• Disadvantage: assumes independence of features
FEM 2063 - Data Analytics
CHAPTER 4: Classifications
4.3 Discriminant Analysis
Discriminant Analysis
Learning objectives:
Understand how Discriminant Analysis works as a classifier
G. James, D. Witten, T. Hastie, R. Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer,
ISBN 978-1-4614-7137-0, ISBN 978-1-4614-7138-7 (eBook)
Discriminant Analysis
Aim:
Let X represent the predictors (p of them) and Y the classes (K of them).
For a given input x of X, we aim to find the probability that it belongs to a particular class k of Y:
Pr(Y = k | X = x)
Discriminant Analysis
Idea:
• Model the distribution of X in each of the classes separately, and then use Bayes' theorem to obtain Pr(Y = k | X = x).
• Using normal (Gaussian) distributions for each class leads to linear or quadratic discriminant analysis.
• Remark: it could be done with other distributions.
Discriminant Analysis
Bayes theorem
Using the notation πk and fk(x) defined below, we get

Pr(Y = k | X = x) = πk fk(x) / Σl πl fl(x), where

fk(x) = Pr(X = x | Y = k) is the (normal) density for X in class k,
πk = Pr(Y = k) is the marginal or prior probability for class k.
Discriminant Analysis
Classify to the highest density
Discriminant Analysis
Linear Discriminant Analysis when there is only 1 predictor (p=1)
• The Gaussian (normal) density has the form

fk(x) = (1 / (√(2π) σk)) exp( −(x − μk)² / (2σk²) )

• Here μk is the mean and σk² the variance (in class k).
• We will assume that all the σk = σ are the same.
Discriminant Analysis
Linear Discriminant Analysis when there is only 1 predictor (p=1)
Plugging into Bayes' formula, we get:

pk(x) = Pr(Y = k | X = x) = πk exp( −(x − μk)² / (2σ²) ) / Σl πl exp( −(x − μl)² / (2σ²) )

(the common factor 1/(√(2π)σ) cancels from numerator and denominator)
Discriminant Analysis
Discriminant functions
• To classify the value X = x, we need to find the k which gives the largest pk(x).
• After simplification, this is equivalent to finding the k with the largest discriminant score:

δk(x) = x · μk/σ² − μk²/(2σ²) + log(πk)

Note that δk(x) is a linear function of x.
Discriminant Analysis
The decision boundary is the boundary where all classes have the same discriminant score.
Example: If K = 2 (2 classes) and we assume that π1 = π2 = 0.5, then the decision boundary is at

x = (μ1 + μ2) / 2
Discriminant Analysis
Example of decision boundaries:
Discriminant Analysis
Parameter Estimation
n : size of the training set;  nk : size of class k in the training set

π̂k = nk / n
μ̂k = (1/nk) Σ{i: yi = k} xi
σ̂² = (1/(n − K)) Σk Σ{i: yi = k} (xi − μ̂k)²
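A minimal sketch for p = 1: estimate the priors, class means, and pooled variance from a tiny made-up sample, then compute the discriminant scores δk(x) for a new observation.

# Sketch: estimate LDA parameters (p = 1) from made-up data and score a new x
import numpy as np

x = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])    # hypothetical training data
y = np.array([0, 0, 0, 1, 1, 1])                # class labels (K = 2)

n, K = len(x), 2
pi_hat = np.array([np.mean(y == k) for k in range(K)])    # n_k / n
mu_hat = np.array([x[y == k].mean() for k in range(K)])   # class means
sigma2 = sum(((x[y == k] - mu_hat[k])**2).sum() for k in range(K)) / (n - K)  # pooled variance

def delta(x0):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x0 * mu_hat / sigma2 - mu_hat**2 / (2 * sigma2) + np.log(pi_hat)

x0 = 2.1
print(delta(x0), np.argmax(delta(x0)))           # classify x0 to the largest score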
Discriminant Analysis
Linear Discriminant Analysis when p > 1
X = (X1, …, Xp) is drawn from a multivariate Gaussian distribution with
• a class-specific mean vector μk
• a common covariance matrix Σ
Notation: X ~ N(μk, Σ)
Discriminant Analysis
Linear Discriminant Analysis when p > 1
The density:

f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
Discriminant Analysis
Linear Discriminant Analysis when p > 1
The discriminant function:

δk(x) = xᵀ Σ⁻¹ μk − (1/2) μkᵀ Σ⁻¹ μk + log πk

which is again a linear function of x.
Discriminant Analysis
Illustration: p = 2 and K = 3 classes
Decision boundaries
Solid : LDA
Dashed : Bayes
Discriminant Analysis
Illustration: Fisher's Iris Data
• 4 variables
• 3 species
• 50 samples/class
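A minimal sketch of fitting LDA to Fisher's iris data with scikit-learn (the data set ships with the library).

# Sketch: LDA on Fisher's iris data (4 variables, 3 species, 50 samples/class)
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))           # training accuracy
print(lda.transform(X).shape)    # at most K - 1 = 2 discriminant directions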
Discriminant Analysis
Fisher's Discriminant Plot
Discriminant Analysis
Other forms of Discriminant Analysis
• When the fk(x) are Gaussian densities with the same covariance matrix Σ in each class, this leads to linear discriminant analysis.
• With Gaussians but a different covariance matrix Σk in each class, we get quadratic discriminant analysis.
Discriminant Analysis
Quadratic Discriminant Analysis
The discriminant function:

δk(x) = −(1/2) (x − μk)ᵀ Σk⁻¹ (x − μk) − (1/2) log |Σk| + log πk

which is a quadratic function of x.
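A minimal sketch of QDA on the same iris data; unlike LDA, it estimates a separate covariance matrix Σk for each class.

# Sketch: QDA on the iris data (separate covariance matrix per class)
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)
print(qda.score(X, y))           # training accuracy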
Discriminant Analysis
Quadratic Discriminant Analysis
Decision boundaries
Green : QDA
Purple : Bayes
Black : LDA
Summary (Chapter 4)

Logistic Regression (LR):                logistic function; maximum likelihood
Naïve Bayes:                             independence of attributes
Linear Discriminant Analysis (LDA):      normal distribution; same covariance matrices
Quadratic Discriminant Analysis (QDA):   normal distribution; different covariance matrices