Supervised Learning:
Logistic Regression
Support Vector Machine
Decision Tree
Logistic regression: Introduction
• In classification, we seek to identify the categorical class Ck associated with
a given input vector x.
Examples: Vehicle features / budget: Buy / Not buy?   Online transactions: Fraudulent (Yes / No)?
By convention, 0 denotes the "Negative Class" and 1 the "Positive Class".
• In order to predict the correct value of y for a given value of x, we need (see the end-to-end sketch after the data table below):
1. Data (samples: pairs of x and y)
2. Model (a function to represent the relationship between x and y)
3. Cost function (how well our model approximates the training samples)
4. Optimization (find the model parameters that minimize the cost function)
Logistic regression- data
• In univariate logistic regression there is a single independent variable (x), and
the model applies the sigmoid function to a linear function of x to predict the
binary dependent variable (y).
Marks scored in entrance examination    Admitted / Not admitted to University
20                                      Not Admitted
60                                      Admitted
36                                      Admitted
32                                      Not Admitted
30                                      Not Admitted
80                                      Admitted
38                                      Admitted
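For orientation, here is a minimal end-to-end sketch (assuming NumPy and scikit-learn are available; it is not part of the original slides) that fits a univariate logistic regression to the marks data above, tying together the four ingredients listed earlier: data, model, cost function, and optimization (the solver handles the last two internally).

```python
# Minimal sketch: univariate logistic regression on the admission data above.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Data: marks scored vs. admission outcome (1 = Admitted, 0 = Not admitted)
X = np.array([[20], [60], [36], [32], [30], [80], [38]])
y = np.array([0, 1, 1, 0, 0, 1, 1])

# Model + cost + optimization: LogisticRegression fits a sigmoid of a linear
# function of the marks by minimizing the logistic (log) loss internally.
model = LogisticRegression()
model.fit(X, y)

# Prediction for a hypothetical new applicant scoring 45 marks.
print(model.predict([[45]]))        # predicted class label (0 or 1)
print(model.predict_proba([[45]]))  # estimated probabilities for each class
```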
Logistic regression: Hypothesis
• The hypothesis used in linear regression predicts continuous values, whereas the logistic
regression hypothesis should predict discrete values.
[Figure: Malignant? (1 = Yes, 0 = No) plotted against Tumor Size]
Threshold classifier output at 0.5:
If hθ(x) ≥ 0.5, predict "y = 1"
If hθ(x) < 0.5, predict "y = 0"
Logistic regression – decision boundary
Decision Boundary
[Figure: a linear decision boundary in the (x1, x2) plane separating the two classes]
Predict "y = 1" if θᵀx ≥ 0 (equivalently hθ(x) ≥ 0.5); the decision boundary is the set of points where θᵀx = 0.
Logistic regression – decision boundary
Non-linear decision boundaries
[Figure: a circular decision boundary of radius 1 about the origin in the (x1, x2) plane]
With polynomial features (e.g. x1², x2²), predict "y = 1" where θᵀ·(features) ≥ 0; here the boundary x1² + x2² = 1 is a circle.
Logistic regression - hypothesis
• The sigmoid function is also called a squashing function as its domain is
the set of all real numbers, and its range is (0, 1).
Need 0 ≤ hθ(x) ≤ 1, so the hypothesis is hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)) is the sigmoid function.
• For a given input, the hypothesis always predicts a value between 0 and 1:
if hθ(x) < 0.5 then predict y = 0
else if hθ(x) ≥ 0.5 then predict y = 1
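A minimal NumPy sketch of this hypothesis: the sigmoid squashes θᵀx into (0, 1), and the 0.5 threshold converts it into a 0/1 prediction. The parameter values below are illustrative assumptions, not fitted values.

```python
# Sketch of the logistic regression hypothesis h_theta(x) = g(theta^T x).
import numpy as np

def sigmoid(z):
    # Squashing function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x, threshold=0.5):
    # Predict y = 1 if h_theta(x) >= 0.5, else y = 0; also return h_theta(x).
    h = sigmoid(np.dot(theta, x))
    return (1 if h >= threshold else 0), h

theta = np.array([-4.0, 0.1])   # assumed parameters for illustration
x = np.array([1.0, 50.0])       # [bias term, feature value]
print(predict(theta, x))        # -> (1, ~0.73)
```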
Logistic regression – hypothesis
Training set of m examples: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, with y ∈ {0, 1} and hθ(x) = 1 / (1 + e^(−θᵀx)).
Logistic regression - hypothesis
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x
Example: if hθ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.
hθ(x) = P(y = 1 | x; θ), "the probability that y = 1, given x, parameterized by θ"
Logistic regression – cost function
Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0
If y = 1, the cost −log(hθ(x)) is zero at hθ(x) = 1; if y = 0, the cost −log(1 − hθ(x)) is zero at hθ(x) = 0.
Logistic regression – cost function
Logistic regression cost function:
J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]
To fit parameters θ: minimize J(θ) over θ.
To make a prediction given a new x: output hθ(x) = 1 / (1 + e^(−θᵀx)).
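The same cost function written out as a small sketch (the toy data and parameters are assumed; X carries a leading bias column):

```python
# Sketch of J(theta) = -(1/m) * sum[ y*log(h) + (1 - y)*log(1 - h) ].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)                       # h_theta(x) for every example
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 20.0], [1.0, 60.0], [1.0, 80.0]])   # bias column + feature
y = np.array([0, 1, 1])
theta = np.array([-4.0, 0.1])
print(cost(theta, X, y))
```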
Logistic regression - optimization
Gradient Descent
Want min over θ of J(θ). Repeat {
θⱼ := θⱼ − α (∂/∂θⱼ) J(θ)
} (simultaneously update all θⱼ)
Plugging in the derivative of the logistic cost, repeat {
θⱼ := θⱼ − α (1/m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾
} (simultaneously update all θⱼ)
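A from-scratch sketch of these updates, vectorized so that all θⱼ are updated simultaneously. The learning rate, iteration count, and data are illustrative assumptions; the (1/m) factor matches the cost function J(θ) defined on the previous slide.

```python
# Batch gradient descent for logistic regression:
# theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.001, iterations=10000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m        # gradient of J(theta)
        theta = theta - alpha * grad      # simultaneous update of all theta_j
    return theta

# Illustrative data (bias column first); values assumed for demonstration only.
X = np.array([[1.0, 20.0], [1.0, 30.0], [1.0, 60.0], [1.0, 80.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))
```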
Logistic regression – multiclass
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Logistic regression – multiclass
Binary classification vs. multi-class classification:
[Figure: two classes in the (x1, x2) plane (binary) vs. three classes (multi-class)]
Logistic regression – multiclass
One-vs-all (one-vs-rest):
[Figure: the three-class problem in the (x1, x2) plane is converted into three binary problems:
Class 1 vs. rest, giving hθ⁽¹⁾(x); Class 2 vs. rest, giving hθ⁽²⁾(x); Class 3 vs. rest, giving hθ⁽³⁾(x)]
Logistic regression – multiclass
One-vs-all
Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each
class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the
class i that maximizes hθ⁽ⁱ⁾(x).
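A minimal one-vs-all sketch, assuming scikit-learn: one binary logistic regression classifier per class, with prediction by the highest estimated probability. The data points are made up for illustration.

```python
# One-vs-all (one-vs-rest) logistic regression, implemented explicitly.
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y):
    classifiers = {}
    for c in np.unique(y):
        clf = LogisticRegression()
        clf.fit(X, (y == c).astype(int))    # class c vs. everything else
        classifiers[c] = clf
    return classifiers

def one_vs_all_predict(classifiers, x):
    # Pick the class i that maximizes h_theta^(i)(x) = P(y = i | x).
    probs = {c: clf.predict_proba([x])[0, 1] for c, clf in classifiers.items()}
    return max(probs, key=probs.get)

# Illustrative 2-D data with three classes (values assumed).
X = np.array([[1, 1], [1, 2], [5, 5], [6, 5], [1, 6], [2, 7]])
y = np.array([0, 0, 1, 1, 2, 2])
clfs = one_vs_all_fit(X, y)
print(one_vs_all_predict(clfs, [5.5, 5.0]))
```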
Supervised Learning: Regularization
Regularization
• Bias is the error from overly simple assumptions in the model; a high-bias model under-fits the training data.
• Variance is the error from sensitivity to the particular training set; a high-variance model over-fits, so its predictions on training data and on test data differ widely.
• In an ideal situation, we want to keep both bias and variance low, which is where regularization comes in.
Bias and Variance
Regularization – bias-variance tradeoff
Regularization
• Lasso Regression (Least Absolute Shrinkage and
Selection Operator) adds “absolute value of magnitude” of
coefficient as penalty term to the loss function.
• If lambda is zero then we get back Ordinary Least Squares, whereas a very
large value will drive the coefficients to zero and the model will under-fit.
• The key difference between these techniques is that Lasso shrinks the less
important features' coefficients to zero, thus removing some features
altogether. So, it works well for feature selection in case we have a huge
number of features.
Regularization
A regression model that uses the L1 regularization technique is called Lasso
Regression, and a model that uses L2 is called Ridge Regression.
• Ridge regression adds the "squared magnitude" of the coefficients as a
penalty term to the loss function (the L2 regularization element).
• If lambda is zero then we get back Ordinary Least Squares.
• If lambda is very large, it adds too much weight and leads to under-fitting.
Hence, it is important how lambda is chosen. This technique works very well
to avoid the over-fitting issue.
• As we can see from the formulas of L1 and L2 regularization, L1
regularization penalizes the cost function with the absolute values of the
weight parameters (Wj), while L2 regularization penalizes it with the squared
values of the weights (Wj).
• L1 regularization helps in feature selection by eliminating features that
are not important. This is helpful when the number of features is large.
• L1 regularization tends to produce estimates around the median of the data.
• L2 regularization, while minimizing the loss in the gradient calculation
step, tends to produce estimates around the mean (average) of the data.
Regularization
• L1 Regularization aka Lasso Regularization:
• This adds a regularization term to the model that is a function of the absolute values of
the coefficients (parameters).
• The coefficients can be driven to exactly zero during the regularization process. Hence
this technique can be used for feature selection and for generating a more parsimonious model.
• L2 Regularization aka Ridge Regularization:
• This adds a regularization term to the model that is a function of the squares of the
coefficients. Coefficients can approach zero but never become exactly zero, and hence L2
alone cannot be used for feature selection.
• Combination of the above two, such as Elastic Net:
• This adds regularization terms to the model that combine both L1 and
L2 regularization.
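A hedged sketch (assuming scikit-learn) comparing the three penalties on synthetic data with some irrelevant features; scikit-learn's alpha plays the role of lambda here. Lasso and Elastic Net tend to drive the irrelevant coefficients to (or towards) zero, while Ridge only shrinks them.

```python
# Comparing L2 (Ridge), L1 (Lasso), and Elastic Net on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the remaining three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```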
Support Vector Machines
SVM - Introduction
- Support Vector Machine (SVM) is a supervised learning algorithm developed by
Vladimir Vapnik; it was first introduced in 1992 by Boser, Guyon, and Vapnik
at COLT-92.
- Support Vector Machine (SVM) is a relatively simple supervised machine
learning algorithm used for classification and/or regression. It is preferred
for classification but is sometimes very useful for regression as well.
• Basically, SVM finds a hyper-plane that creates a boundary
between the types of data.
• The two possible results of each classifier are:
• The data point belongs to that class, OR
• The data point does not belong to that class.
SVM - Introduction
• The aim of a support vector machine algorithm is to find the best
possible line, or decision boundary, that separates the data points of
different data classes.
• This boundary is called a hyperplane when working in high-dimensional
feature spaces.
• The idea is to maximize the margin, which is the distance between the
hyperplane and the closest data points of each category, thus making it
easy to distinguish data classes.
• During the training phase, SVMs use a mathematical formulation to find
the optimal hyperplane in a higher-dimensional space, often called
the kernel space.
SVM - Introduction
- We are given a set of n points (vectors) x1, x2, …, xn such that each xi is a
vector of length m and each belongs to one of two classes, labelled "+1" and "−1".
- So our training set is (x1, y1), (x2, y2), …, (xn, yn), with xi ∈ ℝᵐ and
yi ∈ {+1, −1}, and the decision function will be f(x) = sign(w · x + b).
- We want to find a separating hyperplane w · x + b = 0
that separates these points into the two classes: "the positives" (class
"+1") and "the negatives" (class "−1"). (Assuming that they are linearly
separable.)
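A minimal sketch of this setup, assuming scikit-learn: fit a linear SVM on toy 2-D points labelled +1 / −1 and classify a new point with f(x) = sign(w · x + b). The data and the large C value (to approximate a hard margin) are assumptions for illustration.

```python
# Linear SVM: learn w, b of the separating hyperplane w.x + b = 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1 (illustrative points)
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)         # very large C ~ hard-margin behaviour
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([3.5, 3.0])
print(np.sign(w @ x_new + b))             # decision function f(x) = sign(w.x + b)
print(clf.support_vectors_)               # samples that define the hyperplane
```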
SVM – Separating Hyperplane
[Figure: points with yi = +1 and yi = −1 in the (x1, x2) plane, a separating hyperplane w · x + b = 0, and the classifier f(x) = sign(w · x + b)]
But there are many possibilities for such hyperplanes!
SVM – Separating Hyperplane
Which one should we choose? Yes, there are many possible separating hyperplanes; it could be this one, or this, or this, or maybe…!
SVM – Choosing a separating hyperplane
- Suppose we choose a hyperplane (seen below) that is close to some sample xi.
- Now suppose we have a new point x′ that should be in class "−1" and is close
to xi. Using our classification function f(x) = sign(w · x + b), this point is
misclassified!
Poor generalization! (Poor performance on unseen data.)
[Figure: a hyperplane passing close to xi, with the new point x′ falling on the wrong side]
SVM – Choosing a separating hyperplane
- The hyperplane should be as far as possible from any sample point.
- This way, new data that is close to the old samples will be classified
correctly.
Good generalization!
[Figure: a hyperplane far from both classes, with x′ now classified correctly near xi]
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case
- The SVM idea is to maximize the distance between the hyperplane
and the closest sample point.
- For the optimal hyperplane: the distance to the closest negative point
= the distance to the closest positive point.
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case
SVM's goal is to maximize the margin, which is twice the distance "d"
between the separating hyperplane and the closest sample.
[Figure: the margin of width 2d around the separating hyperplane, with the closest sample xi at distance d]
Why is it the best?
- It is robust to outliers, as we saw, and thus has strong generalization ability.
- It has proved itself to have better performance on test data, both in
practice and in theory.
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case
Support vectors are the samples closest to the separating hyperplane.
Oh! So this is where the name came from!
[Figure: the support vectors lying on the margin, at distance d on either side of the hyperplane]
We will see later that the optimal hyperplane is
completely defined by the support vectors.
SVM – Margin Decision Boundary
• The decision boundary should be as far away from
the data of both classes as possible.
• We should maximize the margin, m.
• The distance between the origin and the line wᵀx = −b is |b|/||w||; with the
canonical scaling |wᵀx + b| = 1 at the closest points, the margin is m = 2/||w||.
[Figure: Class 1 and Class 2 separated by the decision boundary, with margin m]
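A small follow-up sketch (assuming scikit-learn) showing how the margin can be read off a fitted, nearly hard-margin linear SVM as m = 2/||w||; the toy points are assumed.

```python
# Reading the margin m = 2 / ||w|| off a fitted linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [4, 4], [5, 4]])   # two small, separable classes
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)                 # width between the two margin lines
print(margin)
```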
SVM - Example
Decision Tree and Naïve Bayesian
Decision Tree
Issues Regarding Classification and Prediction
Preparing the Data for Classification and Prediction
• Data cleaning: This refers to the preprocessing of data in order
to remove or reduce noise and the treatment of missing values.
• Relevance analysis: Many of the attributes in the data may be
redundant. Correlation analysis can be used to identify whether
any two given attributes are statistically related.
• Data transformation and reduction: The data may be
transformed by normalization, particularly when neural
networks or methods involving distance measurements are used
in the learning step. Normalization involves scaling all values for
a given attribute so that they fall within a small specified range, such as
−1.0 to 1.0 or 0.0 to 1.0.
Comparing Classification and Prediction Methods
• Accuracy: The accuracy of a classifier refers to the ability of a
given classifier to correctly predict the class label of new or
previously unseen data (i.e., tuples without class label
information).
• Speed: This refers to the computational costs involved in
generating and using the given classifier or predictor.
• Robustness: This is the ability of the classifier or predictor to
make correct predictions given noisy data or data with missing
values.
• Scalability: This refers to the ability to construct the classifier
or predictor efficiently given large amounts of data.
Classification by decision tree Induction
• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected
attributes
• Tree pruning
• Identify and remove branches that reflect noise or
outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the
decision tree
Training Dataset
[Table: the buys_computer training data (14 tuples: 9 "yes", 5 "no") used in the worked example below]
Algorithm
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure
• Information gain (ID3/C4.5)
• All attributes are assumed to be categorical
• Can be modified for continuous-valued attributes
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and
n elements of class N
– The amount of information, needed to decide if an arbitrary
example in S belongs to P or N is defined as
I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into
sets {S1, S2 , …, Sv}
• If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subtrees Si is
E(A) = Σ_{i=1}^{v} ((pi + ni)/(p + n)) · I(pi, ni)
• The encoding information that would be gained by branching
on A is Gain(A) = I(p, n) − E(A).
Gain (A)
• In other words, Gain(A) tells us how much would be gained by
branching on A.
• It is the expected reduction in the information requirement
caused by knowing the value of A.
• The attribute A with the highest information gain, Gain(A), is
chosen as the splitting attribute at node N.
• This is equivalent to saying that we want to partition on the
attribute A that would do the "best classification":
Gain(A) = I(p, n) − E(A)
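A short sketch of these formulas in Python: info(p, n) computes I(p, n), and gain(p, n, partitions) computes Gain(A) = I(p, n) − E(A) from the per-value class counts of attribute A.

```python
# Entropy I(p, n) and information gain Gain(A) for a two-class problem.
import math

def info(p, n):
    total = p + n
    result = 0.0
    for k in (p, n):
        if k:                       # the term 0 * log2(0) is taken as 0
            result -= (k / total) * math.log2(k / total)
    return result

def gain(p, n, partitions):
    """partitions: list of (p_i, n_i) pairs, one per value of attribute A."""
    e_a = sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e_a

# Example: 9 positive / 5 negative tuples, split on age (counts from the slides).
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))   # about 0.25
```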
Information Gain (A)
But how can we compute the information
gain of an attribute that is continuous
valued, unlike in the example?
Suppose, instead, that we have an attribute A that is continuous-
valued, rather than discrete-valued.
(For example, suppose that instead of the discretized version of age
from the example, we have the raw values for this attribute.)
For such a scenario, we must determine the “best” split-point for
A, where the split-point is a threshold on A.
Information Gain in Decision Tree Induction
• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method.
• We refer to D as a data partition. Initially, it is the complete set of
training tuples and their associated class labels.
• The parameter attribute list is a list of attributes describing the
tuples.
• Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given tuples
according to class.
Attribute Selection Measures
• Is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples
into individual classes.
• If we were to split D into smaller partitions according to the
outcomes of the splitting criterion, ideally each partition would be
pure.
• Attribute selection measures are also known as splitting rules
because they determine how the tuples at a given node are to be
split.
Implementation
The class label attribute, buys computer, has two distinct values
(namely, yes, no)
Therefore, there are two distinct classes (that is, m = 2).
Let class C1 correspond to yes.
Let class C2 correspond to no.
There are
Nine tuples of class yes
Five tuples of class no.
A (root) node N is created for the tuples in D.
To find the splitting criterion for these tuples, we must compute
the information gain of each attribute.
Implementation
Information gain will be calculated using the following
formula
Gain(A) = Info(D) - InfoA(D)
Info(D) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
InfoA(D) = Σ_{i=1}^{v} ((pi + ni)/(p + n)) · Info(pi, ni)
Implementation
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14)
= 0.410 + 0.530
= 0.940
Now,
-> gain for age:
I(p1, n1) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
I(p2, n2) = −(4/4) log2(4/4) − (0/4) log2(0/4) = 0
I(p3, n3) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971
E(A) = ((2+3)/(9+5)) · 0.971 + ((4+0)/(9+5)) · 0 + ((3+2)/(9+5)) · 0.971 = 0.694
Gain(Age) = Info(D) − E(A) = 0.940 − 0.694 = 0.246
Implementation
Now,
-> gain of income:
I(p1, n1) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1
I(p2, n2) = −(4/6) log2(4/6) − (2/6) log2(2/6) = 0.918
I(p3, n3) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.8112
E(A) = ((2+2)/(9+5)) · 1 + ((4+2)/(9+5)) · 0.918 + ((3+1)/(9+5)) · 0.8112 = 0.911
Gain(Income) = Info(D) − E(A) = 0.940 − 0.911 = 0.029
Implementation
Now,
-> gain of student:
I(p1, n1) = −(6/7) log2(6/7) − (1/7) log2(1/7) = 0.592
I(p2, n2) = −(3/7) log2(3/7) − (4/7) log2(4/7) = 0.985
E(A) = ((6+1)/(9+5)) · 0.592 + ((3+4)/(9+5)) · 0.985 = 0.788
Gain(Student) = Info(D) − E(A) = 0.940 − 0.788 = 0.152
Implementation
Now,
-> gain of credit limit:
I(p1, n1) = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.8112
I(p2, n2) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1
E(A) = ((6+2)/(9+5)) · 0.8112 + ((3+3)/(9+5)) · 1 = 0.892
Gain(Credit_Limit) = Info(D) − E(A) = 0.940 − 0.892 = 0.048
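For reference, the self-contained sketch below recomputes all four gains from the class counts used above; small rounding differences against the hand-rounded values are expected.

```python
# Recomputing Info(D) and the four gains from the slide's class counts.
import math

def info(p, n):
    t = p + n
    return -sum((k / t) * math.log2(k / t) for k in (p, n) if k)

def gain(p, n, parts):
    return info(p, n) - sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in parts)

splits = {
    "age":          [(2, 3), (4, 0), (3, 2)],
    "income":       [(2, 2), (4, 2), (3, 1)],
    "student":      [(6, 1), (3, 4)],
    "credit_limit": [(6, 2), (3, 3)],
}
for name, parts in splits.items():
    print(name, round(gain(9, 5, parts), 3))
# age gives the largest gain, so it becomes the root split.
```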
Implementation
Since Gain(Age) is the maximum of the four gains, we choose Age as the
splitting attribute at the root.

Age = youth:
Income  Student  Credit   Class
High    No       Fair     No
High    No       Exclnt   No
Med     No       Fair     No
Low     Yes      Fair     Yes
Med     Yes      Exclnt   Yes

Age = senior:
Income  Student  Credit   Class
Med     No       Fair     Yes
Low     Yes      Fair     Yes
Low     Yes      Exclnt   No
Med     Yes      Fair     Yes
Med     No       Exclnt   No

Age = mid_age: all tuples belong to Class C1 (yes).
Implementation
The resulting decision tree:
AGE
  Age = youth   -> Student:       Yes -> Class C1,   No -> Class C2
  Age = mid_age -> Class C1
  Age = senior  -> Credit_limit:  fair -> Class C1,  excellent -> Class C2
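For comparison, here is a hedged sketch (assuming scikit-learn and pandas) of fitting such a tree automatically: categorical attributes are one-hot encoded and criterion="entropy" makes information gain drive the splits. The tiny data frame is illustrative only, not the full 14-tuple dataset from the slides.

```python
# Fitting a decision tree with the entropy (information gain) criterion.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":     ["youth", "youth", "mid_age", "senior", "senior", "mid_age"],
    "student": ["no", "yes", "no", "yes", "no", "yes"],
    "credit":  ["fair", "fair", "excellent", "fair", "excellent", "fair"],
    "buys":    ["no", "yes", "yes", "yes", "no", "yes"],
})
X = pd.get_dummies(data.drop(columns="buys"))   # one-hot encode the categories
y = data["buys"]

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the tree
```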
Bayesian Classifiers :
• “What are Bayesian classifiers?” Bayesian classifiers are statistical
classifiers.
• They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
• Let X be a data tuple. In Bayesian terms, X is considered
“evidence.” As usual, it is described by measurements made on a
set of n attributes.
• Let H be some hypothesis, such as that the data tuple X belongs
to a specified class C.
• For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the
"evidence" or observed data tuple X.
• In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute description of X.
Bayesian Classifiers :
• Naive Bayesian classifiers assume that the effect of an attribute value on a
given class is independent of the values of the other attributes. This
assumption is called class conditional independence.
• Dependencies can exist between variables. Bayesian belief networks specify
joint conditional probability distributions. They allow class conditional
independencies to be defined between subsets of variables.
• They provide a graphical model of causal relationships, on which learning
can be performed.
Naïve Bayesian Example :
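The worked example from the original slides is not reproduced here; as a hedged substitute, the sketch below (assuming scikit-learn) shows the naive Bayesian recipe on made-up categorical data: estimate P(C) and P(xj | C) from counts (with Laplace smoothing) and pick the class that maximizes P(C) · ∏j P(xj | C).

```python
# Naive Bayes on categorical attributes, following Bayes' theorem with the
# class-conditional independence assumption.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Illustrative tuples: [age, student, credit] -> buys_computer (assumed data).
X_raw = [["youth", "no", "fair"], ["youth", "yes", "fair"],
         ["mid_age", "no", "excellent"], ["senior", "yes", "fair"],
         ["senior", "no", "excellent"], ["mid_age", "yes", "fair"]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

enc = OrdinalEncoder()                 # CategoricalNB expects ordinal-encoded input
X = enc.fit_transform(X_raw)

nb = CategoricalNB(alpha=1.0)          # alpha = 1.0 is Laplace smoothing
nb.fit(X, y)

x_new = enc.transform([["youth", "yes", "excellent"]])
print(nb.predict(x_new), nb.predict_proba(x_new))
```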