ML Unit 2
TOPICS
PART-1 Introduction to classification.
Generative Classifiers:
Classifying with Bayesian decision theory, Bayes’ rule, Naïve
Bayes classifier.
PART-2 Discriminative Classifiers:
Logistic Regression
Decision Trees: Training and Visualizing a Decision Tree,
Making Predictions, Estimating Class Probabilities, The CART
Training Algorithm, Attribute selection measures- Gini
impurity.
Entropy, Regularization Hyperparameters, Regression Trees,
Linear Support vector machines.
Classification:
Supervised machine learning algorithms are of two types:
Regression
Classification
Regression algorithms predict the output for continuous values; to predict categorical values, we need classification algorithms.
The best-known example of an ML classification algorithm is an email spam detector.
WHY DO WE USE CLASSIFICATION:
Classification is used to categorize data into predefined classes or groups,
helping in decision-making.
It simplifies complex data by assigning labels, making it easier to analyze.
Classification is widely used in tasks like spam detection, medical diagnosis, and
image recognition. It also helps in predicting outcomes based on patterns in
data.
There are two types of classification:
Binary Classifier:
If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples:
YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-Class Classifier:
If a classification problem has more than two outcomes, it is called a Multi-class Classifier.
Example:
Classifications of types of crops, Classification of types of music
Types of ML Classification Algorithms:
Classification algorithms can be further divided into two main categories:
Linear Models
Non-linear Models
Linear Models
Logistic Regression
Support Vector Machines
Non-linear Models
K-Nearest Neighbours
Naive Bayes
Decision Tree Classification
Random Forest Classification
Support Vector Machines
CLASSIFIERS
The algorithm which implements the classification on a dataset is known as a
classifier.
There are two broad categories of classification models in machine learning:
Generative Classifiers:
Definition:
Generative classifiers model the joint probability P(x, c) — typically through the class-conditional likelihood P(x|c) and the class prior P(c) — where x represents the features and c is the class label.
These classifiers learn how the data is generated for each class, which allows them not only to perform classification but also to generate new data samples by sampling from the learned distribution.
Discriminative Classifiers:
Definition:
Discriminative classifiers model the conditional probability P(c|x), which is the probability of the class label given the observed features.
They focus on directly learning the boundary between classes, without modeling how the data is generated.
Types of Discriminative Classifiers:
Logistic Regression
Support Vector Machines (SVM)
Decision Trees
Random Forests
k-Nearest Neighbors(k-NN)
Neural Networks (Deep Learning)
Generative Classifiers
Classifying with Bayesian decision theory
Bayes’ rule
Naïve Bayes classifier.
Bayesian Decision Theory
Key Concepts:
Prior Probability: Initial belief about the likelihood of a class.
Likelihood: How likely it is to observe particular features given the class.
Posterior Probability: Updated belief after seeing the data, calculated using Bayes’
rule.
Decision Boundaries: Set where the expected losses of two decisions are equal.
Bayes' rule
Step 1: Write the conditional probability formula:
P(A|B) = P(A ∩ B) / P(B), so P(A ∩ B) = P(A|B) · P(B)
Step 2: Likewise, P(B|A) = P(A ∩ B) / P(A), so P(A ∩ B) = P(B|A) · P(A)
Step 3: Since both expressions equal P(A ∩ B), equate them and solve for P(A|B) to obtain Bayes' rule:
P(A|B) = P(B|A) · P(A) / P(B)
Bayes' theorem is also known as Bayes' Rule or Bayes' law; it is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.
The formula for Bayes' theorem is given as:
P(h|D) = P(D|h) · P(h) / P(D)
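To make the rule concrete, here is a minimal Python sketch with made-up spam-filter probabilities (the numbers are assumptions for illustration only):

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 80% of spam contains the word "free";
# 40% of all mail is spam; 50% of all mail contains "free".
p_spam_given_free = bayes(p_b_given_a=0.8, p_a=0.4, p_b=0.5)
print(p_spam_given_free)  # 0.64 -> posterior probability of spam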
Naïve Bayes classifier:
The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
The Naïve Bayes classifier is used to separate the classes.
It works with categorical attribute values.
v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
where v_NB denotes the target value output by the naive Bayes classifier.
argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
STEP 1: Calculate the prior probabilities: P(PlayTennis=Yes) = 9/14 = 0.64, P(PlayTennis=No) = 5/14 = 0.36.
STEP 2: Build the conditional probability tables from the training set:

OUTLOOK    YES  NO        TEMP   YES  NO
SUNNY      2/9  3/5       HOT    2/9  2/5
OVERCAST   4/9  0/5       MILD   4/9  2/5
RAIN       3/9  2/5       COOL   3/9  1/5

HUMIDITY   YES  NO        WIND    YES  NO
HIGH       3/9  4/5       STRONG  3/9  3/5
NORMAL     6/9  1/5       WEAK    6/9  2/5
STEP 3: To classify the new instance, apply the conditional probabilities from the training set.
TEST DATA: (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
FOR YES:
v_NB(Yes) = P(PlayTennis=Yes) · P(Outlook=Sunny|Yes) · P(Temp=Cool|Yes) · P(Humidity=High|Yes) · P(Wind=Strong|Yes)
          = 0.64 × 2/9 × 3/9 × 3/9 × 3/9
          = 0.0053
FOR NO:
v_NB(No) = P(PlayTennis=No) · P(Outlook=Sunny|No) · P(Temp=Cool|No) · P(Humidity=High|No) · P(Wind=Strong|No)
         = 0.36 × 3/5 × 1/5 × 4/5 × 3/5
         = 0.0207
STEP 4: Now use Bayes' theorem to normalize the posterior probabilities.
FOR YES:
P(Yes) = 0.0053 / (0.0053 + 0.0207) ≈ 0.20
FOR NO:
P(No) = 0.0207 / (0.0053 + 0.0207) ≈ 0.80
Since P(No) > P(Yes), on this day the player will not play the game.
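A minimal Python sketch of this whole computation, with the priors and conditional probabilities hard-coded from the tables above:

# Naive Bayes: v_NB = argmax_v P(v) * prod_i P(a_i | v)
priors = {"Yes": 9/14, "No": 5/14}
cond = {  # P(attribute=value | class), taken from the tables above
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}
instance = ["Sunny", "Cool", "High", "Strong"]

scores = {}
for label in priors:
    score = priors[label]
    for value in instance:
        score *= cond[label][value]
    scores[label] = score

total = sum(scores.values())
for label, score in scores.items():
    print(label, round(score, 4), "normalized:", round(score / total, 2))
# Yes 0.0053 normalized: 0.2 ; No 0.0206 normalized: 0.8 -> predict "No"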
Example 2 (a survival dataset of 10 samples with attributes Class, Sex, Age and target Survived):
Step 1: Calculate the prior probabilities for the target values.
Samples = 10
P(Survived=Yes) = 4/10 = 0.4
P(Survived=No) = 6/10 = 0.6
Step 2: Build the conditional probability tables:

CLASS    YES  NO        AGE     YES  NO
FIRST    2/4  1/6       ADULT   2/4  6/6
SECOND   1/4  2/6       CHILD   2/4  0
THIRD    2/4  1/6
CREW     0    1/6       SEX     YES  NO
                        MALE    0    5/6
                        FEMALE  2/4  0
STEP 3: Classify the new instance (Class=Third, Sex=Male, Age=Child) using the conditional probabilities from the training set.
FOR YES:
v_NB(Yes) = P(Survived=Yes) · P(Class=Third|Yes) · P(Sex=Male|Yes) · P(Age=Child|Yes)
          = 0.4 × 2/4 × 0 × 2/4
          = 0
FOR NO:
v_NB(No) = P(Survived=No) · P(Class=Third|No) · P(Sex=Male|No) · P(Age=Child|No)
         = 0.6 × 1/6 × 5/6 × 0
         = 0
STEP 4: Now use Bayes' theorem to calculate the normalized posterior probabilities.
Both v_NB(Yes) and v_NB(No) are 0, so the normalized posteriors are 0/0 — undefined. This is the zero-frequency problem: a single attribute value never seen with a class forces the whole product to zero.
The fix is Laplace (add-one) smoothing: add K = 1 to every count so that no conditional probability is exactly zero.
With K = 1:
P(Sex=Male|Yes) = (0 + 1) / (4 + 2) = 1/6
P(Age=Child|No) = (0 + 1) / (6 + 2) = 1/8
Replace the zero probabilities and recalculate the posteriors:
FOR NO:
v_NB(No) = 0.6 × 2/6 × 5/6 × 1/8 ≈ 0.021
FOR YES:
v_NB(Yes) = 0.4 × 1/4 × 1/6 × 2/4 ≈ 0.008
Normalized posteriors:
P(No) = 0.021 / (0.021 + 0.008) ≈ 0.72
P(Yes) = 0.008 / (0.021 + 0.008) ≈ 0.28
Result: Since P(No) > P(Yes), the passenger is classified as not survived.
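A short sketch of the add-K estimate used above (counts taken from this example):

# Laplace (add-K) smoothing for a conditional probability estimate:
# P(value | class) = (count(value, class) + K) / (count(class) + K * num_values)
def smoothed_prob(value_count, class_count, num_values, k=1):
    return (value_count + k) / (class_count + k * num_values)

# P(Sex=Male | Survived=Yes): 0 of 4 survivors are male; Sex has 2 values
print(smoothed_prob(0, 4, 2))  # 1/6 ~ 0.1667
# P(Age=Child | Survived=No): 0 of 6 non-survivors are children; Age has 2 values
print(smoothed_prob(0, 6, 2))  # 1/8 = 0.125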
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
Discriminative Classifiers
Logistic Regression
Decision Trees: Training and Visualizing a
Decision Tree, Making Predictions, Estimating
Class Probabilities, The CART Training Algorithm,
Attribute selection measures- Gini impurity.
Entropy.
Regularization Hyperparameters, Regression
Trees, Linear Support vector machines.
LOGISTIC REGRESSION:
Logistic regression is a type of regression used for classification problems, where the
output variable is categorical in nature. Logistic regression uses a logistic function
to predict the probability of the input belonging to a particular category.
OR
Logistic regression (logit regression) is a type of regression analysis used for predicting
the outcome of a categorical dependent variable.
What is the Sigmoid Function?
It is a mathematical function that can take any real value and map it to a value between 0 and 1, with a curve shaped like the letter "S". The sigmoid function is also called the logistic function:
σ(z) = 1 / (1 + e^(−z))
As shown in the figure above, the sigmoid function converts a continuous variable into a probability, i.e., a value between 0 and 1:
σ(z) tends towards 1 as z → ∞
σ(z) tends towards 0 as z → −∞
σ(0) = 0.5, and σ(z) is always bounded between 0 and 1
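A minimal Python sketch of the sigmoid and its limiting behavior:

import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.99995, approaching 1
print(sigmoid(-10))  # ~0.00005, approaching 0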
Advantages: It can easily extend to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
Disadvantages: The major limitation of Logistic Regression is the assumption of linearity between the dependent variable and the independent variables.
Logistic Regression Equation:
The odds are the ratio of something occurring to something not occurring. Odds differ from probability: probability is the ratio of something occurring to everything that could possibly occur. So the odds are:
odds(x) = p / (1 − p), where p is the probability of success.
The odds of success are modeled as the exponential of the linear regression output z:
odds(x) = e^z, with z = w · x + b
Then the logistic regression equation will be:
p = 1 / (1 + e^(−(w · x + b)))
Example: Predict whether a candidate is selected based on entrance marks, with b0 = 1, b1 = 8, x = 60.
z = b0 + b1 · x = 1 + 8 × 60 = 481
Y = 1 / (1 + e^(−481)) ≈ 1
So the candidate is predicted to be selected.
Example:
Calculate the probability of passing for a student who studied 33 hours.
Hours = 33
The fitted linear part in this example is z = −64 + 2 × hours. The final logistic regression equation gives:
z = −64 + 2 × 33 = 2
Y = 1 / (1 + e^(−2)) ≈ 0.88
So the student passes with probability about 0.88.
At least how many hours should a student study to pass the course with probability greater than 95%?
Set Y = 0.95 and solve for hours:
0.95 = 1 / (1 + e^(−z))
0.95 (1 + e^(−z)) = 1
0.95 e^(−z) = 1 − 0.95 = 0.05
e^(−z) = 0.05 / 0.95 ≈ 0.053
ln(e^(−z)) = ln(0.053)
−z = −2.94; the negative sign on both sides cancels, so z = 2.94
Now substitute z = −64 + 2 × hours:
−64 + 2 × hours = 2.94
2 × hours = 66.94
hours = 33.47
Hence, a student should study approximately 33.5 hours to pass the course with probability greater than 95%.
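The same two calculations as a small Python sketch, with the coefficients z = −64 + 2·hours taken from the example above:

import math

# Fitted coefficients from the example: z = -64 + 2 * hours
b0, b1 = -64.0, 2.0

def pass_probability(hours):
    z = b0 + b1 * hours
    return 1.0 / (1.0 + math.exp(-z))

print(round(pass_probability(33), 2))  # 0.88

# Invert the sigmoid to find the hours needed for p = 0.95:
# z = -ln(1/p - 1), then hours = (z - b0) / b1
p = 0.95
z = -math.log(1.0 / p - 1.0)
print(round((z - b0) / b1, 2))  # 33.47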
Logistic Regression is mainly divided into the following topics:
1) Estimating Probabilities
2) Training and Cost Function
3) Decision Boundaries
4) Softmax Regression
1) Estimating Probabilities:
Logistic Regression utilizes a more sophisticated hypothesis, known as the "sigmoid function" or "logistic function", instead of a linear function.
The logistic — noted σ(·) — is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1.
Logistic function: σ(z) = 1 / (1 + e^(−z))
2. Training and Cost Function:
Training:
The model is trained using labeled data, optimizing the weights w and bias b to minimize the cost function.
The cost function over the whole training set is simply the average cost over all training instances, called the log loss:
J(θ) = −(1/m) Σᵢ [ yᵢ log(p̂ᵢ) + (1 − yᵢ) log(1 − p̂ᵢ) ]
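A minimal NumPy sketch of the log loss on made-up labels and predicted probabilities:

import numpy as np

def log_loss(y_true, p_hat, eps=1e-12):
    """Average cross-entropy (log loss) over all training instances."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])          # hypothetical labels
p = np.array([0.9, 0.1, 0.8, 0.6])  # hypothetical predicted probabilities
print(log_loss(y, p))               # ~0.24; lower is better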
3. Decision Boundaries
The predicted probability is compared to a threshold to assign the class of a data point.
Decision boundary formula
In a simple binary logistic regression model with two features, the decision boundary
formula is: 0 = b0 + b1x1 + b2x2.
In this formula, b0, b1, and b2 are the model parameters, and x1 and x2 are the two
features.
You can create a decision boundary by fitting a model on the training dataset, then using
the model to make predictions for a set of data values.
Estimated probabilities and decision boundary:
The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 1.0 cm to 1.8 cm.
Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the estimated probability P(y=1|x) > 0.5 the model predicts Iris-Virginica, otherwise it predicts the other class.
4. Softmax Regression:
The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers.
This is called Softmax Regression, or Multinomial Logistic Regression.
Softmax score for class k: s_k(x) = θ_kᵀ · x
Softmax function: p̂_k = exp(s_k(x)) / Σ_j exp(s_j(x))
Multinomial Logistic Regression: the target variable can have 3 or more possible types which are not ordered (i.e., the types have no quantitative significance), like "disease A" vs "disease B" vs "disease C".
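A minimal NumPy sketch of the softmax function on hypothetical class scores:

import numpy as np

def softmax(scores):
    """Turn a vector of class scores s_k(x) into probabilities."""
    exps = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exps / exps.sum()

# Hypothetical scores for three unordered classes (disease A, B, C):
print(softmax(np.array([2.0, 1.0, 0.1])))  # -> [~0.66, ~0.24, ~0.10], sums to 1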
PART2
DECISION TREES
The CART Training Algorithm
Attribute selection measures- Gini impurity,
Entropy.
Training and Visualizing a Decision Tree
Making Predictions
Estimating Class Probabilities
Regularization Hyperparameters, Regression
Trees, Linear Support vector machines.
DECISION TREE CLASSIFICATION:
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
It is a tree-structured classifier,
Internal nodes represent the features of a dataset,
Branches represent the decision rules and
Each leaf node represents the outcome.
Leaf node is called the purest node.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
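As a quick illustration of training, visualizing, predicting, and estimating class probabilities, here is a minimal scikit-learn sketch (assuming scikit-learn is available; the iris dataset and max_depth=2 are choices made for this example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on the iris dataset
iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and width
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Visualize the learned rules as text
print(export_text(tree_clf, feature_names=["petal length", "petal width"]))

# Making predictions and estimating class probabilities
print(tree_clf.predict([[5.0, 1.5]]))        # predicted class
print(tree_clf.predict_proba([[5.0, 1.5]]))  # per-class probabilities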
Types of tree algorithms:
In order to build a tree, we use:
The CART algorithm, which stands for Classification and Regression Tree algorithm.
The Iterative Dichotomiser 3 (ID3) algorithm.
Why use Decision Trees?
Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Pruning:
Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
Entropy:
It measures the amount of uncertainty or impurity in a dataset:
Entropy(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)
Information Gain:
It measures the reduction in entropy after the dataset is split on an attribute:
Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)
where S_v is the subset of S after splitting on attribute A.
Note:
• If the numbers of positive and negative examples are equal, the entropy is exactly 1.0.
• If either the positives or the negatives number 0, the entropy is 0.0.
Entropy and information gain of the whole dataset:
Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

Outlook    Yes  No
Sunny      2    3
Overcast   4    0
Rain       3    2

Entropy_Sunny([2+, 3−]) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
Entropy_Overcast([4+, 0−]) = 0.0
Entropy_Rain([3+, 2−]) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971
Gain(S, Outlook) = Entropy(S) − (5/14) · Entropy(S_Sunny) − (4/14) · Entropy(S_Overcast) − (5/14) · Entropy(S_Rain)
                 = 0.940 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971
                 = 0.2464
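The same entropy and information-gain computation as a short Python sketch:

import math

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Play-tennis dataset: 9 Yes, 5 No
parent = entropy(9, 5)              # ~0.940
subsets = [(2, 3), (4, 0), (3, 2)]  # Sunny, Overcast, Rain
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)
print(round(parent - weighted, 4))  # information gain for Outlook, ~0.246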
Repeating the same computation for the other attributes gives the standard values:
Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029
Maximum information gain is obtained for "Outlook",
therefore Outlook becomes the root node.
Decision tree after finding information gain for the whole dataset:
Since Overcast gives the purest node, Overcast is labeled Yes directly and needs no further splitting.

Outlook
├── Sunny → (split further)
├── Overcast → Yes
└── Rain → (split further)

After splitting the whole dataset:
Now we need to split Sunny and Rain, since they are still impure nodes.
To get the purest nodes for both Sunny and Rain, we repeat the same process for the attributes Humidity, Temperature, and Wind.
Based on the maximum information gain, we place the internal nodes as either Humidity, Temperature, or Wind.
We continue until only pure nodes enter the decision tree and we reach the leaf nodes.
Problems on Decision Trees:
Consider the following set of training examples:
a) What is the entropy of this collection of training examples with respect to the target function
classification?
b) What is the information gain of a2 relative to these training examples?
CART ALGORITHM
Classification and Regression Trees (CART) is a decision tree algorithm that is used for
both classification and regression tasks.
It is a supervised learning algorithm that learns from labelled data to predict the outcome
of unseen data.
The CART algorithm builds a decision tree on the basis of the Gini impurity index.
It works by recursively partitioning the data into smaller and smaller subsets based on certain criteria (conditions).
The goal is to create a tree structure that can accurately classify or predict the
target value for new data points.
Mathematically, we can write Gini impurity as follows:
Gini = 1 − Σᵢ (pᵢ)²
Here pᵢ indicates the probability of class i.
The CART algorithm uses the following systematic approach to build decision trees.
1. Feature Selection:
•Start by evaluating each feature's ability to classify the data set effectively.
•Measure the impurity of attributes using metrics like the Gini Index for classification or
mean squared error for regression.
•Choose the feature and the split point that results in the most significant reduction in
impurity.
2. Splitting:
• CART uses a greedy approach to split the data at each node.
• It evaluates all possible splits and selects the one that best reduces the impurity of the resulting subsets.
• Once the best feature is identified, split the data set into subsets based on this attribute.
• This creates child nodes, each representing a subset of the data set based on the selected feature's value.
3. Tree Building:
• The tree is created by recursively applying the above two steps to each child node, considering only the subset of data within that node.
• Continue this process until a stopping criterion is met, such as a maximum tree depth or a minimum number of samples in a node.
4. Tree Pruning:
• Once the full tree is created, the pruning process begins: branches that add little predictive power are removed (CART typically uses cost-complexity pruning).
Advantages of CART
Results are simple to interpret.
Classification and regression trees are Nonparametric and Nonlinear.
Classification and regression trees implicitly perform feature selection.
Outliers have no meaningful effect on CART.
It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
Overfitting.
High variance.
Low bias.
The tree structure may be unstable.
Applications of the CART algorithm
For quick Data insights.
In Blood Donors Classification.
For environmental and ecological data.
In the financial sectors.
Examples:
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Step 1: Calculate the Gini index for each attribute.

Outlook:
Outlook    Yes  No  Instances
Sunny      2    3   5
Overcast   4    0   4
Rain       3    2   5
Gini(Outlook=Sunny) = 1 − (2/5)² − (3/5)² = 0.48
Gini(Outlook=Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Outlook=Rain) = 1 − (3/5)² − (2/5)² = 0.48
Weighted Gini index for Outlook:
Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.343

Temperature:
Temperature  Yes  No  Instances
Hot          2    2   4
Cool         3    1   4
Mild         4    2   6
Gini(Temp=Hot) = 1 − (2/4)² − (2/4)² = 0.5
Gini(Temp=Cool) = 1 − (3/4)² − (1/4)² = 0.375
Gini(Temp=Mild) = 1 − (4/6)² − (2/6)² = 0.444
Weighted Gini index for Temperature:
Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.444 = 0.440

Humidity:
Humidity   Yes  No  Instances
High       3    4   7
Normal     6    1   7
Gini(Humidity=High) = 1 − (3/7)² − (4/7)² = 0.489
Gini(Humidity=Normal) = 1 − (6/7)² − (1/7)² = 0.245
Weighted Gini index for Humidity:
Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.245 = 0.367

Wind:
Wind     Yes  No  Instances
Weak     6    2   8
Strong   3    3   6
Gini(Wind=Weak) = 1 − (6/8)² − (2/8)² = 0.375
Gini(Wind=Strong) = 1 − (3/6)² − (3/6)² = 0.5
Weighted Gini index for Wind:
Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.429

We have calculated the weighted Gini index for each feature. We select the Outlook feature because its cost is the lowest:
Feature      Gini index
Outlook      0.343
Temperature  0.440
Humidity     0.367
Wind         0.429
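The same feature-selection step as a short Python sketch, with the class counts taken from the tables above:

def gini(counts):
    """Gini impurity for class counts, e.g. [2, 3] -> 0.48."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts (Yes, No) per attribute value in the play-tennis data
splits = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],
    "Temperature": [(2, 2), (3, 1), (4, 2)],
    "Humidity":    [(3, 4), (6, 1)],
    "Wind":        [(6, 2), (3, 3)],
}

n = 14
for feature, groups in splits.items():
    weighted = sum((sum(g) / n) * gini(g) for g in groups)
    print(feature, round(weighted, 3))
# Outlook has the lowest weighted Gini (~0.343), so it becomes the root.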
Sunny dataset
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
EXAMPLE OF CART ALGORITHM:
(Figure: a CART tree — a root condition splits into True and False branches, each ending in leaf nodes.)
Regression Trees
Regression Trees are a type of decision tree used when the target variable is continuous (numerical).
To build a regression tree, splits are chosen using the standard deviation of the target values. The formula for standard deviation (SD) is:
SD = sqrt( Σ (xᵢ − x̄)² / n )
EXAMPLE (a small student-records dataset with attributes Assessment, Assignment, Project and a continuous target Result):
Step 1:
Compute the standard deviation of all target values.
For Result:
Average = 75
SD = 15.69
Step 2: Calculate the SD for each feature.
Attribute: Assessment
Assessment = Good: SD = 9.74
Assessment = Average: SD = 9.00
Assessment = Poor: SD = 10
Attribute: Assignment
Assignment = Yes: SD = 13.40
Assignment = No: SD = 13.14
The weighted standard deviation is:
wsd = 13.27
Weighted reduction in standard deviation: RSD = 15.69 − 13.27 = 2.42
Attribute: Project
Project = Yes: SD = 11.54
Project = No: SD = 11.59
The weighted standard deviation is:
wsd = 11.56
Weighted reduction in standard deviation: RSD = 15.69 − 11.56 = 4.13
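A minimal Python sketch of the SD-reduction computation; since the full dataset is not reproduced in these notes, the target values below are partly hypothetical:

import math

def sd(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def rsd(parent_values, groups):
    """Reduction in standard deviation for a candidate split.
    groups: one list of target values per attribute value."""
    n = len(parent_values)
    weighted = sum(len(g) / n * sd(g) for g in groups)
    return sd(parent_values) - weighted

# Partly hypothetical target values (only 95, 98, 89, 75, 75, 58, 70
# appear in these notes; the rest are made up to run the sketch):
parent = [95, 98, 89, 75, 75, 58, 70, 60, 68, 62]
split = [[95, 98, 89, 75, 75], [58, 70, 60, 68, 62]]
print(round(rsd(parent, split), 2))  # prints the RSD for this candidate split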
RSD summary:
Assessment  6.12
Assignment  2.42
Project     4.13
According to the highest reduction in standard deviation (RSD), we select Assessment as the root node.
Now that the root node is declared, we continue the process until the tree is formed.
Under Assessment = Good (SD = 9.74), evaluate the remaining attributes:
Attribute: Assignment
Assignment = Yes: Average = (95 + 98 + 89) / 3 = 94, SD = 3.74
Assignment = No: Average = (75 + 75) / 2 = 75, SD = 0
The weighted standard deviation is:
wsd = (3/5) × 3.74 + (2/5) × 0 = 2.244
Weighted reduction in standard deviation: RSD = 9.74 − 2.244 = 7.496
Similarly, the Project attribute gives a smaller reduction, RSD = 2.44.
RSD summary under Assessment = Good:
Assignment  7.49
Project     2.44
According to the highest reduction in standard deviation (RSD), we select Assignment as the child node.
FINAL TREE:
(Figure: Assessment at the root; under Assessment = Good the tree splits on Assignment, and the remaining branches split on Project, with each leaf holding the average Result of its matching records.)
Disadvantages of the Decision Tree
The decision tree contains lots of layers, which makes it complex.
It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
For more class labels, the computational complexity of the decision tree may
increase.
Support Vector Machine:
SVMs are particularly well suited for classification of complex but small- or medium-
sized datasets.
A Support Vector Machine (SVM) performs classification by finding the hyperplane
that maximizes the margin between the two classes.
A hyperplane is a decision boundary that differentiates the two classes in SVM.
GRAPH FOR LINEAR AND NON-LINEAR PROBLEMS:
(Figure: linearly separable data vs. data that requires a non-linear boundary.)
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
Types of Support Vector Machines:
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs are
very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space,
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.
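A minimal scikit-learn sketch (assuming scikit-learn is available) contrasting the two on synthetic two-moons data:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Linear SVM: a straight-line decision boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Non-linear SVM: an RBF kernel maps the data to a higher-dimensional
# space where a linear separator exists
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))  # typically higher here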
1) Linear SVM Classification
Hard Margin vs Soft Margin in SVM

Criteria             Hard Margin                                         Soft Margin
Objective Function   Maximize margin.                                    Maximize margin, minimize margin violations.
Handling Noise       Sensitive; requires perfectly linearly              Robust; handles noisy data with
                     separable data.                                     margin violations.
Regularization       Not applicable; no regularization parameter.        Controlled by regularization parameter C.
Complexity           Simple, computationally efficient.                  May require more computational resources.
Non-Linear SVM:
If data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line.
By adding a third dimension
z = x² + y²
the sample space becomes separable, as shown in the image below:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separator looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, the boundary becomes a circle.
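A tiny NumPy sketch of this z = x² + y² lifting, with made-up points:

import numpy as np

# Map 2-D points to 3-D with z = x^2 + y^2 so a plane can separate them.
# Toy data: an inner cluster (class 0) surrounded by an outer ring (class 1).
inner = np.array([[0.1, 0.2], [-0.3, 0.1], [0.2, -0.2]])   # class 0
outer = np.array([[2.0, 0.0], [0.0, -2.1], [-1.9, 0.5]])   # class 1

def lift(points):
    z = (points ** 2).sum(axis=1, keepdims=True)  # z = x^2 + y^2
    return np.hstack([points, z])

print(lift(inner))  # z values near 0
print(lift(outer))  # z values near 4; the plane z = 1 separates the classes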