ML Unit 2

Unit-2 covers classification in machine learning, focusing on generative and discriminative classifiers, including algorithms like Naïve Bayes, logistic regression, and decision trees. It explains the importance of classification for categorizing data, decision-making, and various applications such as spam detection and medical diagnosis. The document also details Bayesian decision theory, the Naïve Bayes classifier, and the logistic regression model, including their advantages and disadvantages.

TOPICS COVERED IN UNIT-2

S.No    TOPICS
PART-1  Introduction to classification.
        Generative Classifiers: Classifying with Bayesian decision theory, Bayes' rule, Naïve Bayes classifier.
PART-2  Discriminative Classifiers: Logistic Regression.
        Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures - Gini impurity, Entropy, Regularization Hyperparameters, Regression Trees, Linear Support Vector Machines.
Classification:
 Supervised Machine Learning algorithms are of two types:
   Regression
   Classification
 Regression algorithms predict continuous output values; to predict categorical values, we need Classification algorithms.

   y = f(x), where y is a categorical output (e.g., 0 or 1)

 A classic example of an ML classification algorithm is an Email Spam Detector.
WHY DO WE USE CLASSIFICATION:
 Classification is used to categorize data into predefined classes or groups,
helping in decision-making.
 It simplifies complex data by assigning labels, making it easier to analyze.
 Classification is widely used in tasks like spam detection, medical diagnosis, and
image recognition. It also helps in predicting outcomes based on patterns in
data.

There are two types of Classification:

Binary Classifier:
If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples:
YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-Class Classifier:
If a classification problem has more than two outcomes, it is called a Multi-Class Classifier.
Examples:
Classification of types of crops, classification of types of music.
Types of ML Classification Algorithms:
Classification algorithms can be further divided into two main categories:
 Linear Models
 Non-linear Models

Linear Models
 Logistic Regression
 Support Vector Machines (linear kernel)
Non-linear Models
 K-Nearest Neighbours
 Naive Bayes
 Decision Tree Classification
 Random Forest Classification
 Support Vector Machines (non-linear kernels)
CLASSIFIERS
The algorithm which implements the classification on a dataset is known as a
classifier.
 There are two broad categories of classification models in machine learning.
Generative Classifiers:
 Definition:
Generative classifiers model the class prior P(c) and the class-conditional distribution P(x|c)
(equivalently, the joint distribution P(x, c)), where x represents the features and c is the class label.
These classifiers learn how the data is generated for each class, which allows them not only to
perform classification (via Bayes' rule) but also to generate new data samples by sampling from the
learned distribution.

 Types of Generative Classifiers:
 Naïve Bayes Classifier
 Gaussian Mixture Models (GMM)
Discriminative Classifiers:
Definition:
 Discriminative classifiers model the conditional probability P(c|x), which is the
probability of the class label given the observed features.
 They focus on directly learning the boundary between classes, without
modeling how the data is generated.
Types of Discriminative Classifiers:
 Logistic Regression
 Support Vector Machines (SVM)
 Decision Trees
 Random Forests
 k-Nearest Neighbors (k-NN)
 Neural Networks (Deep Learning)
Generative Classifiers
 Classifying with Bayesian decision theory
 Bayes’ rule
 Naïve Bayes classifier.

Bayesian Decision Theory

 A framework for decision-making under uncertainty.
 Involves making decisions based on probabilities and minimizing risk (expected loss).

 Key Concepts:
 Prior Probability: Initial belief about the likelihood of a class.
 Likelihood: How likely it is to observe a particular feature given the class.
 Posterior Probability: Updated belief after seeing the data, calculated using Bayes' rule.
 Decision Boundaries: The set of points where the expected losses of two decisions are equal.
Bayes' rule
Step 1: Write the conditional probability formulas:

   P(A|B) = P(A ∩ B) / P(B)        and        P(B|A) = P(A ∩ B) / P(A)

Step 2: Combine the probabilities:

   P(A ∩ B) = P(A|B) P(B)          and        P(A ∩ B) = P(B|A) P(A)

Step 3: Equating the two expressions for P(A ∩ B):

   P(A|B) P(B) = P(B|A) P(A)
   P(A|B) = P(B|A) P(A) / P(B)      (Bayes' rule)
 Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine
the probability of a hypothesis with prior knowledge. It depends on conditional
probability.
 The formula for Bayes' theorem is given as:

   P(h|D) = P(D|h) P(h) / P(D)

Where:
 P(h|D) is the Posterior probability: the probability of hypothesis h given the data D.
 P(D|h) is the Likelihood: the probability of the data D given hypothesis h.
 P(h) is the Prior probability of hypothesis h.
 P(D) is the Marginal (total) probability of the data D.
Naïve Bayes classifier:
 The naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and where the target function f(x) can take on any
value from some finite set V.
 The Naïve Bayes classifier is used to separate the classes.
 It works with categorical attribute values.
The naive Bayes classifier outputs the target value

   v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)

where v_NB denotes the target value output by the naive Bayes classifier and the a_i are the
attribute values of the new instance.

 argmax is most commonly used in machine learning for finding the class
with the largest predicted probability.
Steps of the Naïve Bayes classifier:

 Convert the given dataset into frequency tables.
 Calculate the prior probabilities for the target values.
 Generate the likelihood tables by finding the probabilities of the given features.
 Calculate the conditional probabilities for the individual attribute values.
 For a test instance, apply these conditional probabilities estimated from the training set.
 Finally, use Bayes' theorem to calculate the posterior probability of each class and pick the largest.
Example:

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Problem: (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
Predict: Yes or No
Step 1: Calculate the prior probabilities for the target values.
Samples = 14
P(PlayTennis = Yes) = 9/14 ≈ 0.64
P(PlayTennis = No)  = 5/14 ≈ 0.36

Step 2: Calculate the conditional probabilities for the individual attribute values.

OUTLOOK   YES  NO        TEMP  YES  NO
SUNNY     2/9  3/5       HOT   2/9  2/5
RAIN      3/9  2/5       MILD  4/9  2/5
OVERCAST  4/9  0/5       COOL  3/9  1/5

WIND      YES  NO        HUMIDITY  YES  NO
STRONG    3/9  3/5       HIGH      3/9  4/5
WEAK      6/9  2/5       NORMAL    6/9  1/5
Step 3: To test the new instance, apply the conditional probabilities estimated from the
training set.
TEST DATA: (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)

FOR YES:
v_NB = P(PlayTennis=Yes) · P(Outlook=Sunny|Yes) · P(Temp=Cool|Yes) · P(Humidity=High|Yes) · P(Wind=Strong|Yes)
     = 0.64 · 2/9 · 3/9 · 3/9 · 3/9
v_NB(Yes) ≈ 0.005

FOR NO:
v_NB = P(PlayTennis=No) · P(Outlook=Sunny|No) · P(Temp=Cool|No) · P(Humidity=High|No) · P(Wind=Strong|No)
     = 0.36 · 3/5 · 1/5 · 4/5 · 3/5
v_NB(No) ≈ 0.0207
Step 4: Now normalize to obtain the posterior probabilities.

FOR YES:
P(Yes) = 0.005 / (0.005 + 0.0207) ≈ 0.20

FOR NO:
P(No) = 0.0207 / (0.005 + 0.0207) ≈ 0.80

 Since P(No) > P(Yes), the classifier predicts PlayTennis = No: on this day the player will not
play the game.
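The same calculation can be reproduced in a few lines of plain Python. This is a minimal sketch that hard-codes the counts from the frequency tables above; the variable names are only illustrative.

```python
# Naive Bayes by hand for the PlayTennis example above.
# Conditional probabilities P(attribute value | class) are taken from the Step 2 tables.
p_yes, p_no = 9/14, 5/14

likelihood_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
likelihood_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}

test = ["Sunny", "Cool", "High", "Strong"]

score_yes, score_no = p_yes, p_no
for value in test:
    score_yes *= likelihood_yes[value]
    score_no *= likelihood_no[value]

# Normalize to get posterior probabilities.
total = score_yes + score_no
print(round(score_yes, 4), round(score_no, 4))                   # ~0.0053 and ~0.0206
print(round(score_yes / total, 2), round(score_no / total, 2))   # ~0.2 vs ~0.8 -> predict "No"
```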
Example 2:

Day  Class   Age    Sex     Survived
1    First   Adult  Male    No
2    Third   Child  Female  Yes
3    Second  Adult  Female  Yes
4    Third   Adult  Male    No
5    Crew    Adult  Male    No
6    First   Adult  Female  Yes
7    Second  Child  Female  Yes
8    Third   Adult  Female  No
9    Second  Adult  Male    No
10   First   Adult  Male    No

Problem: (Class=Third, Sex=Male, Age=Child)
Predict: Yes or No
Step 1: Calculate the prior probabilities for the target values.
Samples = 10
P(Survived = Yes) = 4/10 = 0.4
P(Survived = No)  = 6/10 = 0.6

Step 2: Calculate the conditional probabilities for the individual attribute values.

CLASS   YES  NO        AGE    YES  NO
FIRST   1/4  2/6       ADULT  2/4  6/6
SECOND  2/4  1/6       CHILD  2/4  0/6
THIRD   1/4  2/6
CREW    0/4  1/6       SEX     YES  NO
                       MALE    0/4  5/6
                       FEMALE  4/4  1/6
Step 3: Test the new instance by applying the conditional probabilities from the training set.

TEST DATA: (Class=Third, Sex=Male, Age=Child)

FOR YES:
v_NB = P(Survived=Yes) · P(Class=Third|Yes) · P(Sex=Male|Yes) · P(Age=Child|Yes)
     = 0.4 · 1/4 · 0 · 2/4
v_NB(Yes) = 0

FOR NO:
v_NB = P(Survived=No) · P(Class=Third|No) · P(Sex=Male|No) · P(Age=Child|No)
     = 0.6 · 2/6 · 5/6 · 0
v_NB(No) = 0
Step 4: Now use Bayes' theorem to calculate the posterior probability.

FOR YES:  v_NB(Yes) = 0 / (0 + 0)   (undefined)
FOR NO:   v_NB(No)  = 0 / (0 + 0)   (undefined)

 Because one of the conditional probabilities is zero, both products collapse to zero and no
decision can be made.
 This is the Zero Probability (zero-frequency) problem.
 To resolve this problem we use the Laplace correction (Laplace smoothing):

   P(x|y) = (C(x, y) + k) / (C(y) + k·|X|)

 C(x, y) is the number of times the attribute value x appears with class y in the dataset.
 k is the smoothing parameter and should be greater than 0.
 |X| is the number of distinct values of the feature.
 C(y) is the number of Yes samples (or the number of No samples).
K = 1
P(Sex=Male | Yes) = (0 + 1) / (4 + 1·2) = 1/6
P(Age=Child | No) = (0 + 1) / (6 + 1·2) = 1/8

Replace these values and recalculate the posterior probability.

FOR NO:
v_NB = 0.6 · 2/6 · 5/6 · 1/8
v_NB(No) ≈ 0.021
v_NB(No), normalized = 0.021 / (0.021 + 0.008) ≈ 0.72

FOR YES:
v_NB = 0.4 · 1/4 · 1/6 · 2/4
v_NB(Yes) ≈ 0.008
v_NB(Yes), normalized = 0.008 / (0.021 + 0.008) ≈ 0.28

Result:
Since P(No) > P(Yes), the classifier predicts Survived = No.
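In practice this smoothing is built into library implementations. Below is a small sketch (assuming scikit-learn is installed) that fits `CategoricalNB` with `alpha=1` (add-one / Laplace smoothing) on the survival dataset above; the encoding choices and variable names are illustrative, and the library smooths every count, so its probabilities will differ slightly from the hand calculation.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Columns: Class, Age, Sex (categorical); target: Survived
X_raw = [["First", "Adult", "Male"],   ["Third", "Child", "Female"],
         ["Second", "Adult", "Female"], ["Third", "Adult", "Male"],
         ["Crew", "Adult", "Male"],     ["First", "Adult", "Female"],
         ["Second", "Child", "Female"], ["Third", "Adult", "Female"],
         ["Second", "Adult", "Male"],   ["First", "Adult", "Male"]]
y = ["No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No"]

enc = OrdinalEncoder()                 # map category strings to integer codes
X = enc.fit_transform(X_raw)

model = CategoricalNB(alpha=1)         # alpha=1 -> Laplace (add-one) smoothing
model.fit(X, y)

test = enc.transform([["Third", "Child", "Male"]])
print(model.predict(test))             # expected: ['No']
print(model.predict_proba(test))       # posterior probabilities for classes ['No', 'Yes']
```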
Example 3: (a further worked example and its output are shown as a figure in the original slides)
Disadvantages of the Naïve Bayes Classifier:
 Naive Bayes assumes that all features are independent of one another, so it cannot
learn relationships between features.

Types of Naïve Bayes Model:

There are three types of Naive Bayes model, which are given below:
Gaussian
 Used for continuous data.
Multinomial
 Used for count data (e.g., word frequencies).
Bernoulli
 Used for binary data.
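As a quick illustration, each variant is a drop-in estimator in scikit-learn. This is a sketch assuming scikit-learn is available; the tiny arrays are made up purely for demonstration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian: continuous features (e.g., measurements)
Xc = np.array([[1.2, 3.4], [0.9, 2.8], [5.1, 7.9], [4.8, 8.2]])
yc = [0, 0, 1, 1]
print(GaussianNB().fit(Xc, yc).predict([[1.0, 3.0]]))       # likely class 0

# Multinomial: count features (e.g., word frequencies per document)
Xm = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 3]])
ym = ["sports", "sports", "politics", "politics"]
print(MultinomialNB().fit(Xm, ym).predict([[1, 0, 0]]))

# Bernoulli: binary features (e.g., word present / absent)
Xb = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
yb = ["spam", "spam", "ham", "ham"]
print(BernoulliNB().fit(Xb, yb).predict([[1, 0, 1]]))
```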
Discriminative Classifiers
 Logistic Regression
 Decision Trees: Training and Visualizing a
Decision Tree, Making Predictions, Estimating
Class Probabilities, The CART Training Algorithm,
Attribute selection measures- Gini impurity.
Entropy.
 Regularization Hyperparameters, Regression
Trees, Linear Support vector machines.
LOGISTIC REGRESSION:
Logistic Regression:
 Logistic regression is a type of regression used for classification problems, where the
output variable is categorical in nature. Logistic regression uses the logistic function
to predict the probability of the input belonging to a particular category.

OR
 Logistic regression (logit regression) is a type of regression analysis used for predicting
the outcome of a categorical dependent variable.

 In logistic regression, the dependent variable (Y) is binary (0, 1) and the independent
variables (X) are continuous in nature.
What is the Sigmoid Function:
 It is a mathematical function that can take any real value and map it to a value between
0 and 1, with a curve shaped like the letter "S". The sigmoid function is also called the
logistic function:  σ(z) = 1 / (1 + e^(-z)).

As shown in the figure, the sigmoid function converts a continuous variable into
a probability, i.e. a value between 0 and 1.
 σ(z) tends towards 1 as z → +∞
 σ(z) tends towards 0 as z → −∞
 σ(z) is always bounded between 0 and 1, and σ(0) = 0.5
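A minimal NumPy sketch of the sigmoid (the function name and sample values are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]
```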

Advantages:
 It can easily be extended to multiple classes (multinomial regression) and gives a
natural probabilistic view of class predictions.
 Unlike Naïve Bayes, it does not suffer from the zero-probability problem.
 Regularization can easily be applied to the independent features.

Disadvantages:
 The major limitation of Logistic Regression is the assumption of linearity
between the log-odds of the dependent variable and the independent variables.
 It constructs only linear decision boundaries.
Logistic Regression Equation:
The odds are the ratio of the probability of something occurring to the probability of it not
occurring. (This is different from probability, which is the ratio of something occurring to
everything that could possibly occur.) So the odds are:

   odds(x) = p / (1 − p),   where p is the probability of the event.

The odds of success are set equal to the exponential of the linear regression output z:

   p / (1 − p) = e^z

 Applying the natural log to the odds, the log-odds (logit) is linear in the inputs:

   ln( p / (1 − p) ) = z = w·x + b

 Solving for p gives the sigmoid:   p = 1 / (1 + e^(−z)) = 1 / (1 + e^(−(w·x + b)))
Then the logistic regression equation will be:

   p = 1 / (1 + e^(−(b0 + b1·x)))

Example: based on the entrance marks, predict whether a candidate is selected or not.
Let b0 = 1, b1 = 8, x = 60.
   z = b0 + b1·x = 1 + 8·60 = 481
   Y = 1 / (1 + e^(−481)) ≈ 1

 Since Y ≈ 1, the candidate is predicted to be selected.
Example: (the hours-studied vs. pass/fail dataset and fitted model are shown as a figure in the original slides)
 Calculate the probability of passing for a student who studied 33 hours.

Hours = 33
The linear part (from the fitted model):   z = b0 + b1·hours = −64 + 2·hours
The final logistic regression equation:    p = 1 / (1 + e^(−z))

   z = −64 + 2(33) = 2
   p = 1 / (1 + e^(−2)) ≈ 0.88
At least how many hours should a student study so that the probability of passing the
course is more than 95%?

Hours = ?
The model:   z = −64 + 2·hours,   p = 1 / (1 + e^(−z)),   required p = 0.95

   0.95 = 1 / (1 + e^(−z))
   0.95 (1 + e^(−z)) = 1
   0.95 + 0.95 e^(−z) = 1
   0.95 e^(−z) = 1 − 0.95 = 0.05
   e^(−z) = 0.05 / 0.95 ≈ 0.053

Apply the natural log on both sides:

   ln(e^(−z)) = ln(0.053)
   −z = −2.94
   z = 2.94

Now solve for hours:

   −64 + 2·hours = 2.94
   2·hours = 2.94 + 64 = 66.94
   hours = 66.94 / 2 = 33.47

Hence:
A student should study approximately 33.5 hours to pass the course with a probability of
more than 95%.
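A short NumPy check of both calculations above (the coefficients b0 = −64, b1 = 2 are the ones used in this example):

```python
import numpy as np

b0, b1 = -64.0, 2.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Probability of passing after 33 hours of study
print(round(sigmoid(b0 + b1 * 33), 2))        # ~0.88

# Hours needed for a 95% pass probability: invert the sigmoid (logit)
p = 0.95
z = np.log(p / (1 - p))                       # ~2.944
print(round((z - b0) / b1, 2))                # ~33.47 hours
```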
Logistic Regression is discussed under the following headings:

1) Estimating Probabilities

2) Training and Cost Function

3) Decision Boundaries

4) Softmax Regression
1) Estimating Probabilities:
 Logistic Regression passes a linear combination of the input features through the
"sigmoid function" (logistic function) instead of outputting the linear value directly.

 Logistic Regression model estimated probability (vectorized form):

   p̂ = h_θ(x) = σ(θᵀ x)

The logistic function, noted σ(·), is a sigmoid (S-shaped) function that outputs a number
between 0 and 1.

Logistic Regression model prediction:

   ŷ = 0 if p̂ < 0.5,    ŷ = 1 if p̂ ≥ 0.5
Logistic function

2) Training and Cost Function:
 Training:
 The model is trained on labeled data, optimizing the weights w and bias b (i.e. the
parameter vector θ) to minimize the cost function.

Cost function of a single training instance:

   c(θ) = −log(p̂)        if y = 1
   c(θ) = −log(1 − p̂)    if y = 0

 The cost function over the whole training set is simply the average cost over all training
instances, called the log loss.

Logistic Regression cost function (log loss):

   J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾) ]
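A minimal NumPy sketch of the log-loss formula above (the toy values of `y_true` and `p_hat` are illustrative):

```python
import numpy as np

def log_loss(y_true, p_hat, eps=1e-12):
    """Average cross-entropy (log loss) over all training instances."""
    p_hat = np.clip(p_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y_true = np.array([1, 0, 1, 1])
p_hat  = np.array([0.9, 0.2, 0.7, 0.6])
print(round(log_loss(y_true, p_hat), 4))          # small value -> good predictions
```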
3) Decision Boundaries

 In logistic regression, a decision boundary is a line or curve that separates the classes in the
input space. It is the set of points where the estimated probability of an instance belonging to
the positive class is exactly 0.5.
 The decision boundary is based on the logistic function of a linear combination of the features.
 On one side of the boundary the estimated probability is above the threshold and the output is 1;
on the other side it is below the threshold and the output is 0.
 The predicted probability is compared to a threshold to set the class of a data point.

Decision boundary formula:
In a simple binary logistic regression model with two features, the decision boundary
formula is: 0 = b0 + b1·x1 + b2·x2.
In this formula, b0, b1, and b2 are the model parameters, and x1 and x2 are the two
features.
You can visualize a decision boundary by fitting a model on the training dataset and then using
the model to make predictions over a grid of data values.
3) Decision Boundaries (example)
 Estimated probabilities and decision boundary (Iris dataset, petal width feature):

 The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm
to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller
petal width, ranging from 1.0 cm to 1.8 cm.
 Therefore, there is a decision boundary at around 1.6 cm where both probabilities are
equal to 50%. If the estimated probability is greater than 0.5 the model predicts
Iris-Virginica, otherwise it predicts not-Virginica.
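This is the classic iris illustration, so it is easy to reproduce. A sketch assuming scikit-learn is installed: it trains a logistic regression on petal width to detect Iris-Virginica and locates the ≈1.6 cm boundary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                      # petal width (cm), a single feature
y = (iris.target == 2).astype(int)        # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities over a range of petal widths
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
proba = log_reg.predict_proba(X_new)[:, 1]

boundary = X_new[proba >= 0.5][0, 0]
print(round(boundary, 2))                 # ~1.6 cm decision boundary
print(log_reg.predict([[1.7], [1.5]]))    # [1 0] -> Virginica, not Virginica
```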
4) Softmax Regression:
 The Logistic Regression model can be generalized to support multiple classes
directly, without having to train and combine multiple binary classifiers.
 This is called Softmax Regression, or Multinomial Logistic Regression.

Softmax score for class k:       s_k(x) = θ(k)ᵀ x

Softmax function:                p̂_k = exp(s_k(x)) / Σ_j exp(s_j(x))

Multinomial Logistic Regression:
The target variable can have 3 or more possible types which are not ordered (i.e. the types
have no quantitative significance), like "disease A" vs "disease B" vs "disease C".
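A tiny NumPy sketch of the softmax function above (the score values are made up):

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores s_k(x) into probabilities that sum to 1."""
    exp_s = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp_s / exp_s.sum()

scores = np.array([2.0, 1.0, 0.1])            # s_k(x) for classes k = 0, 1, 2
print(softmax(scores))                        # ~[0.659, 0.242, 0.099]
print(softmax(scores).sum())                  # 1.0
```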
PART2

DECISION TREES
 The CART Training Algorithm
 Attribute selection measures- Gini impurity,
Entropy.
 Training and Visualizing a Decision Tree
 Making Predictions
 Estimating Class Probabilities
 Regularization Hyperparameters, Regression
Trees, Linear Support vector machines.
DECISION TREE CLASSIFICATION:
Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems.
It is a tree-structured classifier:
 Internal nodes represent the features of a dataset,
 Branches represent the decision rules, and
 Each leaf node represents the outcome.
 A leaf node is called the purest node.

 In a decision tree there are two types of nodes: the Decision Node and the Leaf Node.
 Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
Types of tree algorithms:
 In order to build a tree, we use:
 the CART algorithm, which stands for Classification and Regression Tree algorithm, or
 the Iterative Dichotomiser 3 (ID3) algorithm.

FLOW CHART: (shown as a figure in the original slides)

Example: (shown as a figure in the original slides)
Why use Decision Trees?
 Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
 The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies:


 Root Node:
Root node is from where the decision tree starts. It represents the entire dataset.
 Leaf Node:
Leaf nodes are the final output nodes and cannot be split further.
 Splitting:
Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
 Branch/Sub Tree:
A tree formed by splitting the tree.

 Pruning:
Pruning is the process of removing unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and the other nodes
are called the child nodes.

How the Decision Tree algorithm works:

 Step-1:
Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2:
Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3:
Divide S into subsets that contain the possible values of the best attribute.
 Step-4:
Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in
step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final nodes are then called leaf nodes.
Types of decision trees:
1. Classification Trees: CART, ID3, C4.5 and C5.0
2. Regression Trees: CART

Attribute Selection Measures:

1. Information Gain
2. Gini Index
3. Entropy

Gini Impurity:
It measures the likelihood of an incorrect classification of a new instance if it were
randomly classified according to the class distribution:

   Gini = 1 − Σ (p_i)²
Entropy:
It measures the amount of uncertainty or impurity in a dataset:

   Entropy(S) = − Σ p_i log₂(p_i)

where p_i is the probability of an instance being classified into a particular class.

Information Gain:
It measures the reduction in entropy after the dataset is split on an attribute, i.e. the
expected reduction in entropy from knowing A:

   Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Values(A): the possible values of A
S_v: the subset of S for which A has value v
Iterative Dichotomiser 3 (ID3)
 It aims to build a decision tree by iteratively selecting the best attribute to split the
data, based on information gain.
 Each internal node represents a test on an attribute, and each branch represents a possible
outcome of the test.
 The leaf nodes of the tree represent the final classifications.
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used
primarily for classification tasks.
ID3 uses:
1. Entropy, a measure of impurity or randomness in a dataset:
   E(S) = − Σ (P_i · log₂(P_i))
2. Information gain, which is used to minimize uncertainty or entropy. It indicates how
much useful information an attribute provides for accurately classifying the data:
   IG(S, A) = E(S) − Σ ( |S_v| / |S| ) × E(S_v)
Steps in the ID3 algorithm:
1. Determine the entropy of the overall dataset using the class distribution.
2. For each feature:
   i) Calculate the entropy for each of its categorical values.
   ii) Assess the information gain obtained by splitting on the feature.
3. Choose the feature that generates the highest information gain.
4. Iteratively apply all of the above steps to build the optimal decision tree structure.
Advantages of ID3:
 Simple and easy to understand.
 Requires little training data.
Disadvantages of ID3:
 Can lead to overfitting.
 May not be effective with data with many attributes.
Applications of ID3:
1. Fraud detection.
2. Medical diagnosis.
3. Customer segmentation.
Dataset of Play Tennis(Yes/NO)

Day Outlook Temperature Humidity Wind Play Tennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak NO
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Process to find Entropy and Information gain
Entropy of the dataset:
   Entropy([9+,5-]) = -P+ log₂(P+) - P- log₂(P-)
where Pi is the probability of an instance being classified into a particular class.

Information gain:
   Information gain = Entropy(parent) - Σ ( |Di| / |D| ) · Entropy(Di)
where Di is a subset of D after splitting on an attribute.
Note:
• If the numbers of positive and negative examples are equal, the entropy is exactly 1.0.
• If the number of either positive or negative examples is 0, the entropy is 0.0.

Entropy and Information gain of the whole dataset (attribute: Outlook)

   Entropy([9+,5-]) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940

Outlook   Yes  No
Sunny     2    3
Overcast  4    0
Rain      3    2

   Entropy_Sunny([2+,3-])    = -(2/5) log₂(2/5) - (3/5) log₂(3/5) = 0.971
   Entropy_Overcast([4+,0-]) = 0.0
   Entropy_Rain([3+,2-])     = -(3/5) log₂(3/5) - (2/5) log₂(2/5) = 0.971

Information gain for Outlook
   = Entropy(S) - (5/14)·Entropy(S_Sunny) - (4/14)·Entropy(S_Overcast) - (5/14)·Entropy(S_Rain)
   = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971
   = 0.2464
Entropy and Information gain of Temperature

Temperature  Yes  No
Hot          2    2
Mild         4    2
Cool         3    1

   Entropy_Hot([2+,2-])  = 1.0
   Entropy_Mild([4+,2-]) = -(4/6) log₂(4/6) - (2/6) log₂(2/6) = 0.918
   Entropy_Cool([3+,1-]) = -(3/4) log₂(3/4) - (1/4) log₂(1/4) = 0.811

Information gain for Temperature
   = 0.940 - (4/14)·1.0 - (6/14)·0.918 - (4/14)·0.811
   = 0.029
Entropy and Information gain of Humidity

Humidity  Yes  No
High      3    4
Normal    6    1

   Entropy_High([3+,4-])   = -(3/7) log₂(3/7) - (4/7) log₂(4/7) = 0.985
   Entropy_Normal([6+,1-]) = -(6/7) log₂(6/7) - (1/7) log₂(1/7) = 0.592

Information gain for Humidity
   = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.1515

Entropy and Information gain of Wind

Wind    Yes  No
Weak    6    2
Strong  3    3

   Entropy_Weak([6+,2-])   = -(6/8) log₂(6/8) - (2/8) log₂(2/8) = 0.811
   Entropy_Strong([3+,3-]) = 1.0

Information gain for Wind
   = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048

Maximum information gain is obtained for "Outlook".
Therefore, Outlook becomes the root node.
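These numbers are easy to verify programmatically. A minimal sketch using plain Python and the `math` module; the `data` literal below just re-encodes the PlayTennis table, and the function names are illustrative.

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for the 14 days above
data = [
    ("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),     ("Rainy","Mild","High","Weak","Yes"),
    ("Rainy","Cool","Normal","Weak","Yes"),     ("Rainy","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),     ("Rainy","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),   ("Rainy","Mild","High","Strong","No"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, col):
    total = len(rows)
    gain = entropy(rows)
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

print(round(entropy(data), 3))                        # 0.940
for i, name in enumerate(features):
    print(name, round(information_gain(data, i), 4))  # Outlook has the largest gain (~0.2467)
```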
Decision Tree after finding information gain for the whole dataset

As overcast gives purest node, we consider overcast to be Yes and there is no need
of splitting for Overcast…
Outlook

Sunny Rain
Overcast

Yes
After splitting whole dataset…

Now, we need to split Sunny and Rain since they are still impure nodes.
For getting the purest nodes for both sunny and rain, we need to repeat the same process for the
attributes humidity, temperature and wind.
Based on the maximum information gain, we place the internal nodes as either Humidity or
Temperature or Wind.
We check until the purest nodes enter in the decision tree and reach the leaf nodes.
Outlook

Sunny Overcast Rain

??? Yes ???


Final output: (the complete decision tree is shown as a figure in the original slides)
Problems on Decision Trees:
Consider the following set of training examples:
a) What is the entropy of this collection of training examples with respect to the target function
classification?
b) What is the information gain of a2 relative to these training examples.

CART
ALGORITHM
 Classification and Regression Trees (CART) is a decision tree algorithm that is used for
both classification and regression tasks.
 It is a supervised learning algorithm that learns from labelled data to predict the outcome
of unseen data.
The CART algorithm is a type of classification algorithm that is required to build a
decision tree on the basis of Gini's impurity index.

It works by recursively partitioning the data into smaller and smaller subsets
based on certain criteria(condition).

The goal is to create a tree structure that can accurately classify or predict the
target value for new data points.
Mathematically, we can write Gini Impurity as follows:

   Gini = 1 − Σ (p_i)²

Here p_i indicates the probability of class i.
The CART algorithm uses following systematic approach to build decision trees.
1. Feature Selection:
•Start by evaluating each feature's ability to classify the data set effectively.
•Measure the impurity of attributes using metrics like the Gini Index for classification or
mean squared error for regression.
•Choose the feature and the split point that results in the most significant reduction in
impurity.
2. Splitting:
CART uses a greedy approach to split the data at each node
It evaluates all possible splits and selects the one that best reduces the impurity of the
resulting subsets
•Once the best feature is identified, split the data set into various subsets based on this
attribute..
•This creates child nodes, each representing a subset of the data set based on the selected
feature's value.
3. Tree Building:
• Tree can be created by recursively apply the above two steps for each child node,
considering only the subset of data within that node.
•Continue this process until a stopping criterion is met, such as a maximum tree depth or a
minimum number of samples in a node.
4. Tree Pruning:
•When the full tree is created, the pruning process begins: branches that add little
predictive power are removed to reduce overfitting.

Advantages of CART
 Results are simplistic.
 Classification and regression trees are Nonparametric and Nonlinear.
 Classification and regression trees implicitly perform feature selection.
 Outliers have no meaningful effect on CART.
 It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
 Overfitting.
 High Variance.
 low bias.
 the tree structure may be unstable.
Applications of the CART algorithm
 For quick Data insights.
 In Blood Donors Classification.
 For environmental and ecological data.
 In the financial sectors.

Examples:
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Step 1: Calculate the Gini index for each attribute.

Outlook
Outlook   Yes  No  Number of instances
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5
Gini Index Calculations:
 Gini(Outlook=Sunny)    = 1 - (2/5)^2 - (3/5)^2 = 0.48
 Gini(Outlook=Overcast) = 1 - (4/4)^2 - (0/4)^2 = 0
 Gini(Outlook=Rain)     = 1 - (3/5)^2 - (2/5)^2 = 0.48
Weighted Gini Index for Outlook:
Gini(Outlook) = (5/14) * 0.48 + (4/14) * 0 + (5/14) * 0.48 = 0.342

Temperature
Temperature  Yes  No  Number of instances
Hot          2    2   4
Cool         3    1   4
Mild         4    2   6
Gini Index Calculations:
 Gini(Temp=Hot)  = 1 - (2/4)^2 - (2/4)^2 = 0.5
 Gini(Temp=Cool) = 1 - (3/4)^2 - (1/4)^2 = 0.375
 Gini(Temp=Mild) = 1 - (4/6)^2 - (2/6)^2 = 0.445
Weighted Gini Index for Temperature:
Gini(Temp) = (4/14) * 0.5 + (4/14) * 0.375 + (6/14) * 0.445 = 0.439

Humidity
Humidity  Yes  No  Number of instances
High      3    4   7
Normal    6    1   7
Gini Index Calculations:
 Gini(Humidity=High)   = 1 - (3/7)^2 - (4/7)^2 = 0.489
 Gini(Humidity=Normal) = 1 - (6/7)^2 - (1/7)^2 = 0.244
Weighted Gini Index for Humidity:
Gini(Humidity) = (7/14) * 0.489 + (7/14) * 0.244 = 0.367
Wind
Wind    Yes  No  Number of instances
Weak    6    2   8
Strong  3    3   6
Gini Index Calculations:
 Gini(Wind=Weak)   = 1 - (6/8)^2 - (2/8)^2 = 0.375
 Gini(Wind=Strong) = 1 - (3/6)^2 - (3/6)^2 = 0.5
Weighted Gini Index for Wind:
Gini(Wind) = (8/14) * 0.375 + (6/14) * 0.5 = 0.428

We have calculated the Gini index values for each feature. We select the Outlook feature
because its cost is the lowest.

Feature      Gini index
Outlook      0.342
Temperature  0.439
Humidity     0.367
Wind         0.428
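The weighted Gini values above can be checked with a few lines of plain Python working directly from the count tables (a minimal sketch; the function names are illustrative):

```python
def gini(yes, no):
    """Gini impurity of a node containing `yes` positive and `no` negative examples."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(groups):
    """groups: list of (yes, no) counts, one pair per value of the attribute."""
    total = sum(y + n for y, n in groups)
    return sum(((y + n) / total) * gini(y, n) for y, n in groups)

# Counts taken from the tables above
print(round(weighted_gini([(2, 3), (4, 0), (3, 2)]), 3))   # Outlook     ~0.343
print(round(weighted_gini([(2, 2), (3, 1), (4, 2)]), 3))   # Temperature ~0.440
print(round(weighted_gini([(3, 4), (6, 1)]), 3))           # Humidity    ~0.367
print(round(weighted_gini([(6, 2), (3, 3)]), 3))           # Wind        ~0.429
```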
Sunny dataset
Day  Outlook  Temp.  Humidity  Wind    Decision
1    Sunny    Hot    High      Weak    No
2    Sunny    Hot    High      Strong  No
8    Sunny    Mild   High      Weak    No
9    Sunny    Cool   Normal    Weak    Yes
11   Sunny    Mild   Normal    Strong  Yes

Gini of Temperature for Sunny Outlook
Temperature  Yes  No  Number of instances
Hot          0    2   2
Cool         1    0   1
Mild         1    1   2
Gini Index Calculations:
 Gini(Outlook=Sunny and Temp=Hot)  = 1 - (0/2)^2 - (2/2)^2 = 0
 Gini(Outlook=Sunny and Temp=Cool) = 1 - (1/1)^2 - (0/1)^2 = 0
 Gini(Outlook=Sunny and Temp=Mild) = 1 - (1/2)^2 - (1/2)^2 = 0.5
Weighted Gini Index for Temperature (Sunny Outlook):
Gini(Outlook=Sunny and Temp) = (2/5) * 0 + (1/5) * 0 + (2/5) * 0.5 = 0.2
Gini of Humidity for Sunny Outlook
Humidity  Yes  No  Number of instances
High      0    3   3
Normal    2    0   2
Gini Index Calculations:
 Gini(Outlook=Sunny and Humidity=High)   = 1 - (0/3)^2 - (3/3)^2 = 0
 Gini(Outlook=Sunny and Humidity=Normal) = 1 - (2/2)^2 - (0/2)^2 = 0
Weighted Gini Index for Humidity (Sunny Outlook):
Gini(Outlook=Sunny and Humidity) = (3/5) * 0 + (2/5) * 0 = 0

Gini of Wind for Sunny Outlook
Wind    Yes  No  Number of instances
Weak    1    2   3
Strong  1    1   2
Gini Index Calculations:
 Gini(Outlook=Sunny and Wind=Weak)   = 1 - (1/3)^2 - (2/3)^2 = 0.444
 Gini(Outlook=Sunny and Wind=Strong) = 1 - (1/2)^2 - (1/2)^2 = 0.5
Weighted Gini Index for Wind (Sunny Outlook):
Gini(Outlook=Sunny and Wind) = (3/5) * 0.444 + (2/5) * 0.5 = 0.466

Decision for Sunny Outlook
Feature      Gini index
Temperature  0.2
Humidity     0
Wind         0.466
We have calculated the Gini index values for each feature. We select the Humidity feature
because its cost is the lowest.
Gini of Temperature for Rainy Outlook
Temperature  Yes  No  Number of instances
Cool         1    1   2
Mild         2    1   3
Gini Index Calculations:
 Gini(Outlook=Rainy and Temp=Cool) = 1 - (1/2)^2 - (1/2)^2 = 0.5
 Gini(Outlook=Rainy and Temp=Mild) = 1 - (2/3)^2 - (1/3)^2 = 0.444
Weighted Gini Index for Temperature (Rainy Outlook):
Gini(Outlook=Rainy and Temp) = (2/5) * 0.5 + (3/5) * 0.444 = 0.466

Gini of Humidity for Rainy Outlook
Humidity  Yes  No  Number of instances
High      1    1   2
Normal    2    1   3
Gini Index Calculations:
 Gini(Outlook=Rainy and Humidity=High)   = 1 - (1/2)^2 - (1/2)^2 = 0.5
 Gini(Outlook=Rainy and Humidity=Normal) = 1 - (2/3)^2 - (1/3)^2 = 0.444
Weighted Gini Index for Humidity (Rainy Outlook):
Gini(Outlook=Rainy and Humidity) = (2/5) * 0.5 + (3/5) * 0.444 = 0.466
Gini of Wind for Rainy Outlook
Wind    Yes  No  Number of instances
Weak    3    0   3
Strong  0    2   2
Gini Index Calculations:
 Gini(Outlook=Rainy and Wind=Weak)   = 1 - (3/3)^2 - (0/3)^2 = 0
 Gini(Outlook=Rainy and Wind=Strong) = 1 - (0/2)^2 - (2/2)^2 = 0
Weighted Gini Index for Wind (Rainy Outlook):
Gini(Outlook=Rainy and Wind) = (3/5) * 0 + (2/5) * 0 = 0

Decision for Rainy Outlook
Feature      Gini index
Temperature  0.466
Humidity     0.466
Wind         0

We have calculated the Gini index values for each feature. We select the Wind feature
because its cost is the lowest.
EXAMPLE OF CART ALGORITHM:
(The resulting tree, with True/False branches and leaf nodes, is shown as a figure in the original slides.)
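scikit-learn's `DecisionTreeClassifier` implements a CART-style tree (Gini impurity by default). A minimal sketch on the one-hot-encoded PlayTennis data, assuming scikit-learn and pandas are available; the column handling is illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                 "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temp":     ["Hot","Hot","Hot","Mild","Cool","Cool","Cool","Mild","Cool",
                 "Mild","Mild","Mild","Hot","Mild"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal","High",
                 "Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong","Weak",
                 "Weak","Weak","Strong","Strong","Weak","Strong"],
    "Play":     ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(df.drop(columns="Play"))      # one-hot encode the categorical features
y = df["Play"]

tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the learned splits
```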
Regression Trees
Regression Trees are a type of decision tree used when the target variable is continuous
(numerical).
To build a regression tree, splits are chosen using the reduction in standard deviation of the
target values. The formula for the standard deviation (SD) is:

   SD = sqrt( Σ (x_i − x̄)² / n )
EXAMPLE:
(The dataset — attributes Assessment, Assignment, Project and the continuous target Result — is shown as a figure in the original slides.)
Step 1:
Compute the standard deviation of all the target values (Result).

   Average of Result = 75
   SD(Result) = 15.69
Step 2: Calculate the standard deviation for each feature.

Attribute: Assessment
   SD(Assessment = Good)    = 9.74
   SD(Assessment = Average) = 9.00
   SD(Assessment = Poor)    = 10.00

The weighted standard deviation is:
   WSD(Assessment) = 9.57

Standard deviation reduction:
   RSD(Assessment) = 15.69 − 9.57 = 6.12
Attribute: Assignment
   SD(Assignment = Yes) = 13.40
   SD(Assignment = No)  = 13.14

The weighted standard deviation is:
   WSD(Assignment) = 13.27

Standard deviation reduction:
   RSD(Assignment) = 15.69 − 13.27 = 2.42
Attribute: Project
   SD(Project = Yes) = 11.54
   SD(Project = No)  = 11.59

The weighted standard deviation is:
   WSD(Project) = 11.56

Standard deviation reduction:
   RSD(Project) = 15.69 − 11.56 = 4.13
            RSD
Assessment  6.12
Assignment  2.42
Project     4.13

According to the highest standard deviation reduction (RSD), we select Assessment as the
root node.

DECISION TREE: (shown as a figure in the original slides)
Now that the root node is declared, we continue the process until the tree is formed.

For the branch Assessment = Good:
   SD(Good) = 9.74

Attribute: Assignment
   Assignment = Yes: values 95, 98, 89  → Average = 94,  SD = 3.74
   Assignment = No:  values 75, 75      → Average = 75,  SD = 0
The weighted standard deviation is:
   WSD(Assignment | Good) = (3/5)·3.74 + (2/5)·0 = 2.244
Standard deviation reduction:
   RSD(Assignment | Good) = 9.74 − 2.244 = 7.496

Attribute: Project
   Project = Yes: values 95, 75, 98, 89  → Average = 89.25,  SD = 8.84
   Project = No:  value 75               → Average = 75,     SD = 0

The weighted standard deviation is:
   WSD(Project | Good) = (4/5)·8.84 + (1/5)·0 = 7.072
Standard deviation reduction:
   RSD(Project | Good) = 9.74 − 7.072 = 2.668
            RSD
Assignment  7.50
Project     2.67

According to the highest standard deviation reduction (RSD), we select Assignment as the
child node under Assessment = Good.
(The updated decision tree is shown as a figure in the original slides.)

For the branch Assessment = Average:
The remaining features are examined in the same way; splitting on Project gives
   Project = Yes → 80
   Project = No  → 58, 70
FINAL TREE

Assessment
  Good     -> Assignment:  Yes -> 95, 98, 89   |  No -> 75, 75
  Average  -> Project:     Yes -> 80           |  No -> 70, 58
  Poor     -> (split shown in the original figure):  Yes -> 65  |  No -> 45
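The standard-deviation-reduction calculation can be expressed as a small helper. This is a sketch with NumPy; the `result` and `assessment` lists below are reconstructed from the leaf values in the final tree above, and the function name is illustrative.

```python
import numpy as np

def sd_reduction(target, feature_values):
    """Reduction in standard deviation obtained by splitting `target` on `feature_values`."""
    target = np.asarray(target, dtype=float)
    feature_values = np.asarray(feature_values)
    weighted_sd = sum(
        (np.sum(feature_values == v) / len(target)) * target[feature_values == v].std()
        for v in np.unique(feature_values)
    )
    return target.std() - weighted_sd

# Result values grouped by Assessment, read off the leaves of the final tree
result     = [95, 98, 89, 75, 75, 80, 70, 58, 65, 45]
assessment = ["Good"] * 5 + ["Average"] * 3 + ["Poor"] * 2

print(round(np.std(result), 2))                     # ~15.7  (SD of all Result values)
print(round(sd_reduction(result, assessment), 2))   # ~6.12  (RSD for Assessment)
```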
EXAMPLE FOR REGRESSION TREES: (a further exercise is shown as a figure in the original slides)
ANSWER: (shown as a figure in the original slides)

Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
 For more class labels, the computational complexity of the decision tree may
increase.
Support Vector Machine:

 A Support Vector Machine (SVM) is a supervised Learning algorithm used for


both classification and regression tasks.
 While it can be applied to regression problems, SVM is best suited
for classification tasks.
 A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and even
outlier detection.

 SVMs are particularly well suited for classification of complex but small- or medium-
sized datasets.
 A Support Vector Machine (SVM) performs classification by finding the hyperplane
that maximizes the margin between the two classes.
 A hyperplane is a decision boundary that differentiates the two classes in SVM.

GRAPH FOR LINEAR AND NON-LINEAR PROBLEMS: (shown as a figure in the original slides)
 The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

 SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
a Support Vector Machine.
Types of Support Vector Machines:
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs are
very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space,
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.

1) Linear SVM Classification

Hard Margin SVM (strict; can overfit when the data contain outliers)

 In a hard margin SVM, the objective is to identify a hyperplane that completely
separates data points belonging to different classes, ensuring a clear demarcation with
the largest margin width possible.
 This margin is the distance between the hyperplane and the nearest data points, also
known as the support vectors.
 The hyperplane equation plays a crucial role in hard margin SVMs because it defines the
boundary that separates the classes.
 Ideally, we want this boundary to have a maximum margin from the nearest data points
of each class.
 Mathematically, for a linearly separable dataset, the decision constraint of a hard
margin SVM can be expressed as:

   y_i (xᵀω + b) ≥ 1

 Where:
 ω is the weight vector perpendicular to the hyperplane, x is the input feature vector, and
b is the bias term.
 The hyperplane equation in a hard margin SVM defines the decision boundary that separates
the data points of the different classes.
 The decision boundary is determined by the hyperplane equation xᵀω + b = 0.
 The margin is given by the distance between the hyperplane and the closest data point of
each class, which can be computed as:

   margin = 1 / ‖ω‖
2) Soft Margin Classification (more flexible; too much slack can underfit)
 The objective is to find a good balance between keeping the street (margin) as large as
possible and limiting the margin violations.
 Soft Margin SVM introduces flexibility by allowing some margin violations
(misclassifications) to handle cases where the data is not perfectly separable.
 It is suitable for scenarios where the data may contain noise or outliers. It introduces a
penalty term for misclassifications, allowing a trade-off between a wider margin
and a few misclassifications.
 Soft margin SVM allows some margin violations, meaning that it permits
certain data points to fall within the margin or even on the wrong side of the
decision boundary.
 This flexibility is controlled by a factor C, also called the "regularization
parameter", which balances making the margin as large as possible against
reducing classification mistakes.
 Mathematically, the decision constraint of a soft margin SVM can be defined as:

   y_i (xᵀω + b) ≥ 1 − ξ_i

Where:
   ξ_i are slack variables representing the margin violations, and y_i are the target labels.

 The term (1 − ξ_i) represents the minimum required margin for each data point.

 The objective function of a soft margin SVM combines margin maximization with
a penalty term for margin violations; it minimizes:

   (1/2) ‖ω‖² + C Σ_{i=1}^{N} ξ_i
Hard Margin vs Soft Margin in SVM

Criteria             Hard Margin                          Soft Margin
Objective function   Maximize the margin.                 Maximize the margin and minimize margin violations.
Handling noise       Sensitive; requires perfectly        Robust; handles noisy data by allowing
                     linearly separable data.             margin violations.
Regularization       Not applicable; no regularization    Controlled by the regularization
                     parameter.                           parameter C.
Complexity           Simple, computationally efficient.   May require more computational resources.
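A minimal scikit-learn sketch of a soft-margin linear SVM (assuming scikit-learn is installed; the toy blobs and the value of C are illustrative — a smaller C allows more margin violations, while a larger C approaches hard-margin behaviour):

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Two roughly separable clusters with a little overlap
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

# C is the regularization parameter of the soft margin objective (1/2)||w||^2 + C * sum(xi_i)
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm_clf.fit(X, y)

print(svm_clf.predict(X[:5]))     # predicted classes for the first few points
print(svm_clf.score(X, y))        # training accuracy
```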
Non-Linear SVM:
 If the data are linearly arranged, we can separate them with a straight line, but for non-
linear data we cannot draw a single straight line.

 So to separate these data points, we need to add one more dimension.
 For linear data we have used two dimensions, x and y, so for non-linear data we add a
third dimension z. It can be calculated as:

   z = x² + y²

 By adding the third dimension, the sample space becomes as shown in the figure.
So now SVM will divide the dataset into classes as shown in the figure.

Since we are in 3-D space, the separating surface looks like a plane parallel to the x-axis.
If we convert it back to 2-D space with z = 1, it becomes:

   x² + y² = 1

Hence we get a circle (circumference) of radius 1 as the decision boundary in the case of
non-linear data.
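The same idea can be demonstrated in code. This sketch (assuming scikit-learn and NumPy are available) shows both routes: adding the explicit z = x² + y² feature so a linear SVM can separate the lifted data, versus using an RBF-kernel SVM directly on the 2-D data; the `make_circles` toy dataset stands in for the concentric classes in the figure.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC, LinearSVC

# Concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=42)

# Route 1: add the explicit third dimension z = x^2 + y^2, then a linear SVM suffices
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_lifted = np.hstack([X, z])
print(LinearSVC().fit(X_lifted, y).score(X_lifted, y))   # ~1.0

# Route 2: the RBF kernel performs an implicit non-linear mapping for us
print(SVC(kernel="rbf").fit(X, y).score(X, y))           # ~1.0
```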
