21CS54 (AI/ML) Prof. Thameeza
MODULE 4
DECISION TREE LEARNING & BAYESIAN LEARNING
Introduction
Why is it called a decision tree?
Because it starts from a root node and branches out to find a number of possible solutions.
The benefits of having a decision tree are as follows:
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Example: a toll-free number.
Structure of a Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
The topmost node in the tree is the root node. Decision trees apply to both classification and regression models.
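To make this concrete, here is a minimal Python sketch (not from the text) of such a structure: internal nodes test an attribute, each branch maps a test outcome to a child node, and leaves hold a class label. The 'CGPA'/'Job Offer' names are illustrative placeholders.

class Node:
    """Internal nodes test an attribute; leaves hold a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at this internal node
        self.label = label           # class label if this is a leaf
        self.branches = {}           # test outcome -> child Node

    def is_leaf(self):
        return self.label is not None

def classify(node, instance):
    # Walk from the root, following the branch matching each test outcome.
    while not node.is_leaf():
        node = node.branches[instance[node.attribute]]
    return node.label

# Hypothetical tree: the root tests 'CGPA'; leaves hold 'Job Offer' labels.
root = Node(attribute='CGPA')
root.branches['>= 9'] = Node(label='Yes')
root.branches['< 9'] = Node(label='No')
print(classify(root, {'CGPA': '>= 9'}))   # -> Yes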
Figure 6.1 shows the symbols that are used to represent different nodes in the construction of a decision tree.
Decision networks are also called influence diagrams.
The decision tree consists of two major procedures:
1) Building a tree and
2) Knowledge inference or classification.
Building the Tree
Knowledge Inference or Classification
Advantages of Decision Trees
Disadvantages of Decision Trees
Fundamentals of Entropy
Given a training dataset with a set of attributes, the decision tree is constructed
by finding the attribute that best describes the target class for the given test
instances.
The best split attribute is the one which, among all features, contains the most information about how to split the dataset so that the target class is accurately identified for the test instances.
The split should be as pure as possible at every stage of selecting the best feature; a pure subset is one that contains instances of only one class.
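As a quick illustration of how purity is measured, the following Python snippet (a sketch, not the textbook's code) computes the entropy of a class distribution; the counts 7 and 3 match the 'Job Offer' example used later.

import math

def entropy(counts):
    # Entropy (in bits) of a class distribution, given per-class counts.
    # A pure node (all instances in one class) has entropy 0.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# For a target class with 7 'Yes' and 3 'No' instances:
print(round(entropy([7, 3]), 4))   # 0.8813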
Algorithm 6.1: General Algorithm for Decision Trees
Decision tree induction algorithms
ID3 Tree Construction (ID3 stands for Iterative Dichotomiser 3)
A decision tree is one of the most powerful tools of supervised learning, used for both classification and regression tasks.
ID3 builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
It is constructed by recursively splitting the training data into subsets based on
the values of the attributes until a stopping criterion is met, such as the
maximum depth of the tree or the minimum number of samples required to split
a node.
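The recursive construction just described can be sketched compactly in Python. This is an illustrative ID3-style sketch, not the textbook's Algorithm 6.1: it uses information gain to pick the split attribute and stops at pure nodes or when attributes run out (extra stopping criteria such as maximum depth are omitted).

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from splitting on attr.
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    # Returns a class label (leaf) or a pair (best_attr, {value: subtree}).
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for v in set(r[best] for r in rows):
        sub = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        rest = [a for a in attrs if a != best]
        branches[v] = id3(list(sub_rows), list(sub_labels), rest)
    return (best, branches)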
Step 1: Calculate the Entropy for the target class ‘Job Offer’.
Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] = 0.8813
Iteration 1:
Step 2: Calculate Entropy_Info and Gain (information gain) for each of the attributes in the training dataset.
Step 3: From Table 6.8, choose the attribute for which the entropy is minimum, and therefore the gain is maximum, as the best split attribute.
Iteration 2:
Example: Construct a decision tree for the training dataset below using the ID3 algorithm.
C4.5 Construction
C4.5 is a widely used algorithm for constructing decision trees from a dataset.
The disadvantages of ID3 are: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit.
To overcome these disadvantages, Ross Quinlan, the inventor of ID3, made improvements for these bottlenecks and created a new algorithm named C4.5.
The algorithm can now create more generalized models, including for continuous data, and can handle missing data. It also works with discrete data and supports post-pruning.
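C4.5 replaces plain information gain with the gain ratio, which normalizes the gain by the split information of the attribute. A self-contained Python sketch of that computation (illustrative, assuming rows are dictionaries of attribute values) might look like this:

import math
from collections import Counter

def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    # Gain ratio = information gain / split information of the attribute.
    n = len(labels)
    remainder, split_info = 0.0, 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        p = len(subset) / n
        remainder += p * class_entropy(subset)
        split_info -= p * math.log2(p)
    gain = class_entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0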
Example: Construct a decision tree for Table 6.3 using C4.5.
Iteration 1:
Step 1: Calculate the Entropy for the target class ‘Job Offer’.
Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] = 0.8813
Step 2: Calculate Entropy_Info, Gain (information gain), Split_Info, and Gain Ratio for each of the attributes in the training dataset.
Step 3: Choose the attribute for which the gain ratio is maximum as the best split attribute.
The final decision tree is shown in figure below.
Example 2:
Dealing with Continuous Attributes in C4.5
Similarly, the calculations are done for each of the distinct values of the attribute CGPA and a table is created. The value of CGPA with the maximum gain is then chosen as the threshold value, or best split point. From Table 6.13, we can observe that CGPA = 7.9 has the maximum gain of 0.4462.
Hence, 7.9 is chosen as the split point. We can now discretize the continuous values of CGPA into two categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in Table 6.14.
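The threshold-selection procedure described above can be sketched as follows. This is an illustrative Python version, assuming a binary target: candidate thresholds lie between distinct sorted values, and the one with the highest information gain is returned.

import math

def _entropy2(pos, neg):
    # Binary-class entropy from positive/negative counts.
    out = 0.0
    for c in (pos, neg):
        if c:
            p = c / (pos + neg)
            out -= p * math.log2(p)
    return out

def best_threshold(values, labels, positive='Yes'):
    # Returns the (threshold, gain) pair maximizing information gain when the
    # data is split into value <= threshold versus value > threshold.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, lab in pairs if lab == positive)
    total_neg = n - total_pos
    base = _entropy2(total_pos, total_neg)
    best = (None, -1.0)
    left_pos = left_neg = 0
    for i in range(n - 1):
        if pairs[i][1] == positive:
            left_pos += 1
        else:
            left_neg += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                         # only split between distinct values
        gain = base \
            - (i + 1) / n * _entropy2(left_pos, left_neg) \
            - (n - i - 1) / n * _entropy2(total_pos - left_pos, total_neg - left_neg)
        if gain > best[1]:
            best = (pairs[i][0], gain)
    return best

# Made-up CGPA values and Job Offer labels, just to exercise the function:
print(best_threshold([6.8, 7.9, 8.5, 9.1], ['No', 'No', 'Yes', 'Yes']))   # (7.9, 1.0)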
Classification and Regression Trees (CART) Construction
Classification and Regression Trees (CART) is a widely used algorithm for constructing decision trees that can be applied to both classification and regression tasks. CART is similar to C4.5 but has some differences in its construction and splitting criteria.
It constructs the tree as a binary tree by recursively splitting a node into two child nodes, even if an attribute has more than two possible values. The Gini index is calculated for all binary subsets of attribute values, and the subset which yields the maximum value (i.e., the largest reduction in impurity) is selected as the best split subset.
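A minimal Python sketch of the Gini computation for a binary split (illustrative numbers; the 7/3 class counts echo the 'Job Offer' dataset, while the child counts are made up):

def gini(counts):
    # Gini impurity of a node with the given class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_reduction(parent_counts, left_counts, right_counts):
    # Impurity reduction achieved by a binary split, as used by CART.
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    weighted = (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)
    return gini(parent_counts) - weighted

# Hypothetical binary split of 10 instances (7 Yes / 3 No):
print(round(gini([7, 3]), 4))                             # 0.42
print(round(gini_reduction([7, 3], [5, 0], [2, 3]), 4))   # 0.18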
Example: Construct a decision tree for Table 6.3 using CART.
Regression Trees
Homework Problems
VALIDATING AND PRUNING OF DECISION TREES
Validating and pruning decision trees is a crucial part of building accurate and
robust machine learning models.
Decision trees are prone to overfitting, which means they can learn to capture noise and details in the training data that do not generalize well to new, unseen data.
Validation and pruning are techniques used to mitigate this issue and improve
the performance of decision tree models.
The pre-pruning technique for decision trees tunes the hyperparameters prior to the training pipeline. It involves the heuristic known as 'early stopping', which stops the growth of the decision tree, preventing it from reaching its full depth.
It stops the tree-building process to avoid producing leaves with small samples.
During each stage of the splitting of the tree, the cross-validation error is monitored.
If the value of the error does not decrease anymore, then we stop the growth of the decision tree.
The hyperparameters that can be tuned for early stopping and preventing overfitting are max_depth, min_samples_leaf, and min_samples_split.
These same parameters can also be tuned to get a robust model, as the sketch below illustrates.
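As a hedged illustration (using scikit-learn and its built-in iris dataset rather than anything from the text), pre-pruning amounts to setting these hyperparameters before fitting:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pre_pruned = DecisionTreeClassifier(
    max_depth=4,            # stop growing beyond this depth
    min_samples_split=10,   # do not split a node with fewer than 10 samples
    min_samples_leaf=5,     # never produce a leaf with fewer than 5 samples
    random_state=0,
).fit(X, y)
print(pre_pruned.get_depth())   # bounded by max_depth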
Post-pruning does the opposite of pre-pruning and allows the decision tree model to grow to its full depth. Once the model grows to its full depth, tree branches are removed to prevent the model from overfitting.
The algorithm will continue to partition the data into smaller subsets until the final subsets produced are similar in terms of the outcome variable.
The final subsets of the tree will consist of only a few data points, allowing the tree to learn the training data to a T. However, when a new data point that differs from the learned data is introduced, it may not be predicted well.
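One common way to realize post-pruning in practice is scikit-learn's cost-complexity pruning; the sketch below (illustrative, on the iris dataset) grows a full tree, extracts the candidate pruning strengths, and keeps the pruned tree that scores best on held-out data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree to full depth, then extract candidate pruning strengths.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the pruned tree that validates best.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(best.get_depth(), round(best.score(X_test, y_test), 3))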
Chapter 8
Bayesian Learning
8.1 Introduction to probability-based learning
8.2 Fundamentals of Bayes theorem
8.3 Classification using Bayes model
8.3.1 Naïve Bayes Algorithm
8.3.2 Brute Force Bayes Algorithm
8.3.3 Bayes optimal classifier
8.3.4 Gibbs Algorithm
8.4 Naïve Bayes Algorithm for continuous attributes
8.5 Other popular types of Naïve Bayes classifiers
Bayesian learning
Bayesian learning is a learning method that describes and represents knowledge in an uncertain domain and provides a way to reason about this knowledge using probability measures.
It uses Bayes theorem to infer the unknown parameters of a model. Bayesian
inference is useful in many applications which involve reasoning and
diagnosis such as game theory, medicine, etc.
Bayesian inference is much more powerful in handling missing data and
for estimating any uncertainty in predictions.
8.1 Introduction to probability-based learning
Probability-based learning is one of the most important practical learning
methods which combines prior knowledge or prior probabilities with
observed data.
Probabilistic learning uses the concept of probability theory that describes
how to model randomness, uncertainty, and noise to predict future events.
It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, to predict, and to learn from data. In a probabilistic model, randomness plays a major role and the solution is given as a probability distribution, while in a deterministic model there is no randomness: the same initial conditions produce the same single outcome every time the model is run.
Bayesian learning differs from probabilistic learning as it uses subjective probabilities (i.e., probabilities based on an individual's belief or interpretation about the outcome of an event, which can change over time) to infer the parameters of a model.
Two practical learning algorithms called Naive Bayes learning and
Bayesian Belief Network (BBN) form the major part of Bayesian learning.
These algorithms use prior probabilities and apply Bayes rule to infer useful information.
8.2 Fundamentals of Bayes theorem
The Naive Bayes model relies on Bayes theorem.
It works on the principle of three kinds of probabilities: prior probability, likelihood probability, and posterior probability.
Prior Probability
It is the general probability of an uncertain event before an observation is seen
or some evidence is collected.
It is the initial probability that is believed before any new information
is collected.
Likelihood Probability
Likelihood probability is the relative probability of the observation occurring
for each class or the sampling density for the evidence given the hypothesis.
It is stated as P(Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence given the parameters.
Posterior Probability
It is the updated or revised probability of an event taking into account
the observations from the training data.
P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data.
Therefore, informally,
Posterior probability = prior probability updated with new evidence
8.3 Classification using Bayes model
Naive Bayes classification models work on the principle of Bayes theorem.
Bayes rule is a mathematical formula used to determine the posterior probability, given the prior probabilities of events. Generally, Bayes theorem is used to select the most probable hypothesis from the data, considering both prior knowledge and posterior distributions.
It is based on the calculation of the posterior probability and is stated as:
P(Hypothesis h | Evidence E) = [P(Evidence E | Hypothesis h) × P(Hypothesis h)] / P(Evidence E)    ... (8.1)
where Hypothesis h is the target class to be classified and Evidence E is the given test instance.
P(Hypothesis h) is the prior probability of the hypothesis h without observing the training data or considering any evidence.
It denotes the prior belief or the initial probability that the hypothesis h is correct.
P(Evidence E) is the prior probability of the evidence E from the training dataset, without any knowledge of which hypothesis holds. It is also called the marginal probability.
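A small worked example may help tie these three probabilities together. The numbers below are made up for illustration: a prior of 0.3 is revised upward once evidence that is more likely under the hypothesis is observed.

# Made-up numbers: prior P(h) = 0.3, likelihood P(E|h) = 0.8, P(E|not h) = 0.2.
p_h = 0.3
p_e_given_h = 0.8
p_e_given_not_h = 0.2

# Marginal probability of the evidence (law of total probability).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes rule: the prior 0.3 is revised upward by the evidence.
posterior = p_e_given_h * p_h / p_e
print(round(posterior, 4))   # 0.6316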
Maximum A Posteriori (MAP) Hypothesis, hMAP
Given a set of candidate hypotheses, the hypothesis which has the maximum posterior value is considered the maximum probable hypothesis or most probable hypothesis.
This most probable hypothesis is called the Maximum A Posteriori Hypothesis, hMAP. Bayes theorem Eq. (8.1) can be used to find hMAP:
hMAP = argmax_h P(h | E) = argmax_h P(E | h) P(h)    ... (8.2)
Maximum Likelihood (ML) Hypothesis, hML
Given a set of candidate hypotheses, if every hypothesis is equally probable, only P(E | h) is used to find the most probable hypothesis.
The hypothesis that gives the maximum likelihood P(E | h) is called the Maximum Likelihood (ML) Hypothesis, hML:
hML = argmax_h P(E | h)
Correctness of Bayes theorem
One related concept of Bayes theorem is the principle of Minimum
Description Length (MDL).
The minimum description length (MDL) principle is yet another
powerful method like Occam's razor principle to perform inductive
inference.
It states that the best and most probable hypothesis is chosen for a set of observed data as the one with the minimum description. Recall from Eq. (8.2) the Maximum A Posteriori (MAP) Hypothesis, hMAP, which says that, given a set of candidate hypotheses, the hypothesis which has the maximum posterior value is considered the maximum probable hypothesis or most probable hypothesis.
Naive Bayes algorithm uses the Bayes theorem and applies this MDL
principle to find the best hypothesis for a given problem.
8.3.1 Naïve Bayes Algorithm
It is a supervised binary-class or multi-class classification algorithm that works on the principle of Bayes theorem.
There is a family of Naive Bayes classifiers based on a common principle.
These algorithms classify datasets whose features are assumed to be independent, with each feature given equal weightage. The approach works particularly well for large datasets and is very fast.
It is one of the most effective and simple classification algorithms.
This algorithm considers all features to be independent of each other, even though they may individually be dependent on the classified object.
Each feature contributes a probability value independently during classification, and hence this algorithm is called 'naive'.
Some important applications of these algorithms are text classification, recommendation systems, and face recognition.
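A compact sketch of the training and classification steps (illustrative Python, not the textbook's algorithm; it omits Laplace smoothing, and the CGPA/Interactiveness rows are made-up stand-ins for the table data):

from collections import Counter, defaultdict

def train_nb(rows, labels):
    priors = Counter(labels)                  # class counts
    counts = defaultdict(Counter)             # (feature, class) -> value counts
    for row, c in zip(rows, labels):
        for f, v in row.items():
            counts[(f, c)][v] += 1
    return priors, counts, len(labels)

def predict_nb(model, instance):
    priors, counts, n = model
    scores = {}
    for c, c_count in priors.items():
        score = c_count / n                        # prior P(class)
        for f, v in instance.items():
            score *= counts[(f, c)][v] / c_count   # likelihood P(value | class)
        scores[c] = score                          # posterior, up to a constant factor
    return max(scores, key=scores.get)

# Made-up training rows in the spirit of the CGPA / Job Offer example:
rows = [{'CGPA': '>= 9', 'Interactiveness': 'Yes'},
        {'CGPA': '< 8',  'Interactiveness': 'No'},
        {'CGPA': '>= 9', 'Interactiveness': 'Yes'}]
labels = ['Yes', 'No', 'Yes']
model = train_nb(rows, labels)
print(predict_nb(model, {'CGPA': '>= 9', 'Interactiveness': 'Yes'}))   # Yes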
As explained earlier, the likelihood probability is stated as the sampling density for the evidence given the hypothesis.
It is denoted as P(Evidence | Hypothesis), which says how likely the occurrence of the evidence is, given the parameters.
It is calculated as the number of instances of each attribute value for a given class value, divided by the number of instances with that class value.
For example, P(CGPA ≥ 9 | Job Offer = Yes) denotes the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' divided by the total number of instances with 'Job Offer = Yes'.
From Table 8.3, the frequency matrix of CGPA, the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' is 3.
The total number of instances with 'Job Offer = Yes' is 7.
Hence, P(CGPA ≥ 9 | Job Offer = Yes) = 3/7.
Similarly, the likelihood probability is calculated for all attribute values of the feature CGPA.
8.3.2 Brute Force Bayes Algorithm
Applying Bayes theorem, the Brute Force Bayes algorithm relies on the idea of concept learning: given a hypothesis space H for a training dataset T, the algorithm computes the posterior probabilities for all the hypotheses hi ∈ H.
Then, the Maximum A Posteriori (MAP) Hypothesis, hMAP, is used to output the hypothesis with the maximum posterior probability. The algorithm is quite expensive since it requires computations for all the hypotheses.
Although computing posterior probabilities this way is inefficient, the idea is applied in various other algorithms, which is also quite interesting.
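In code, the brute-force idea reduces to scoring every hypothesis and returning the argmax; the sketch below is illustrative, with made-up priors and likelihoods, and drops the common marginal P(T) since it does not affect the argmax.

def h_map(hypotheses, prior, likelihood, data):
    # P(h | T) is proportional to P(T | h) * P(h); the marginal P(T) is a
    # common factor across hypotheses, so the argmax ignores it.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Made-up hypothesis space with assumed priors and likelihoods:
hypotheses = ['h1', 'h2', 'h3']
prior = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}.get
likelihood = lambda data, h: {'h1': 0.1, 'h2': 0.6, 'h3': 0.4}[h]
print(h_map(hypotheses, prior, likelihood, data=None))   # h2 (0.3 * 0.6 = 0.18)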
8.3.3 Bayes Optimal Classifier
The Bayes optimal classifier is a probabilistic model which, in fact, uses Bayes theorem to find the most probable classification for a new instance, given the training data, by combining the predictions of all the hypotheses weighted by their posterior probabilities.
This is different from the Maximum A Posteriori (MAP) Hypothesis, hMAP, which chooses only the single most probable hypothesis.
Here, a new instance can be classified to a possible classification value Ci by the following:
C = argmax_{Ci} Σ_{hi ∈ H} P(Ci | hi) P(hi | T)
hMAP chooses h1, which has the maximum posterior probability value 0.3, as the solution, and gives the result that the patient is COVID negative.
But the Bayes optimal classifier combines the predictions of h2, h3 and h4, whose posteriors sum to 0.4, and gives the result that the patient is COVID positive.
Therefore, max_{Ci ∈ {COVID Positive, COVID Negative}} Σ_{hi ∈ H} P(Ci | hi) P(hi | T) = COVID Positive.
Thus, the algorithm diagnoses the new instance to be COVID positive.
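The combination step can be sketched as follows. The individual posteriors for h2, h3 and h4 are assumptions chosen only to sum to 0.4 as in the example; h1's value of 0.3 and the predicted labels follow the text.

# Assumed posteriors: h1 is 0.3 per the example; h2, h3, h4 are made up to sum to 0.4.
posteriors = {'h1': 0.3, 'h2': 0.2, 'h3': 0.1, 'h4': 0.1}
predictions = {'h1': 'COVID Negative', 'h2': 'COVID Positive',
               'h3': 'COVID Positive', 'h4': 'COVID Positive'}

# Sum P(Ci | hi) * P(hi | T) per class; here each hi predicts one class outright.
scores = {}
for h, p in posteriors.items():
    scores[predictions[h]] = scores.get(predictions[h], 0.0) + p
print(max(scores, key=scores.get))   # COVID Positive (0.4 beats 0.3)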
8.3.4 Gibbs Algorithm
The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all hypotheses in the hypothesis space and then combines the predictions to classify a new instance.
The Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the hypothesis space according to the posterior probability distribution and uses it to classify a new instance.
It can be shown that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes optimal classifier.
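A minimal sketch of the Gibbs step (reusing the assumed posteriors from the sketch above): draw one hypothesis in proportion to its posterior and classify with it alone.

import random

# Reusing the assumed posteriors from the Bayes optimal classifier sketch.
posteriors = {'h1': 0.3, 'h2': 0.2, 'h3': 0.1, 'h4': 0.1}
hs, weights = zip(*posteriors.items())

# Draw a single hypothesis in proportion to its posterior and classify with it.
h = random.choices(hs, weights=weights, k=1)[0]
print(h)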
8.4 Naive Bayes Algorithm For Continuous Attributes
There are two ways to predict with Naive Bayes algorithm for
continuous attributes:
1. Discretize continuous feature to discrete feature.
2. Apply Normal or Gaussian distribution for continuous feature.
Gaussian Naive Bayes Algorithm
In Gaussian Naive Bayes, the values of continuous features are assumed to
be sampled from a Gaussian distribution.
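Concretely, the class-conditional likelihood of a continuous value is evaluated with the Gaussian probability density function; in the sketch below the mean and variance for the 'Yes' class are made-up numbers, not those of Table 8.13.

import math

def gaussian_pdf(x, mean, var):
    # Class-conditional likelihood of a continuous value under a Gaussian.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed class statistics for CGPA given Job Offer = Yes (made-up numbers):
mean_yes, var_yes = 8.9, 0.25
print(round(gaussian_pdf(8.5, mean_yes, var_yes), 4))   # likelihood of CGPA = 8.5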
Example 8.4:
Assess a student's performance using the Naïve Bayes algorithm for a continuous attribute. Predict whether a student gets a job offer or not in the final year of the course. The training dataset T consists of 10 data instances with attributes such as 'CGPA' and 'Interactiveness', as shown in Table 8.13. The target variable is Job Offer, which is classified as Yes or No for a candidate student.
Solution:
Step 1: Compute the prior probability for the target feature Job offer
8.5 OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS
Some of the popular variants of Bayesian classifier are listed below:
Bernoulli Naive Bayes Classifier: Bernoulli Naive Bayes works with discrete features. In this algorithm, the features used for making predictions are Boolean variables that take only two values, 'yes' or 'no'. This is particularly useful for text classification where all features are binary, each feature indicating whether a word occurs in the document or not (see the sketch after this list).
Multinomial Naive Bayes Classifier: This algorithm is a generalization of the
Bernoulli Naive Bayes model that works for categorical data or particularly
integer features. This classifier is useful for text classification where each
feature will have an integer value that represents the frequency of occurrence of
words.
Multi-class Naive Bayes Classifier: This algorithm is useful for classification problems with more than two classes, where the target feature contains multiple classes and a test instance has to be predicted with the class it belongs to.
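A hedged scikit-learn illustration of the first two variants (the count matrix and labels below are made up): Multinomial Naive Bayes consumes word frequencies, while Bernoulli Naive Bayes consumes word presence/absence.

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word-count features; binarizing them yields presence/absence features.
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 3]])
X_binary = (X_counts > 0).astype(int)
y = np.array([0, 1, 0, 1])

print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))   # frequency-based
print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]]))     # presence-based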