

MODULE 4

DECISION TREE LEARNING & BAYESIAN LEARNING


Introduction

 Why is it called a decision tree?

 Because it starts from a root node and branches out to a number of possible solutions, like a tree.
 The benefits of having a decision tree are as follows:
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
 Example: a toll-free number menu, where each question asked leads toward a final decision.

Structure of a Decision Tree


 A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
 The topmost node in the tree is the root node. Decision trees apply to both classification and regression models.


 Figure 6.1 shows the symbols that are used to represent different nodes in the construction of decision trees.
 Decision networks are also called influence diagrams.

 The decision tree consists of 2 major procedures:

1) Building a tree and

2) Knowledge inference or classification.

Building the Tree

Knowledge Inference or Classification

Advantages of Decision Trees


Disadvantages of Decision Trees

Fundamentals of Entropy

 Given a training dataset with a set of attributes, the decision tree is constructed by finding the attribute that best describes the target class for the given test instances.
 The best split attribute is the one that, among all features, carries the most information about how to split the dataset so that the target class is accurately identified for the test instances.
 The split should be as pure as possible at every stage of selecting the best feature; entropy is the measure used to quantify this impurity.
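The following is a minimal sketch (not from the notes) of how entropy and information gain can be computed in Python for a candidate split attribute; the dataset layout (a list of dictionaries) and the column names are assumptions made only for illustration.

import math
from collections import Counter

def entropy(labels):
    # Entropy_Info of a list of class labels, e.g. ['Yes', 'Yes', 'No']
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Gain(attribute) = Entropy(dataset) - weighted entropy of the partitions
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder

For example, entropy(['Yes'] * 7 + ['No'] * 3) evaluates to about 0.8813, matching the Entropy_Info(7, 3) computation used in the worked examples below.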


Algorithm 6.1: General Algorithm for Decision Trees

Decision tree induction algorithms

ID3 Tree Construction (ID3 stands for Iterative Dichotomiser 3)


 A decision tree is one of the most powerful tools of supervised learning, used for both classification and regression tasks.
 It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.


 It is constructed by recursively splitting the training data into subsets based on
the values of the attributes until a stopping criterion is met, such as the
maximum depth of the tree or the minimum number of samples required to split
a node.
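As a hedged illustration of these stopping criteria, the sketch below uses scikit-learn (assumed to be available); its DecisionTreeClassifier is CART-based rather than ID3, but the entropy criterion and the stopping parameters mirror the ideas described above. The iris dataset is used only as a placeholder.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy",   # impurity measure used to choose the best split
    max_depth=3,           # stopping criterion: maximum depth of the tree
    min_samples_split=4,   # stopping criterion: minimum samples needed to split a node
)
clf.fit(X, y)
print(clf.predict(X[:5]))  # class labels for the first five training instances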


Step 1: Calculate the Entropy for the target class ‘Job Offer’.

Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] ≈ 0.8813

Iteration 1:
Step 2: Calculate Entropy_Info and Gain (information gain) for each of the attributes in the training dataset.


Step 3: From Table 6.8, choose the attribute for which the entropy is minimum and therefore the gain is maximum as the best split attribute.

Iteration 2:


Example: Construct a decision tree for the training dataset below using
ID3 Algorithm

C4.5 Construction
 C4.5 is a widely used algorithm for constructing decision trees from a dataset.
 The disadvantages of ID3 are: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit.
 To overcome these limitations, Ross Quinlan, the inventor of ID3, improved on these bottlenecks and created a new algorithm named C4.5.
 The new algorithm can build more generalized models, including continuous data, and can handle missing data. It also works with discrete data and supports post-pruning.
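C4.5 replaces plain information gain with the gain ratio, which divides the gain by the split information of the attribute. A minimal sketch is shown below; it assumes the entropy and information_gain helpers from the earlier sketch are in scope, and the dataset layout is again only illustrative.

import math

def split_info(rows, attribute):
    # Split_Info(attribute): entropy of the attribute's own value distribution
    total = len(rows)
    counts = {}
    for row in rows:
        counts[row[attribute]] = counts.get(row[attribute], 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, attribute, target):
    si = split_info(rows, attribute)
    if si == 0:
        return 0.0  # attribute has a single value; gain ratio is undefined
    return information_gain(rows, attribute, target) / si

The attribute with the maximum gain ratio is then chosen as the best split attribute, as in Step 3 of the worked example below.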


Example: Construct a decision tree for Table 6.3 using C4.5

Iteration 1:
Step 1: Calculate the Entropy for the target class ‘Job Offer’.

Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] ≈ 0.8813

Step 2: Calculate Entropy_Info, Gain (information gain), Split_Info and Gain Ratio for each of the attributes in the training dataset.


Step 3: Choose the attribute for which the gain ratio is maximum as the best split attribute.


The final decision tree is shown in the figure below.

Example 2:


Dealing with Continuous Attributes in C4.5


 Similarly, the calculations are done for each of the distinct values of the attribute CGPA and a table is created. The value of CGPA with the maximum gain is then chosen as the threshold value, or best split point. From Table 6.13, we can observe that CGPA 7.9 has the maximum gain of 0.4462.
 Hence, 7.9 is chosen as the split point for CGPA. Now, we can discretize the continuous values of CGPA into two categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in Table 6.14.
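A hedged sketch of this threshold search is given below: each distinct value of the continuous attribute is tried as a candidate split point and the one with the maximum information gain is kept. It reuses the entropy helper from the earlier sketch; the row layout and column names are assumptions for illustration only.

def best_threshold(rows, attribute, target):
    # Try every distinct value of a continuous attribute as a candidate threshold
    best_t, best_gain = None, -1.0
    base = entropy([r[target] for r in rows])
    for t in sorted(set(r[attribute] for r in rows)):
        left = [r[target] for r in rows if r[attribute] <= t]
        right = [r[target] for r in rows if r[attribute] > t]
        if not left or not right:
            continue  # skip degenerate splits (e.g. the largest value)
        gain = base - (len(left) / len(rows)) * entropy(left) \
                    - (len(right) / len(rows)) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain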


Classification and Regression Trees (CART) Construction


 Classification and Regression Trees (CART) is a widely used algorithm for constructing decision trees that can be applied to both classification and regression tasks. CART is similar to C4.5 but has some differences in its construction and splitting criteria.
 It constructs the tree as a binary tree by recursively splitting a node into two child nodes, even if an attribute has more than two possible values. The GINI index is calculated for all the subsets of attribute values, and the subset which has the maximum value is selected as the best split subset.
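The sketch below shows, under the usual definitions, how the Gini index of a node and of a candidate binary split can be computed; it is a minimal illustration rather than the full CART procedure.

def gini(labels):
    # Gini index of a node: 1 - sum of squared class proportions
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_of_split(left_labels, right_labels):
    # Weighted Gini index of a binary split into two child nodes (lower is purer)
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

print(gini_of_split(['Yes', 'Yes', 'No'], ['No', 'No']))  # about 0.2667 for this toy split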


Example: Construct a decision tree for Table 6.3 using CART


Regression Trees


Home Work Problems


VALIDATING AND PRUNING OF DECISION TREES

 Validating and pruning decision trees is a crucial part of building accurate and
robust machine learning models.
 Decision trees are prone to overfitting, which means they can learn to capture noise and details in the training data that do not generalize well to new, unseen data.
 Validation and pruning are techniques used to mitigate this issue and improve
the performance of decision tree models.
 The pre-pruning technique for decision trees is to tune the hyperparameters prior to the training pipeline. It involves the heuristic known as 'early stopping', which stops the growth of the decision tree, preventing it from reaching its full depth.
 It stops the tree-building process to avoid producing leaves with small samples. During each stage of the splitting of the tree, the cross-validation error is monitored.
 If the value of the error does not decrease any more, the growth of the decision tree is stopped.
 The hyperparameters that can be tuned for early stopping and preventing overfitting are max_depth, min_samples_leaf, and min_samples_split (see the sketch after this list).
 These same parameters can also be tuned to obtain a robust model.
 Post-pruning does the opposite of pre-pruning and allows the decision tree model to grow to its full depth. Once the model grows to its full depth, tree branches are removed to prevent the model from overfitting.
 The algorithm will continue to partition the data into smaller subsets until the final subsets produced are similar in terms of the outcome variable.
 The final subsets of the tree will consist of only a few data points, allowing the tree to fit the training data almost exactly. However, when a new data point that differs from the learned data is introduced, it may not be predicted well.
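A hedged scikit-learn sketch of both ideas is given below (the library and dataset are assumptions for illustration): pre-pruning via the hyperparameters named above, and post-pruning via cost-complexity pruning after the tree has been grown to full depth.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the tree's growth early with the hyperparameters above
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             min_samples_split=10).fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune branches using cost-complexity pruning.
# In practice the ccp_alpha value would be chosen by cross-validation.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))  # compare held-out accuracy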


Chapter 8
Bayesian Learning
8.1 Introduction to probability-based learning
8.2 Fundamentals of Bayes theorem
8.3 Classification using Bayes model
8.3.1 Naïve Bayes Algorithm
8.3.2 Brute Force Bayes Algorithm
8.3.3 Bayes optimal classifier
8.3.4 Gibbs Algorithm
8.4 Naïve Bayes Algorithm for continuous attributes
8.5 Other popular types of naïve Bayes classifiers

Bayesian learning

 Bayesian Learning is a learning method that describes and represents knowledge in an uncertain domain and provides a way to reason about this knowledge using probability measures.
 It uses Bayes theorem to infer the unknown parameters of a model. Bayesian
inference is useful in many applications which involve reasoning and
diagnosis such as game theory, medicine, etc.
 Bayesian inference is much more powerful in handling missing data and
for estimating any uncertainty in predictions.

8.1 Introduction to probability-based learning


 Probability-based learning is one of the most important practical learning
methods which combines prior knowledge or prior probabilities with
observed data.
 Probabilistic learning uses the concept of probability theory that describes
how to model randomness, uncertainty, and noise to predict future events.
 It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, predict and learn from data. In a probabilistic model, randomness plays a major role and the solution is given as a probability distribution, while in a deterministic model there is no randomness: given the same initial conditions, every run of the model produces the same single outcome as the solution.
 Bayesian learning differs from probabilistic learning in that it uses subjective probabilities (i.e., probabilities based on an individual's belief or interpretation about the outcome of an event, which can change over time) to infer the parameters of a model.
 Two practical learning algorithms called Naive Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian learning.
 These algorithms use prior probabilities and apply Bayes rule to infer useful information.

8.2 Fundamentals of Bayes theorem


 The Naive Bayes model relies on Bayes theorem.


 It works on the principle of three kinds of probabilities called prior
probability, likelihood probability, and posterior probability.

Prior Probability
 It is the general probability of an uncertain event before an observation is seen
or some evidence is collected.
 It is the initial probability that is believed before any new information
is collected.

Likelihood Probability
 Likelihood probability is the relative probability of the observation occurring for each class, or the sampling density for the evidence given the hypothesis.
 It is stated as P(Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence given the parameters.
Posterior Probability

 It is the updated or revised probability of an event, taking into account the observations from the training data.
 P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data. Therefore,
Posterior probability = Prior probability + New evidence

8.3 Classification using bayes model


 Naive Bayes Classification models work on the principle of Bayes theorem.
 Bayes' rule is a mathematical formula used to determine the posterior probability, given prior probabilities of events. Generally, Bayes theorem is used to select the most probable hypothesis from data, considering both prior knowledge and posterior distributions.
 It is based on the calculation of the posterior probability and is stated as:

P(Hypothesis h | Evidence E) = P(Evidence E | Hypothesis h) × P(Hypothesis h) / P(Evidence E)      (8.1)

 Here, Hypothesis 'h' is the target class to be classified and Evidence 'E' is the given test instance.
 P(Hypothesis h) is the prior probability of the hypothesis h without observing the training data or considering any evidence. It denotes the prior belief or the initial probability that the hypothesis h is correct.
 P(Evidence E | Hypothesis h) is the likelihood probability of the evidence given the hypothesis, as described above.
 P(Evidence E) is the prior probability of the evidence E from the training dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.
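As a small illustration with purely hypothetical numbers (not taken from the notes): if P(h) = 0.3, P(E | h) = 0.8 and P(E) = 0.6, then Eq. (8.1) gives P(h | E) = (0.8 × 0.3) / 0.6 = 0.4, i.e. observing the evidence E raises the belief in hypothesis h from 0.3 to 0.4.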


Maximum A Posteriori (MAP) Hypothesis, hMAP


 Given a set of candidate hypotheses, the hypothesis which has the maximum posterior probability is considered the maximum probable hypothesis, or most probable hypothesis.
 This most probable hypothesis is called the Maximum A Posteriori Hypothesis, hMAP. Bayes theorem, Eq. (8.1), can be used to find hMAP.

Maximum Likelihood (ML) Hypothesis, hML


 Given a set of candidate hypotheses, if every hypothesis is equally probable a priori, only P(E | h) is used to find the most probable hypothesis.
 The hypothesis that gives the maximum likelihood P(E | h) is called the Maximum Likelihood (ML) Hypothesis, hML.
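Written out in the notes' notation (these are the standard definitions), the two hypotheses are:

hMAP = argmax over h in H of P(h | E) = argmax over h in H of P(E | h) × P(h)
       (the P(E) term from Eq. (8.1) is dropped, since it is the same for every hypothesis)

hML  = argmax over h in H of P(E | h)
       (used when all priors P(h) are assumed equal)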

Correctness of Bayes theorem


 One concept related to Bayes theorem is the principle of Minimum Description Length (MDL).
 The minimum description length (MDL) principle is yet another
powerful method like Occam's razor principle to perform inductive
inference.
 It states that the best and most probable hypothesis for a set of observed data is the one with the minimum description length. Recall the Maximum A Posteriori (MAP) Hypothesis, hMAP, from Eq. (8.2), which says that given a set of candidate hypotheses, the hypothesis which has the maximum posterior probability is considered the most probable hypothesis.
 The Naive Bayes algorithm uses Bayes theorem and applies this MDL principle to find the best hypothesis for a given problem.

8.3.1 Naïve Bayes Algorithm


 It is a supervised binary-class or multi-class classification algorithm that works on the principle of Bayes theorem.
 There is a family of Naive Bayes classifiers based on a common principle. These algorithms assume that the features of the dataset are independent and give each feature equal weightage. The algorithm works particularly well for large datasets and is very fast.
 It is one of the most effective and simple classification algorithms.
 The algorithm considers all features to be independent of each other, even though they may individually be dependent on the classified object.
 Each of the features contributes a probability value independently during classification, and hence the algorithm is called 'naive'.
 Some important applications of these algorithms are text classification, recommendation systems and face recognition.
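A minimal counting-based sketch of the algorithm is given below; the dataset layout, feature names and the absence of smoothing are simplifying assumptions, and it mirrors the prior and likelihood calculations worked out in the examples that follow.

from collections import Counter, defaultdict

def train_nb(rows, features, target):
    # Prior counts P(class) and likelihood counts count(feature = value, class)
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # key: (class, feature) -> Counter of values
    for r in rows:
        for f in features:
            value_counts[(r[target], f)][r[f]] += 1
    return class_counts, value_counts, len(rows)

def predict_nb(model, features, x):
    class_counts, value_counts, n = model
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n  # prior probability P(c)
        for f in features:
            score *= value_counts[(c, f)][x[f]] / cc  # likelihood P(x_f | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

In practice a zero count for an unseen attribute value makes the whole product zero, which is why Laplace (add-one) smoothing is usually applied to the likelihood counts.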


 As explained earlier, the likelihood probability is stated as the sampling density for the evidence given the hypothesis.
 It is denoted as P(Evidence | Hypothesis), which says how likely the occurrence of the evidence is, given the parameters.
 It is calculated as the number of instances of each attribute value for a given class value, divided by the number of instances with that class value.
 For example, P(CGPA ≥ 9 | Job Offer = Yes) denotes the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' divided by the total number of instances with 'Job Offer = Yes'.
 From Table 8.3, the Frequency Matrix of CGPA, the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' is 3.
 The total number of instances with 'Job Offer = Yes' is 7.
 Hence, P(CGPA ≥ 9 | Job Offer = Yes) = 3/7.
 Similarly, the Likelihood probability is calculated for all attribute values
of feature CGPA.


8.3.2 Brute Force Bayes Algorithm


 Applying Bayes theorem, the Brute Force Bayes algorithm relies on the idea of concept learning: given a hypothesis space H for the training dataset T, the algorithm computes the posterior probabilities for every hypothesis hi ∈ H.
 Then, the Maximum A Posteriori (MAP) Hypothesis, hMAP, is used to output the hypothesis with the maximum posterior probability. The algorithm is quite expensive since it requires computations for all the hypotheses.
 Although computing posterior probabilities for every hypothesis is inefficient, this idea is applied in various other algorithms, which is also quite interesting.


8.3.3 Bayes Optimal Classifier


 The Bayes optimal classifier is a probabilistic model which uses Bayes theorem to find the most probable classification for a new instance, given the training data, by combining the predictions of all hypotheses weighted by their posterior probabilities.
 This is different from the Maximum A Posteriori (MAP) Hypothesis, hMAP, which chooses the single most probable hypothesis.
 Here, a new instance is classified to the possible classification value Ci that maximises the sum over hi ∈ H of P(Ci | hi) P(hi | T).

 hMAP chooses h1, which has the maximum posterior probability value 0.3, as the solution and gives the result that the patient is COVID negative.
 But the Bayes optimal classifier combines the predictions of h2, h3 and h4, whose posteriors sum to 0.4, and gives the result that the patient is COVID positive.
 Therefore, max over Ci ∈ {COVID Positive, COVID Negative} of the sum over hi ∈ H of P(Ci | hi) P(hi | T) = COVID Positive.
 Thus, the algorithm diagnoses the new instance to be COVID Positive.
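A hedged, generic sketch of this combination rule follows. The posterior values are hypothetical and only loosely follow the example above (one hypothesis favouring Negative with posterior 0.3, the remaining hypotheses jointly carrying 0.4 and favouring Positive); the 0/1 values of P(C | h) simply encode which class each hypothesis predicts.

posterior = {"h1": 0.30, "h2": 0.20, "h3": 0.10, "h4": 0.10}   # P(h | T), assumed values
p_class_given_h = {                                            # P(C | h), assumed 0/1 votes
    "h1": {"Positive": 0.0, "Negative": 1.0},
    "h2": {"Positive": 1.0, "Negative": 0.0},
    "h3": {"Positive": 1.0, "Negative": 0.0},
    "h4": {"Positive": 1.0, "Negative": 0.0},
}

score = {"Positive": 0.0, "Negative": 0.0}
for h, p_h in posterior.items():
    for c, p_c in p_class_given_h[h].items():
        score[c] += p_c * p_h        # accumulate P(C | h) * P(h | T)

print(score, max(score, key=score.get))   # Positive (0.4) beats Negative (0.3)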

8.3.4 Gibbs Algorithm


 The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all hypotheses in the hypothesis space and then combines the predictions to classify a new instance.


 The Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the hypothesis space according to the posterior probability distribution and uses it to classify a new instance.
 It is found that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes optimal classifier.

8.4 Naive Bayes Algorithm For Continuous Attributes


 There are two ways to predict with Naive Bayes algorithm for
continuous attributes:
1. Discretize continuous feature to discrete feature.
2. Apply Normal or Gaussian distribution for continuous feature.

Gaussian Naive Bayes Algorithm


 In Gaussian Naive Bayes, the values of continuous features are assumed to
be sampled from a Gaussian distribution.
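The sketch below shows, under the usual Gaussian assumption, how the likelihood of a continuous value such as CGPA can be computed for one class; the sample values and names are hypothetical. scikit-learn's GaussianNB implements the same idea for whole datasets.

import math

def gaussian_pdf(x, mean, std):
    # Normal density N(x; mean, std) used as the likelihood P(x | class)
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

cgpa_yes = [9.1, 8.5, 9.5, 8.2, 8.8]   # hypothetical CGPA values for Job Offer = Yes
mean = sum(cgpa_yes) / len(cgpa_yes)
std = (sum((v - mean) ** 2 for v in cgpa_yes) / (len(cgpa_yes) - 1)) ** 0.5

print(gaussian_pdf(8.0, mean, std))    # likelihood of CGPA = 8.0 given Job Offer = Yes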

Example 8.4:
Assess a student's performance using the Naïve Bayes algorithm for the continuous attribute. Predict whether a student gets a job offer or not in the final year of the course. The training dataset T consists of 10 data instances with attributes such as 'CGPA' and 'Interactiveness', as shown in Table 8.13. The target variable is Job Offer, which is classified as Yes or No for a candidate student.

Solution:
Step 1: Compute the prior probability for the target feature Job offer


8.5 OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS


Some of the popular variants of Bayesian classifier are listed below:

 Bernoulli Naive Bayes Classifier: Bernoulli Naive Bayes works with discrete features. In this algorithm, the features used for making predictions are Boolean variables that take only two values, such as 'yes' or 'no'. This is particularly useful for text classification where all features are binary, each feature indicating whether a word occurs in the document or not.
 Multinomial Naive Bayes Classifier: This algorithm is a generalization of the Bernoulli Naive Bayes model that works for categorical data, particularly integer features. This classifier is useful for text classification where each feature has an integer value that represents the frequency of occurrence of words.
 Multi-class Naive Bayes Classifier: This algorithm is useful for classification problems with more than two classes, where the target feature contains multiple classes and the test instance has to be predicted with the class it belongs to.
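A hedged scikit-learn sketch of the first two variants on a tiny, made-up text-classification task (the documents and labels are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["free offer now", "meeting schedule today", "free meeting offer"]
labels = ["spam", "ham", "spam"]

counts = CountVectorizer().fit_transform(docs)              # integer word counts
print(MultinomialNB().fit(counts, labels).predict(counts))  # Multinomial: uses frequencies

binary = CountVectorizer(binary=True).fit_transform(docs)   # Boolean word presence
print(BernoulliNB().fit(binary, labels).predict(binary))    # Bernoulli: uses presence/absence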
