21CS54 (AI/ML) Prof. Thameeza

MODULE 4

DECISION TREE LEARNING & BAYESIAN LEARNING


Introduction

• Why is it called a decision tree?

• Because it starts from a root node and branches out into a number of possible solutions, like an inverted tree.
• The benefits of having a decision tree are as follows:
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
• Example: a toll-free number.

Structure of a Decision Tree


• A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node. Decision trees apply to both classification and regression models.


• Figure 6.1 shows the symbols that are used to represent different nodes in the construction of decision trees.
• Decision networks are also called influence diagrams.

• The decision tree consists of 2 major procedures:

1) Building a tree and

2) Knowledge inference or classification.

Building the Tree

Knowledge Inference or Classification

Advantages of Decision Trees


Disadvantages of Decision Trees

Fundamentals of Entropy

• Given a training dataset with a set of attributes, the decision tree is constructed
by finding the attribute that best describes the target class for the given test
instances.
• The best split attribute is the one which, among all features, carries the most information about how to split the dataset so that the target class is identified accurately for the test instances.
• The splitting should be as pure as possible at every stage of selecting the best feature; entropy and information gain quantify this purity, as sketched below.
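A minimal Python sketch of these two quantities, assuming the training data for one candidate attribute is given as a list of (attribute_value, class_label) rows; the names entropy and information_gain are illustrative, not from the textbook.

import math
from collections import Counter

def entropy(labels):
    """Entropy_Info of a list of class labels: -sum p_i * log2(p_i)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows):
    """Gain of one attribute; rows is a list of (attribute_value, class_label)."""
    labels = [label for _, label in rows]
    base = entropy(labels)                       # entropy of the whole set
    by_value = {}
    for value, label in rows:
        by_value.setdefault(value, []).append(label)
    weighted = sum(len(subset) / len(rows) * entropy(subset)
                   for subset in by_value.values())   # Entropy_Info of the attribute
    return base - weighted

# Example with the class distribution used later in this module (7 Yes, 3 No):
print(round(entropy(['Yes'] * 7 + ['No'] * 3), 4))    # 0.8813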


Algorithm 6.1: General Algorithm for Decision Trees

Decision tree induction algorithms

ID3 Tree Construction (ID3 stands for Iterative Dichotomiser 3)


• A decision tree is one of the most powerful tools among supervised learning algorithms, used for both classification and regression tasks.
• It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.


• It is constructed by recursively splitting the training data into subsets based on
the values of the attributes until a stopping criterion is met, such as the
maximum depth of the tree or the minimum number of samples required to split
a node.
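A minimal recursive sketch of this construction, assuming each training row is a dict such as {'CGPA': '>=9', 'Interactiveness': 'Yes', 'Job Offer': 'Yes'}; it reuses the information_gain helper from the earlier sketch, picks the attribute with maximum gain as in ID3, and stops on pure nodes, exhausted attributes, or a depth limit.

from collections import Counter
# information_gain is the helper defined in the earlier entropy sketch

def best_attribute(data, attributes, target):
    """Pick the attribute with maximum information gain."""
    return max(attributes,
               key=lambda a: information_gain([(row[a], row[target]) for row in data]))

def build_tree(data, attributes, target, depth=0, max_depth=5):
    labels = [row[target] for row in data]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, no attributes remain, or the depth limit is reached
    if len(set(labels)) == 1 or not attributes or depth >= max_depth:
        return majority
    split = best_attribute(data, attributes, target)
    tree = {split: {}}
    for value in set(row[split] for row in data):
        subset = [row for row in data if row[split] == value]
        remaining = [a for a in attributes if a != split]
        tree[split][value] = build_tree(subset, remaining, target, depth + 1, max_depth)
    return tree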


Step 1: Calculate the Entropy for the target class ‘Job Offer’.

Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] = 0.8813

Iteration 1:
Step 2: Calculate Entropy_Info and Gain (information gain) for each attribute in the training dataset.


Step 3: From Table 6.8, choose the attribute for which Entropy_Info is minimum, and therefore the gain is maximum, as the best split attribute.

Iteration 2:


Example: Construct a decision tree for the training dataset below using ID3
Algorithm

C4.5 Construction
• C4.5 is a widely used algorithm for constructing decision trees from a dataset.
• Disadvantages of ID3 are: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit.
• To overcome these disadvantages, Ross Quinlan, the inventor of ID3, improved on these bottlenecks and created a new algorithm named C4.5.
• The new algorithm can build more generalized models, including handling continuous data and missing data. It also still works with discrete data and supports post-pruning. Instead of raw information gain, C4.5 splits on the gain ratio (gain divided by split information), as sketched below.
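A minimal sketch of the C4.5 splitting criterion, assuming the same (attribute_value, class_label) row layout and the information_gain helper from the earlier sketch; split_info and gain_ratio follow the standard definitions.

import math

def split_info(rows):
    """Split_Info(A) = -sum |S_v|/|S| * log2(|S_v|/|S|) over the attribute's values."""
    total = len(rows)
    counts = {}
    for value, _ in rows:
        counts[value] = counts.get(value, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows):
    """Gain_Ratio(A) = Gain(A) / Split_Info(A); guard against a zero split info."""
    si = split_info(rows)
    return information_gain(rows) / si if si > 0 else 0.0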


Example: Construct a decision tree for Table 6.3 using C4.5

Iteration 1:
Step 1: Calculate the Entropy for the target class ‘Job Offer’.

Entropy_Info(7, 3) = -[(7/10) log2(7/10) + (3/10) log2(3/10)] = 0.8813

Step 2: Calculate Entropy_Info, Gain (information gain), Split_Info, and Gain Ratio for each attribute in the training dataset.


Step 3: Choose the attribute for which gain ratio is maximum as the best split
attribute


The final decision tree is shown in the figure below.

Example 2:


Dealing with Continuous Attributes in C4.5


• Similarly, the calculations are done for each distinct value of the attribute CGPA and a table is created. The value of CGPA with maximum gain is then chosen as the threshold value or the best split point. From Table 6.13, we can observe that CGPA = 7.9 has the maximum gain of 0.4462.
• Hence, CGPA ≤ 7.9 is chosen as the split point. We can now discretize the continuous values of CGPA into two categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in Table 6.14.
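A minimal sketch of this threshold search, assuming rows of (cgpa_value, class_label) and the information_gain helper from the earlier sketch; every distinct value is tried as a binary split point and the one with maximum gain is returned, mirroring the CGPA 7.9 example above.

def best_threshold(rows):
    """Try each distinct continuous value as a binary split point; return the best."""
    best_point, best_gain = None, -1.0
    for candidate in sorted(set(value for value, _ in rows)):
        # Discretize into '<=' versus '>' the candidate, then measure the resulting gain
        binary_rows = [('<=' if value <= candidate else '>', label)
                       for value, label in rows]
        gain = information_gain(binary_rows)
        if gain > best_gain:
            best_point, best_gain = candidate, gain
    return best_point, best_gain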


Classification and Regression Trees (CART) Construction


• Classification and Regression Trees (CART) is a widely used algorithm for constructing decision trees that can be applied to both classification and regression tasks. CART is similar to C4.5 but has some differences in its construction and splitting criteria.
• It constructs the tree as a binary tree by recursively splitting a node into two child nodes, even if an attribute has more than two possible values. The Gini index is calculated for all binary subsets of each attribute's values, and the subset that gives the largest reduction in impurity (equivalently, the smallest weighted Gini index) is selected as the best split subset, as sketched below.
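A minimal sketch of the Gini computation for one candidate binary split, assuming rows of (attribute_value, class_label); the chosen subset of attribute values goes to the left child, the remaining values go to the right, and the split with the largest impurity reduction (smallest weighted Gini) is preferred.

from collections import Counter

def gini(labels):
    """Gini index of a node: 1 - sum p_i^2."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, left_values):
    """Weighted Gini of the binary split induced by a subset of attribute values."""
    left = [label for value, label in rows if value in left_values]
    right = [label for value, label in rows if value not in left_values]
    n = len(rows)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def gini_gain(rows, left_values):
    """Reduction in impurity achieved by the split (higher is better)."""
    return gini([label for _, label in rows]) - gini_of_split(rows, left_values)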


Example: Construct a decision tree for Table 6.3 using CART


Regression Trees


Homework Problems


VALIDATING AND PRUNING OF DECISION TREES

• Validating and pruning decision trees is a crucial part of building accurate and
robust machine learning models.
• Decision trees are prone to overfitting, which means they can learn to capture noise and details in the training data that do not generalize well to new, unseen data.
• Validation and pruning are techniques used to mitigate this issue and improve
the performance of decision tree models.
• The pre-pruning technique for decision trees is tuning the hyperparameters prior to the training pipeline. It involves the heuristic known as 'early stopping', which stops the growth of the decision tree, preventing it from reaching its full depth.
• It stops the tree-building process to avoid producing leaves with small samples. During each stage of splitting the tree, the cross-validation error is monitored.
• If the value of the error does not decrease any more, we stop the growth of the decision tree.
• The hyperparameters that can be tuned for early stopping and preventing overfitting are max_depth, min_samples_leaf, and min_samples_split (see the sketch after this list).
• These same parameters can also be tuned to obtain a robust model.
• Post-pruning does the opposite of pre-pruning and allows the decision tree model to grow to its full depth. Once the model grows to its full depth, tree branches are removed to prevent the model from overfitting.
• The algorithm will continue to partition data into smaller subsets until the final subsets produced are similar in terms of the outcome variable.
• The final subset of the tree will consist of only a few data points, allowing the tree to have learned the data to a T. However, when a new data point that differs from the learned data is introduced, it may not be predicted well.
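A short illustrative example of both ideas using scikit-learn's DecisionTreeClassifier; max_depth, min_samples_leaf, and min_samples_split implement pre-pruning, while cost-complexity pruning (ccp_alpha) implements post-pruning. The dataset and parameter values are placeholders, not tuned recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: early stopping via hyperparameters set before training
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    min_samples_split=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune using cost-complexity pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # one candidate alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))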


Chapter 8
Bayesian Learning
8.1 Introduction to probability-based learning
8.2 Fundamentals of Bayes theorem
8.3 Classification using Bayes model
8.3.1 Naïve Bayes Algorithm
8.3.2 Brute Force Bayes Algorithm
8.3.3 Bayes optimal classifier
8.3.4 Gibbs Algorithm
8.4 Naïve Bayes Algorithm for continuous attributes
8.5 Other popular types of Naïve Bayes classifiers

Bayesian learning

• Bayesian Learning is a learning method that describes and represents knowledge


in an uncertain domain and provides a way to reason about this knowledge using a probability measure.
• It uses Bayes theorem to infer the unknown parameters of a model. Bayesian
inference is useful in many applications which involve reasoning and diagnosis
such as game theory, medicine, etc.
• Bayesian inference is much more powerful in handling missing data and for
estimating any uncertainty in predictions.

8.1 Introduction to probability-based learning


• Probability-based learning is one of the most important practical learning
methods which combines prior knowledge or prior probabilities with observed
data.
• Probabilistic learning uses the concept of probability theory that describes how
to model randomness, uncertainty, and noise to predict future events.
• It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, predict, and learn from data. In a probabilistic model, randomness plays a major role and the solution is given as a probability distribution, while in a deterministic model there is no randomness: given the same initial conditions, it produces the same result every time the model is run, with a single possible outcome as the solution.
• Bayesian learning differs from probabilistic learning as it uses subjective probabilities (i.e., probabilities based on an individual's belief or interpretation of the outcome of an event, which can change over time) to infer the parameters of a model.
• Two practical learning algorithms called Naive Bayes learning and Bayesian
Belief Network (BBN) form the major part of Bayesian learning.
• These algorithms use prior probabilities and apply Bayes rule to infer useful
information

8.2 Fundamentals of Bayes theorem


• Naive Bayes Model relies on Bayes theorem


• It works on the principle of three kinds of probabilities called prior probability,
likelihood probability, and posterior probability.

Prior Probability
• It is the general probability of an uncertain event before an observation is seen or
some evidence is collected.
• It is the initial probability that is believed before any new information is
collected.

Likelihood Probability
• Likelihood probability is the relative probability of the observation occurring for
each class or the sampling density for the evidence given the hypothesis.
• It is stated as P(Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence given the parameters.
Posterior Probability

• It is the updated or revised probability of an event taking into account the


observations from the training data.
• P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data. Informally, posterior probability = prior probability updated by new evidence; formally, the posterior is proportional to the likelihood times the prior.
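A small worked illustration with hypothetical numbers (not from the textbook dataset), showing how the three probabilities fit together: suppose a disease affects 1% of a population, so the prior is P(D) = 0.01; a test detects it with likelihood P(+ | D) = 0.9 and raises false alarms with P(+ | not D) = 0.05. After a positive test, the posterior is
P(D | +) = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | not D) P(not D)] = (0.9 × 0.01) / (0.9 × 0.01 + 0.05 × 0.99) ≈ 0.15,
so the initial belief of 1% is revised to about 15% once the evidence is observed.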

8.3 Classification using Bayes model


• Naive Bayes classification models work on the principle of Bayes theorem.
• Bayes' rule is a mathematical formula used to determine the posterior probability, given the prior probabilities of events. Generally, Bayes theorem is used to select the most probable hypothesis from data, considering both prior knowledge and posterior distributions.
• It is based on the calculation of the posterior probability and is stated as:
P(Hypothesis h | Evidence E) = P(Evidence E | Hypothesis h) P(Hypothesis h) / P(Evidence E)      (Eq. 8.1)
• Here, Hypothesis h is the target class to be classified and Evidence E is the given test instance.

• P(Hypothesis h) is the prior probability of the hypothesis h without observing the training data or considering any evidence.
• It denotes the prior belief or the initial probability that the hypothesis h is correct.
• P(Evidence E) is the prior probability of the evidence E from the training dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.


Maximum A Posteriori (MAP) Hypothesis, hMAP


• Given a set of candidate hypotheses, the hypothesis which has the maximum posterior probability value is considered the maximum probable hypothesis or most probable hypothesis.
• This most probable hypothesis is called the Maximum A Posteriori Hypothesis, hMAP. Bayes theorem Eq. (8.1) can be used to find hMAP.

Maximum Likelihood (ML) Hypothesis, hML


• Given a set of candidate hypotheses, if every hypothesis is equally probable, only P(E | h) is used to find the most probable hypothesis.
• The hypothesis that gives the maximum likelihood P(E | h) is called the Maximum Likelihood (ML) Hypothesis, hML.
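In symbols (standard forms, with the denominator P(E) dropped because it is the same for every hypothesis):
hMAP = argmax over h ∈ H of P(h | E) = argmax over h ∈ H of P(E | h) P(h)
hML = argmax over h ∈ H of P(E | h), which coincides with hMAP when all the priors P(h) are equal.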

Correctness of Bayes theorem


• One related concept of Bayes theorem is the principle of Minimum Description


Length (MDL).
• The minimum description length (MDL) principle is yet another powerful
method like Occam's razor principle to perform inductive inference.
• It states that, for a set of observed data, the best and most probable hypothesis is the one with the minimum description. Recall from Eq. (8.2) the Maximum A Posteriori (MAP) Hypothesis, hMAP, which says that given a set of candidate hypotheses, the hypothesis which has the maximum posterior probability value is considered the maximum probable hypothesis or most probable hypothesis.
• Naive Bayes algorithm uses the Bayes theorem and applies this MDL principle
to find the best hypothesis for a given problem.

8.3.1 Naïve Bayes Algorithm


• It is a supervised binary class or multi class classification algorithm that works
on the principle of Bayes theorem.
• There is a family of Naive Bayes classifiers based on a common principle. These algorithms assume the features of the dataset are independent of each other and give each feature equal weightage. The family works particularly well for large datasets and is very fast.
• It is one of the most effective and simple classification algorithms.
• This algorithm considers all features to be independent of each other even though, in reality, they may individually depend on the object being classified.
• Each feature contributes a probability value independently during classification, and hence this algorithm is called naive.
• Some important applications of these algorithms are text classification,
recommendation system and face recognition.


• As explained earlier, the likelihood probability is stated as the sampling density for the evidence given the hypothesis.
• It is denoted as P(Evidence | Hypothesis), which says how likely the occurrence of the evidence is given the parameters.
• It is calculated as the number of instances with each attribute value for a given class value, divided by the number of instances with that class value.
• For example, P(CGPA ≥ 9 | Job Offer = Yes) denotes the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' divided by the total number of instances with 'Job Offer = Yes'.
• From Table 8.3, the Frequency Matrix of CGPA, the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' is 3.
• The total number of instances with 'Job Offer = Yes' is 7.
• Hence, P(CGPA ≥ 9 | Job Offer = Yes) = 3/7.
• Similarly, the likelihood probability is calculated for all attribute values of the feature CGPA.
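A minimal sketch of this frequency-matrix computation, assuming the training data is a list of dicts with a 'Job Offer' target column; priors and likelihoods are read off by counting, exactly as in the 3/7 example above, and the class with the largest product of prior and likelihoods is predicted (Laplace smoothing is omitted so the counts match the worked example).

from collections import Counter, defaultdict

def train_naive_bayes(data, target):
    priors = Counter(row[target] for row in data)          # class counts
    counts = defaultdict(Counter)                           # (feature, class) -> value counts
    for row in data:
        for feature, value in row.items():
            if feature != target:
                counts[(feature, row[target])][value] += 1
    return priors, counts

def predict(instance, priors, counts, total):
    best_class, best_score = None, -1.0
    for cls, cls_count in priors.items():
        score = cls_count / total                           # prior P(class)
        for feature, value in instance.items():
            score *= counts[(feature, cls)][value] / cls_count   # likelihood P(value | class)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

With the textbook table loaded as rows, counts[('CGPA', 'Yes')]['>=9'] / priors['Yes'] would reproduce P(CGPA ≥ 9 | Job Offer = Yes) = 3/7.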


8.3.2 Brute Force Bayes Algorithm


• Applying Bayes theorem, the Brute Force Bayes algorithm relies on the idea of concept learning: given a hypothesis space H for the training dataset T, the algorithm computes the posterior probabilities for every hypothesis h_i ∈ H.
• Then, the Maximum A Posteriori (MAP) Hypothesis, hMAP, is used to output the hypothesis with the maximum posterior probability. The algorithm is quite expensive since it requires computations for all the hypotheses.
• Although computing posterior probabilities for every hypothesis is inefficient, this idea is applied in various other algorithms, which is also quite interesting.


8.3.3 Bayes Optimal Classifier


• The Bayes optimal classifier is a probabilistic model which, in fact, uses Bayes theorem to find the most probable classification for a new instance given the training data by combining the predictions of all posterior hypotheses.
• This is different from the Maximum A Posteriori (MAP) Hypothesis, hMAP, which chooses the single most probable hypothesis.
• Here, a new instance can be classified to a possible classification value C_j by the following:

• hMAP chooses h1, which has the maximum individual posterior probability value of 0.3, as the solution and gives the result that the patient is COVID negative.
• But the Bayes Optimal Classifier combines the predictions of h2, h3 and h4, which together amount to 0.4, and gives the result that the patient is COVID positive.

• Therefore, max over C_j ∈ {COVID Positive, COVID Negative} of Σ_{h_i ∈ H} P(C_j | h_i) P(h_i | T) = COVID Positive.
• Thus the algorithm diagnoses the new instance to be COVID Positive.

8.3.4 Gibbs Algorithm


• The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all hypotheses in the hypothesis space and then combines the predictions to classify a new instance.


• The Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the hypothesis space according to the posterior probability distribution and classifies a new instance with it.
• It is found that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes optimal classifier.

8.4 Naive Bayes Algorithm For Continuous Attributes


• There are two ways to predict with Naive Bayes algorithm for continuous
attributes:
1. Discretize continuous feature to discrete feature.
2. Apply Normal or Gaussian distribution for continuous feature.

Gaussian Naive Bayes Algorithm


• In Gaussian Naive Bayes, the values of continuous features are assumed to be
sampled from a Gaussian distribution.
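A minimal sketch of the Gaussian likelihood used in this variant, assuming the continuous attribute values observed for one class are given as a plain list; the class mean and variance are estimated and plugged into the normal density. The CGPA values in the usage line are hypothetical, not the textbook's table.

import math

def gaussian_likelihood(x, values):
    """P(x | class) under a normal distribution fitted to the class's observed values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)   # sample variance
    return (1.0 / math.sqrt(2 * math.pi * variance)) * math.exp(-(x - mean) ** 2 / (2 * variance))

# Hypothetical CGPA values for the 'Job Offer = Yes' class:
print(round(gaussian_likelihood(8.5, [9.1, 8.0, 8.5, 9.5, 7.9, 8.9, 9.0]), 4))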

Example 8.4:
Assess a student's performance using Naïve Bayes algorithm for the continuous
attribute. Predict whether a student gets a job offer or not in his final year of the
course. The training dataset T consists of 10 data instances with attributes such as 'CGPA' and 'Interactiveness' as shown in Table 8.13. The target variable is Job Offer, which is classified as Yes or No for a candidate student.

Solution:
Step 1: Compute the prior probability for the target feature Job offer


8.5 OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS


Some of the popular variants of the Bayesian classifier are listed below (a short scikit-learn sketch follows the list):

• Bernoulli Naive Bayes Classifier: Bernoulli Naive Bayes works with discrete features. In this algorithm, the features used for making predictions are Boolean variables that take only two values, either 'yes' or 'no'. This is particularly useful for text classification where all features are binary, each feature indicating whether a word occurs or not.
• Multinomial Naive Bayes Classifier: This algorithm is a generalization of the
Bernoulli Naive Bayes model that works for categorical data or particularly
integer features. This classifier is useful for text classification where each feature
will have an integer value that represents the frequency of occurrence of words.
• Multi-class Naive Bayes Classifier: This algorithm is useful for classification problems with more than two classes, where the target feature contains multiple classes and a test instance has to be assigned the class it belongs to.
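A short illustrative example of the first two variants using scikit-learn's BernoulliNB and MultinomialNB on tiny made-up word-count data (the documents, vocabulary, and labels are placeholders, not from the textbook).

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Rows = documents, columns = words; counts for Multinomial, presence/absence for Bernoulli
X_counts = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 0], [0, 0, 4]])
y = np.array(['spam', 'ham', 'spam', 'ham'])

multinomial = MultinomialNB().fit(X_counts, y)                  # uses word frequencies
bernoulli = BernoulliNB().fit((X_counts > 0).astype(int), y)    # uses word presence only

test = np.array([[1, 0, 2]])
print(multinomial.predict(test), bernoulli.predict((test > 0).astype(int)))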
