UNIT – III : Syllabus
Models Based on Decision Trees:
1. Decision Trees for Classification
2. Impurity Measures
3. Properties
4. Regression Based on Decision Trees ( Decision Trees for Regression )
5. Bias–Variance Trade-off
6. Random Forests for Classification and Regression
The Bayes Classifier:
1. Introduction to the Bayes Classifier
2. Bayes’ Rule and Inference
3. The Bayes Classifier and its Optimality
4. Multi-Class Classification
5. Class Conditional Independence
6. Naive Bayes Classifier (NBC)
Decision Tree for Classification
o Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules, and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in machine learning, so choosing the best algorithm for the
given dataset and problem is a key consideration when creating a machine learning
model. Below are two reasons for using a decision tree:
o Decision trees usually mimic the human thinking process while making a decision, so
they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split further
once a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A subtree formed by splitting the main tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node, and the
sub-nodes are called child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with those of the sub-nodes
and moves further. It continues this process until it reaches a leaf node of the tree. The
complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The next
decision node further splits into one decision node (cab facility) and one leaf node.
Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:
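To make the above steps concrete, here is a minimal sketch in Python using scikit-learn's DecisionTreeClassifier on a made-up toy version of the job-offer example (the feature values below are illustrative assumptions, not data from these notes):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy job-offer data (illustrative only): [salary_lakhs, distance_km, cab_facility]
X = [[12, 5, 1], [15, 30, 0], [8, 10, 1], [20, 25, 1], [10, 40, 0], [18, 8, 0]]
y = [1, 0, 0, 1, 0, 1]   # 1 = offer accepted, 0 = offer declined

# The classifier picks the best attribute at each node using an impurity-based ASM (Gini here)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Inspect the learned decision rules, from the root split down to the leaf nodes
print(export_text(clf, feature_names=["salary", "distance", "cab"]))

# Classify a new candidate by walking from the root node to a leaf
print(clf.predict([[16, 12, 1]]))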
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. With this measurement, we can easily select
the best attribute for the nodes of the tree. There are two popular ASM techniques, which
are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
the node/attribute with the highest information gain is split first. Information gain can
be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = the set of samples
o P(yes)= probability of yes
o P(no)= probability of no
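As a small illustration (a sketch using the classic 9-yes/5-no textbook split, not a dataset from these notes), entropy and information gain can be computed as follows:

import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); 0*log2(0) is treated as 0
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

parent = entropy(9/14, 5/14)      # ≈ 0.940 for a node with 9 "yes" and 5 "no" samples

# Suppose an attribute splits the 14 samples into subsets of size 8 (6 yes, 2 no) and 6 (3 yes, 3 no)
child_1 = entropy(6/8, 2/8)       # ≈ 0.811
child_2 = entropy(3/6, 3/6)       # = 1.000

# Information Gain = Entropy(S) - weighted average of the child entropies
info_gain = parent - (8/14 * child_1 + 6/14 * child_2)
print(round(info_gain, 3))        # ≈ 0.048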
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over an attribute with a high
Gini index.
o The CART algorithm uses the Gini index to create splits, and it creates only binary
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - ∑j Pj²
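For comparison, a quick sketch of the Gini index for a binary node (illustrative probabilities only):

def gini_index(class_probs):
    # Gini Index = 1 - sum of squared class probabilities
    return 1 - sum(p ** 2 for p in class_probs)

print(gini_index([0.5, 0.5]))     # 0.5  -> maximally impure binary node
print(gini_index([1.0, 0.0]))     # 0.0  -> pure node
print(gini_index([9/14, 5/14]))   # ≈ 0.459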
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning
tree without reducing accuracy is therefore known as Pruning. There are mainly two types of
tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning.
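As a rough sketch of cost complexity pruning (assuming scikit-learn; the dataset and parameter values are illustrative, not from these notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values for cost complexity pruning (larger alpha -> smaller tree)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  test acc={pruned.score(X_test, y_test):.2f}")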
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process that a human follows while
making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o A decision tree may contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o With more class labels, the computational complexity of the decision tree may increase.
Decision Tree - ID3 Algorithm - Solved Numerical Example-1
Impurity Measures
Properties of Impurity Measures
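The formulas for this part are not reproduced in these notes; as a rough sketch (standard definitions, assumed here), the common impurity measures for a node with class probabilities p1, ..., pK, and the properties usually highlighted (each measure is zero for a pure node and maximal at the uniform class distribution), can be illustrated as follows:

import math

def entropy(probs):
    # Entropy: -sum p_k log2 p_k (0 log 0 treated as 0)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: 1 - sum p_k^2
    return 1 - sum(p ** 2 for p in probs)

def misclassification(probs):
    # Misclassification error: 1 - max_k p_k
    return 1 - max(probs)

pure = [1.0, 0.0]         # pure node: every measure is 0
uniform = [0.5, 0.5]      # uniform node: every measure attains its maximum
for measure in (entropy, gini, misclassification):
    print(measure.__name__, measure(pure), measure(uniform))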
Decision Tree for Regression
A Decision Tree for Regression is a type of supervised learning algorithm used when the
target variable is continuous (numeric), rather than categorical. Instead of predicting class
labels, regression trees predict a real-valued output.
How Regression Trees Work
1. Splitting:
o The dataset is split into subsets based on feature values.
o The split is chosen to minimize a measure of error (commonly Mean Squared
Error (MSE) or Mean Absolute Error (MAE)) between predicted and actual
values.
2. Recursive Partitioning:
o The process is repeated recursively, creating a tree with decision rules at each
node.
o Each split divides the input space into smaller, more homogeneous regions.
3. Prediction:
o For a new input, the tree traverses from the root to a leaf node based on feature
conditions.
o The prediction at the leaf node is usually the mean (or median) of the target
values of training samples in that node.
Example
Suppose you want to predict house price based on features like size and location.
The tree might first split on size > 2000 sqft.
Then, within each group, it might split on location = urban or rural.
Each leaf will output the average house price for the training samples in that region.
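As an illustrative sketch (scikit-learn's DecisionTreeRegressor with a made-up toy dataset; the numbers below are assumptions, not data from these notes):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy house-price data: [size_sqft, is_urban] -> price
X = np.array([[1500, 1], [1800, 0], [2100, 1], [2400, 1], [2600, 0], [3000, 1]])
y = np.array([200_000, 180_000, 320_000, 360_000, 300_000, 420_000])

# A shallow tree keeps the partitions (and the decision rules) easy to interpret
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# The prediction is the mean target value of the leaf region the input falls into
print(reg.predict([[2200, 1]]))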
Objective Function
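The objective itself is not written out in these notes; under the usual CART squared-error criterion (an assumption here), the best split of a node is the one that minimizes the within-region squared error of the two children:

% Sketch of the standard squared-error splitting objective for regression trees
\min_{\text{split}} \; \sum_{x_i \in R_1} \left( y_i - \bar{y}_{R_1} \right)^2 \; + \; \sum_{x_i \in R_2} \left( y_i - \bar{y}_{R_2} \right)^2

where R_1 and R_2 are the two child regions produced by the split and \bar{y}_R is the mean of the target values of the training samples in region R.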
Advantages
Easy to interpret and visualize
Can capture nonlinear relationships
No need for feature scaling
Disadvantages
Can overfit (trees grow too deep)
High variance (small data changes can alter structure)
Not as accurate as ensemble methods (e.g., Random Forests, Gradient Boosted Trees)
Bias-Variance Trade-off
1. It is important to understand the prediction errors (bias and variance).
2. The trade-off refers to finding the best balance between the two, for example when
selecting a value such as the regularization constant.
3. A proper understanding of these errors helps to avoid overfitting and
underfitting.
Bias:
1. Bias is the difference between the predicted value and the correct value.
2. High bias gives a large error on training as well as testing data.
3. An algorithm should always have low bias to avoid underfitting.
4. With high bias, the predicted values follow a straight-line pattern.
5. Such fitting is known as underfitting of the data.
6. This happens when the hypothesis is too simple or linear.
[Figure: a linear (high-bias) hypothesis underfitting the data]
Variance:
1. The variability of the model's prediction for a given data point is called the variance of
the model.
2. A model with high variance is very complex and fits the training data too closely.
3. Such a model performs very well on training data but has a high error rate on test
data.
4. When a model has high variance, it is said to be overfitting the data.
5. While training a model, the variance should be low.
[Figure: a complex (high-variance) hypothesis overfitting the training data]
Trade-off:
1. If the algorithm is too simple (a hypothesis with a linear equation), it may have high
bias and low variance, and is thus error-prone.
2. If the algorithm fits too complex a model (a hypothesis with a high-degree equation), it
may have high variance and low bias.
3. The balance between these two conditions is known as the Trade-off, or the
Bias-Variance Trade-off.
4. On a model-complexity graph, the ideal trade-off appears as shown below:
[Figure: model-complexity graph showing the bias-variance trade-off]
This point is referred to as the best point for training the algorithm, as it gives low
error on training as well as testing data.
Both the bias and the variance should be minimal when fitting the data to the algorithm.
In the best-fit region, the algorithm's performance is good.
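A minimal sketch of the trade-off in code (polynomial regression on synthetic data; the values and degree choices are illustrative assumptions): a degree-1 model underfits (high bias), a degree-15 model overfits (high variance), and an intermediate degree lands near the best-fit region.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_train, y_train, X_test, y_test = X[::2], y[::2], X[1::2], y[1::2]

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Low degree: high train and test error (bias); high degree: low train but high test error (variance)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")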
Naive Bayes Classifier Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes'
theorem and is used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of the Naive Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Why is it called Naive Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:
o Naive: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on
the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple. Hence each feature individually contributes to identifying it as an apple,
without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Dataset of weather conditions (used in the example below):
Day   Outlook    Play
1     Overcast   Yes
2     Sunny      Yes
3     Overcast   Yes
4     Overcast   Yes
5     Sunny      No
6     Rainy      Yes
7     Sunny      Yes
8     Overcast   Yes
9     Rainy      No
10    Sunny      No
11    Sunny      Yes
12    Rainy      Yes
13    Overcast   Yes
14    Rainy      No
Working of Naive Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the steps
below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the weather dataset given above.
Frequency table for the weather conditions:
Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4
Likelihood table for the weather conditions:
Weather     No             Yes            P(Weather)
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
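The same calculation can be sketched in code (counts taken from the frequency table above; exact fractions give 0.6 and 0.4, while the 0.60/0.41 in the notes come from rounding intermediate values):

# Class-conditional counts from the frequency table above
counts = {"Yes": {"Overcast": 5, "Rainy": 2, "Sunny": 3},
          "No":  {"Overcast": 0, "Rainy": 2, "Sunny": 2}}
total = 14

def posterior(label, weather="Sunny"):
    n_label = sum(counts[label].values())                                 # 10 for Yes, 4 for No
    likelihood = counts[label][weather] / n_label                         # P(weather | label)
    prior = n_label / total                                               # P(label)
    evidence = (counts["Yes"][weather] + counts["No"][weather]) / total   # P(weather)
    return likelihood * prior / evidence                                  # Bayes' theorem

print(round(posterior("Yes"), 2))   # 0.6 -> play
print(round(posterior("No"), 2))    # 0.4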
Advantages of Naive Bayes Classifier:
o Naive Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class prediction compared to other algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Applications of Naive Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naive Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.