Decision Tree Algorithm
Supervised ML
What is a Decision Tree?
A decision tree is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
Structure of a Decision Tree
Root Node: Represents the entire population or sample; it further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
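These terms map naturally onto a simple node data structure. The sketch below is purely illustrative (the class and field names are our own, not taken from any particular library): a decision node stores the attribute it splits on and its child nodes, while a leaf/terminal node stores a prediction.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    feature: Optional[str] = None       # attribute a decision node splits on (None for a leaf)
    threshold: Optional[float] = None   # split point, if the attribute is continuous
    children: List["Node"] = field(default_factory=list)  # sub-nodes produced by the split
    prediction: Optional[str] = None    # class label stored in a leaf/terminal node

    def is_leaf(self) -> bool:
        # A node that does not split further is a leaf/terminal node.
        return not self.children

# The root node is simply the topmost Node; every Node with children is a
# parent (decision) node, and its children form the branches/sub-trees.
```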
How does the Decision Tree Algorithm Work?
The basic idea behind any decision tree algorithm is as follows (a recursive code sketch follows this list):
Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
Make that attribute a decision node and break the dataset into smaller subsets.
Start building the tree by repeating this process recursively for each child until one of the following conditions is met:
All the tuples belong to the same class (target value).
There are no more remaining attributes.
There are no more instances.
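As a rough illustration of this recursion (not any particular library's implementation), the sketch below assumes the dataset is a list of dicts plus a list of class labels, represents the tree with plain dicts for brevity, and takes the attribute-selection function as a parameter; concrete selection measures such as entropy, information gain, and the Gini index are sketched in the sections below.

```python
def majority_class(labels):
    # Fallback prediction for a leaf: the most common class in the subset.
    return max(set(labels), key=labels.count)

def build_tree(rows, labels, attributes, choose_best_attribute):
    # Stopping condition 1: all the tuples belong to the same class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Stopping conditions 2 and 3: no remaining attributes or no remaining instances.
    if not attributes or not rows:
        return {"leaf": majority_class(labels) if labels else None}

    # Select the best attribute with an Attribute Selection Measure (ASM)
    # and make it a decision node.
    best = choose_best_attribute(rows, labels, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]

    # Break the dataset into smaller subsets, one per value of the chosen
    # attribute, and repeat the process recursively for each child.
    for value in set(row[best] for row in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining,
                                             choose_best_attribute)
    return node
```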
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that partitions the data in the best possible manner. It is also known as a splitting rule because it helps us determine breakpoints for tuples at a given node. An ASM assigns a rank to each feature (or attribute) by explaining the given dataset, and the attribute with the best score is selected as the splitting attribute. In the case of a continuous-valued attribute, split points for the branches also need to be defined.
The most popular selection measures are:
Entropy
Gini Index
Chi-Square
Gain Ratio.
What is Entropy?
Entropy is a measure of the uncertainty or impurity in a dataset. It quantifies the
amount of disorder or randomness. In the context of a decision tree, entropy helps
to determine how informative a particular split is.
High Entropy: Indicates high disorder, meaning the data is diverse and
uncertain.
Low Entropy: Indicates low disorder, meaning the data is more homogeneous
and certain.
The formula for entropy H for a binary classification problem is:
H(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)
where:
p₊ is the proportion of positive examples in the dataset S
p₋ is the proportion of negative examples in the dataset S
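As a quick sanity check of this formula, here is a small Python sketch (the function name entropy is ours, not from a library); it generalises to more than two classes by summing −p log₂(p) over the classes actually present.

```python
import math

def entropy(labels):
    # H(S) = -p_plus*log2(p_plus) - p_minus*log2(p_minus), generalised to any
    # number of classes; only classes actually present contribute a term.
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

print(entropy(["+", "+", "+", "+"]))   # 0.0 -> all one class: low entropy
print(entropy(["+", "+", "-", "-"]))   # 1.0 -> 50/50 split: high entropy
```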
What is Information Gain?
Information Gain (IG) is a measure of the effectiveness of an attribute in
classifying the training data. It quantifies the reduction in entropy (uncertainty)
achieved by splitting the dataset based on an attribute.
The formula for Information Gain is:
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) * Entropy(Sv)
where the sum runs over the values v of attribute A, and:
• S is the original dataset
• A is the attribute being evaluated
• Sv is the subset of S for which attribute A has value v
• Entropy(S) is the entropy of the original dataset
• Entropy(Sv) is the entropy of the subset Sv
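A short sketch of this computation, assuming the entropy() helper from the previous snippet and the same list-of-dicts dataset layout (both are our own conventions for this example):

```python
def information_gain(rows, labels, attribute):
    # Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv| / |S|) * Entropy(Sv)
    total = entropy(labels)
    n = len(labels)
    weighted = 0.0
    for value in set(row[attribute] for row in rows):
        sub_labels = [l for r, l in zip(rows, labels) if r[attribute] == value]
        weighted += (len(sub_labels) / n) * entropy(sub_labels)
    return total - weighted

# An attribute that separates the classes perfectly has gain equal to the
# parent entropy (here 1.0); an uninformative attribute has gain 0.
rows = [{"wind": "weak"}, {"wind": "weak"}, {"wind": "strong"}, {"wind": "strong"}]
labels = ["+", "+", "-", "-"]
print(information_gain(rows, labels, "wind"))   # 1.0
```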
Gini index
Another decision tree algorithm, CART (Classification and Regression Trees), uses the Gini method to create split points. The Gini index of a dataset D is:
Gini(D) = 1 − Σ pi²
where pi is the probability that a tuple in D belongs to class Ci.
The Gini index considers a binary split for each attribute, and you compute a weighted sum of the impurity of each partition. If a binary split on attribute A partitions data D into D1 and D2, the Gini index of D is:
GiniA(D) = (|D1| / |D|) * Gini(D1) + (|D2| / |D|) * Gini(D2)
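A minimal sketch of both formulas (the function names are ours, not from a library):

```python
def gini(labels):
    # Gini(D) = 1 - sum of pi^2, where pi is the proportion of class Ci in D.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(labels_d1, labels_d2):
    # Weighted Gini index of a binary split of D into D1 and D2:
    # GiniA(D) = (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)
    n = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / n) * gini(labels_d1) + (len(labels_d2) / n) * gini(labels_d2)

print(gini(["yes", "yes", "no", "no"]))           # 0.5 -> perfectly mixed node
print(gini_split(["yes", "yes"], ["no", "no"]))   # 0.0 -> the split yields pure partitions
```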
Decision tree algorithms:
CART (Classification and Regression Trees) → uses the Gini index (classification) as its metric.
ID3 (Iterative Dichotomiser 3) → uses the entropy function and information gain as its metrics.
Information Gain:
By using information gain as a criterion, we try to estimate the information contained in each attribute, borrowing a few ideas from information theory.
The randomness or uncertainty of a random variable X is measured by its entropy.
Consider a binary classification problem with only two classes, a positive and a negative class:
If all examples are positive or all are negative, the entropy is zero, i.e., low.
If half of the records belong to the positive class and half to the negative class, the entropy is one, i.e., high.
By calculating the entropy measure of each attribute, we can calculate its information gain. Information gain measures the expected reduction in entropy due to sorting on the attribute.
Entropy can be calculated using the formula:
Entropy = −p log₂(p) − q log₂(q)
where p and q are the probabilities of success and failure, respectively, in that node.
Entropy is also used with a categorical target variable. The algorithm chooses the split that has the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better.
Steps to calculate entropy for a split:
Calculate the entropy of the parent node.
Calculate the entropy of each individual node of the split, then take the weighted average of all sub-nodes in the split.
PROCEDURE
First the entropy of the total dataset is calculated.
The dataset is then split on the different attributes.
The entropy for each branch is calculated, then added proportionally to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before
the split.
The result is the Information Gain, or decrease in entropy.
The attribute that yields the largest IG is chosen for the decision
node.
EXAMPLE
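As a small, purely hypothetical illustration of the procedure (the counts below are made up for this example): suppose the parent node holds 10 records, 6 positive and 4 negative, and a candidate attribute splits them into one branch of 7 records (5 positive, 2 negative) and one branch of 3 records (1 positive, 2 negative).

```python
import math

def entropy_from_counts(pos, neg):
    # Entropy of a node given its class counts; empty classes contribute 0.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Step 1: entropy of the total dataset (parent node): 6 positive, 4 negative.
parent = entropy_from_counts(6, 4)                  # ~0.971

# Steps 2-3: entropy of each branch, added proportionally to get the
# total entropy for the split.
branch_a = entropy_from_counts(5, 2)                # 7 records, ~0.863
branch_b = entropy_from_counts(1, 2)                # 3 records, ~0.918
split = (7 / 10) * branch_a + (3 / 10) * branch_b   # ~0.880

# Steps 4-5: subtract to get the Information Gain for this attribute.
gain = parent - split                               # ~0.091
print(round(parent, 3), round(split, 3), round(gain, 3))
```

The attribute with the largest gain across all candidates would become the decision node.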
How can we avoid overfitting in decision trees?
Overfitting is a practical problem when building a decision tree model. A model is considered to be overfitting when the algorithm keeps going deeper and deeper into the tree to reduce the training-set error but ends up with an increased test-set error, i.e., the prediction accuracy of our model goes down. This generally happens when the tree builds many branches due to outliers and irregularities in the data.
Two approaches which we can use to avoid overfitting are:
Pre-Pruning
Post-Pruning
Pre-Pruning
In pre-pruning, we stop the tree construction a bit early. It is preferred not to split a node if its goodness measure is below a threshold value, but it is difficult to choose an appropriate stopping point.
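In practice, pre-pruning usually means capping tree growth through hyperparameters before training. As one possible illustration using scikit-learn's DecisionTreeClassifier (the specific threshold values below are arbitrary choices for the sketch, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the construction early by limiting depth and by refusing
# to split nodes that are too small or whose split gains too little.
tree = DecisionTreeClassifier(
    max_depth=3,                  # do not grow the tree beyond this depth
    min_samples_split=10,         # do not split a node with fewer samples than this
    min_impurity_decrease=0.01,   # do not split unless impurity drops by at least this
    random_state=0,
)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```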
Post-Pruning
In post-pruning, we first go deeper and deeper into the tree to build a complete tree. If the tree shows the overfitting problem, pruning is then done as a post-processing step. We use cross-validation data to check the effect of our pruning: using the cross-validation data, we test whether expanding a node makes an improvement or not.
If it shows an improvement, then we can continue expanding that node.
But if it shows a reduction in accuracy, the node should not be expanded, i.e., it should be converted to a leaf node.
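One concrete way to post-prune is cost-complexity pruning, which scikit-learn exposes through ccp_alpha. The sketch below grows a full tree, generates candidate pruned trees, and keeps the one that does best on held-out data (a single validation split is used here in place of full cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# First grow a complete (unpruned) tree, then get candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Evaluate each pruned tree on held-out data and keep the best: branches whose
# removal does not reduce (or even improves) accuracy end up pruned away.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print("chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```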