MODULE III
DECISION TREE
A decision tree is a flowchart-like tree structure, where
▪ each internal node denotes a test on an attribute,
▪ each branch represents an outcome of the test,
▪ each leaf node (terminal node) holds a class label.
A decision tree is a hierarchical model for supervised learning
whereby the local region is identified in a sequence of recursive splits
in a smaller number of steps.
▪ A decision tree is composed of internal decision nodes and
terminal leaves.
▪ Each decision node m implements a test function fm(x) with
discrete outcomes labeling the branches.
▪ Given an input, at each node, a test is applied and one of the
branches is taken depending on the outcome.
▪ This process starts at the root and is repeated recursively until a
leaf node is hit, at which point the value written in the leaf
constitutes the output.
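As a quick illustration, the traversal just described can be sketched in a few lines of Python. The node classes, attribute names and the example tree below are illustrative assumptions (the tree happens to match the Play Tennis tree derived later in this module), not part of any particular library:

from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass
class Leaf:
    label: str                           # the value written in the leaf

@dataclass
class DecisionNode:
    test: Callable[[dict], str]          # test function f_m(x) with discrete outcomes
    branches: Dict[str, Union["DecisionNode", Leaf]]

def predict(node, x):
    """Start at the root, apply the test at each node, follow the branch
    for its outcome, and return the label of the leaf that is hit."""
    while isinstance(node, DecisionNode):
        node = node.branches[node.test(x)]
    return node.label

# Illustrative tree over hypothetical attributes:
root = DecisionNode(
    test=lambda x: x["outlook"],
    branches={
        "overcast": Leaf("yes"),
        "sunny": DecisionNode(lambda x: x["humidity"],
                              {"high": Leaf("no"), "normal": Leaf("yes")}),
        "rain": DecisionNode(lambda x: x["wind"],
                             {"weak": Leaf("yes"), "strong": Leaf("no")}),
    },
)
print(predict(root, {"outlook": "sunny", "humidity": "normal"}))  # -> yes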
Feature Selection Method
If a dataset consists of n attributes, then deciding which attribute is to
be placed at the root or at different levels of the tree as internal nodes
is a complicated problem.
▪ The most important problem in implementing the decision tree
algorithm is deciding which feature should be tested at the root node
and which at each subsequent level.
Popular feature selection measures are
•Information gain
•Gini index
•Gain Ratio
Entropy
The degree to which a subset of examples contains only a single class
is known as purity, and any subset composed of only a single class is
called a pure class.
▪ Informally, entropy is a measure of “impurity” in a dataset.
▪ Entropy is measured in bits.
▪ If there are only two possible classes, entropy values can range
from 0 to 1.
▪ For n classes, entropy ranges from 0 to log2(n).
▪ In each case, the minimum value indicates that the sample is
completely homogeneous, while the maximum value indicates
that the data are as diverse as possible.
Entropy is a measure of the randomness in the information being
processed.
▪ The higher the entropy, the harder it is to draw any conclusions
from that information.
▪ Consider a segment S of a dataset having c class labels.
▪ Let pi be the proportion of examples in S having the ith class
label.
▪ The entropy of S is defined as
Entropy(S) = − Σ (i = 1 to c) pi log2(pi)
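A minimal Python sketch of this definition (the function name and the example label lists are illustrative):

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a segment, given its list of class labels."""
    n = len(labels)
    return sum(-count / n * math.log2(count / n)
               for count in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # 1.0  (two classes, 50/50 split)
print(entropy(["yes"] * 9 + ["no"] * 5))      # ≈ 0.940 (9 "yes" / 5 "no")
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0  (a pure segment)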
Information Gain
▪ Information gain tells how important a given attribute of the
feature vectors is.
▪ Used to decide the ordering of attributes in the nodes of a
decision tree.
▪ Let S be a set of examples, A be a feature (or attribute), Sv be the
subset of S with A = v, and Values(A) be the set of all possible values of A.
▪ Then the information gain of an attribute A relative to the set S,
denoted by Gain(S, A), is defined as
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
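A direct Python translation of this formula, reusing the entropy() sketch above; the dictionary-based representation of examples is an assumption made for illustration:

def information_gain(examples, labels, attribute):
    """Gain(S, A), where examples is a list of dicts {attribute name: value},
    labels the corresponding class labels, and attribute the name of A.
    Reuses entropy() from the earlier sketch."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        sv_labels = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= len(sv_labels) / n * entropy(sv_labels)   # |Sv|/|S| * Entropy(Sv)
    return gain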
▪ Example of Information Gain
▪ Consider the previous data for target concept “Play Tennis”.
▪ Calculation of Gain(S, outlook):
▪ The values of the attribute “outlook” are “sunny”, “overcast”
and “rain”.
▪ Calculate Entropy(Sv) for v = sunny, v = overcast and v = rain.
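The calculation can be checked numerically. The class counts below assume the standard 14-example Play Tennis data (9 “yes” and 5 “no” overall; sunny: 2 yes / 3 no, overcast: 4 yes / 0 no, rain: 3 yes / 2 no):

import math

def H(pos, neg):
    """Entropy of a segment with pos positive and neg negative examples."""
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

entropy_S = H(9, 5)                                  # ≈ 0.940
weighted  = (5/14) * H(2, 3) + (4/14) * H(4, 0) + (5/14) * H(3, 2)
print(entropy_S - weighted)                          # ≈ 0.247 = Gain(S, outlook)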
ID3 Algorithm
▪ Algorithm used to generate a decision tree.
▪ The ID3 algorithm was invented by Ross Quinlan.
▪ ID3 follows the Occam’s razor principle: it attempts to create the
smallest possible decision tree.
Step 1: Create a root node for the tree.
Step 2: Note that not all examples are positive (class label “yes”), not
all examples are negative (class label “no”), and the number of
features is not zero. (If all examples shared one label, or no features
remained, the node would simply become a leaf.)
Step 3: Decide which feature is to be placed at the root node.
For this, calculate the information gains corresponding to each of the
four features.
Step 4: Find the highest information gain, which is the maximum
among Gain(S, outlook), Gain(S, temperature), Gain(S, humidity)
and Gain(S, wind).
•Highest information gain = max{0.2469, 0.0293, 0.151, 0.048} =
0.2469
•This corresponds to the feature “outlook”.
•Therefore, place “outlook” at the root node.
Step 5: Split the root node on “outlook”, giving branches for “sunny”,
“overcast” and “rain”. For the branch “sunny”, let S(1) = S_outlook=sunny.
The highest information gain among Gain(S(1), temperature),
Gain(S(1), humidity) and Gain(S(1), wind) is Gain(S(1), humidity), so split
this node on “humidity”; the branches “high” and “normal” lead to Node4
and Node5 in the figure.
Step 6: All the examples in the dataset corresponding to Node4 in the
figure have the same class label “no”, and all the examples
corresponding to Node5 have the same class label “yes”.
•So represent Node4 as a leaf node with value “no” and Node5 as a
leaf node with value “yes”.
•Similarly, all the examples corresponding to Node2 have the same
class label “yes”. So convert Node2 into a leaf node with value “yes”.
•Finally, let S(2) = S_outlook=rain. The highest information gain among
Gain(S(2), temperature), Gain(S(2), humidity) and
Gain(S(2), wind) = max{0.02, 0.02, 0.9710} is Gain(S(2), wind), so split
this node on “wind”.
•The branches resulting from this split, corresponding to the values
“weak” and “strong” of “wind”, lead to leaf nodes with class labels
“yes” and “no”.
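Steps 1–6 can be summarised as a short recursive function. The sketch below reuses the entropy() and information_gain() helpers from the earlier sketches and the same assumed dictionary representation of examples; it is an illustrative sketch, not a reference implementation of Quinlan’s ID3:

from collections import Counter

def id3(examples, labels, attributes):
    """Return a nested-dict decision tree (or a class label for a leaf)."""
    # Base case 1: all examples share one label -> leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> leaf with the majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Recursive case: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for v in set(x[best] for x in examples):
        sub = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        sub_x, sub_y = [x for x, _ in sub], [y for _, y in sub]
        tree[best][v] = id3(sub_x, sub_y, [a for a in attributes if a != best])
    return tree

# On the Play Tennis data this produces the tree derived above:
# {"outlook": {"overcast": "yes",
#              "sunny": {"humidity": {"high": "no", "normal": "yes"}},
#              "rain":  {"wind": {"weak": "yes", "strong": "no"}}}}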
Gini Index
•Consider a data set S having r class labels c1, …, cr.
•Let pi be the proportion of examples having the class label ci.
•The Gini index of the data set S, denoted by Gini(S), is defined by
Gini(S) = 1 − Σ (i = 1 to r) pi²
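A minimal Python sketch of the Gini index (names are illustrative):

from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum of pi**2 over the class proportions pi."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))   # 0.5 (maximum impurity for 2 classes)
print(gini(["yes", "yes", "yes"]))        # 0.0 (a pure data set)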
Gain Ratio
Let S be a set of examples, A be a feature having c different values,
and let the set of values of A be denoted by Values(A).
•The information gain of A relative to S, denoted by Gain(S, A), is
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
•The split information of A relative to S, denoted by
SplitInformation(S, A), is
SplitInformation(S, A) = − Σ (i = 1 to c) (|Si| / |S|) log2(|Si| / |S|)
•where S1, …, Sc are the c subsets of examples resulting from
partitioning S by the c values of the attribute A.
The gain ratio of A relative to S is defined as
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
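Putting the three formulas together in Python, again reusing the earlier entropy() and information_gain() sketches and the same assumed example representation:

import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = - sum over i of |Si|/|S| * log2(|Si|/|S|)."""
    n = len(examples)
    counts = Counter(x[attribute] for x in examples)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(examples, attribute)
    if si == 0:                      # A takes only a single value on S
        return 0.0
    return information_gain(examples, labels, attribute) / si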
Surrogate Splits
▪ Surrogate splits are used to classify test samples having missing
values.
▪ A surrogate split mimics the outcome of the actual (primary) split
using a different attribute, so the split can still be applied when the
primary attribute value is missing.
▪ In effect, another decision rule is created to predict which branch of
the actual split an example would take.
▪ The number of surrogates that can be used depends on the training
data
Ensemble Methods
Ensemble methods combine several decision trees to produce
better predictive performance than a single decision tree. The
main principle behind the ensemble model is that a group of weak
learners come together to form a strong learner.
Techniques to build ensembles of decision trees:
1. Bagging
2. Boosting
Bagging:
▪ Is used when our goal is to reduce the variance of a decision
tree.
▪ Each tree is built from a subset of the training dataset obtained by
randomly sampling the training dataset with replacement.
▪ This sampling technique is called bootstrapping, and the overall
procedure (bootstrap aggregating) is called bagging.
▪ When training the decision trees on the bootstrapped samples, the
goal is to build very deep trees that overfit their own sample. Each
such tree has low bias but very high variance. Because multiple
trees are combined for the final prediction in the ensemble, the
errors of the individual overfitting trees average out: the result is a
model that does not overfit to any one sample, so the ensemble
keeps the low bias of the deep trees while having much lower
variance. So, while each tree closely models its own data sample,
the combined ensemble is a more powerful and more robust model.
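A minimal bagging sketch using scikit-learn (assumed to be available); the default base learner of BaggingClassifier is a decision tree, and the iris data is only a stand-in for any training set:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 50 trees, each fitted on a bootstrap sample drawn with replacement,
# and combined by majority vote at prediction time.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())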
Boosting:
▪ Boosting means that each tree is dependent on prior trees.
▪ The algorithm learns by fitting the residual of the trees that
preceded it.
▪ Thus, boosting in a decision tree ensemble tends to improve
accuracy with some small risk of less coverage.
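A minimal boosting sketch using scikit-learn’s GradientBoostingClassifier (assumed to be available), whose shallow trees are each fit to the residual errors of the trees that preceded them, as described above; the iris data is again only a placeholder:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Trees are added sequentially; each new shallow tree corrects the
# residual errors of the current ensemble.
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      max_depth=3, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())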