
Decision Tree

Introduction to Tree Based Algorithm


• Tree based algorithms are among the best and most widely used supervised
learning methods. They empower predictive models with high accuracy,
stability and ease of interpretation. Unlike linear models, they map
non-linear relationships quite well, and they adapt to both classification
and regression problems.
• Methods like decision trees, random forests and gradient boosting are
popular across all kinds of data science problems.
• Hence, it is important for every analyst (including freshers) to learn
these algorithms and use them for modeling.
Decision Tree
• A decision tree is a type of supervised learning algorithm (with a
predefined target variable) that is mostly used in classification problems.
It works for both categorical and continuous input and output variables.
In this technique, we split the population or sample into two or more
homogeneous sets (sub-populations) based on the most significant
splitter / differentiator among the input variables.
Decision Tree
Example
• Let's say we have a sample of 30 students with three variables: Gender
(Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 of these 30 play cricket
in their leisure time. We want to create a model to predict who will play
cricket during leisure time. For this, we need to segregate the students who
play cricket based on the most significant of the three input variables.
• This is where a decision tree helps: it segregates the students on all
values of the three variables and identifies the variable that creates the
best homogeneous sets of students (sets that are heterogeneous to each
other). In this example, the variable Gender identifies the best homogeneous
sets compared to the other two variables.
• As mentioned, the decision tree identifies the most significant variable,
and the value of it, that gives the best homogeneous sets of the population.
The question that now arises is: how does it identify the variable and the
split? To do this, a decision tree uses various algorithms (a small
fitted-tree sketch follows below).
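As a minimal illustration, the sketch below fits a one-level scikit-learn decision tree on a small hypothetical dataset that mirrors the Gender counts used later in the Gini example (2 of 10 girls and 13 of 20 boys playing); the data itself is fabricated for demonstration only.

```python
# Minimal sketch, assuming hypothetical data that mirrors the example's counts
# (10 girls of whom 2 play cricket, 20 boys of whom 13 play).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [("Girl", 1)] * 2 + [("Girl", 0)] * 8 + [("Boy", 1)] * 13 + [("Boy", 0)] * 7
df = pd.DataFrame(rows, columns=["Gender", "PlaysCricket"])

X = pd.get_dummies(df[["Gender"]])   # one-hot encode the categorical input
y = df["PlaysCricket"]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # shows the single split on Gender
```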
Types of Decision Trees
Types of decision tree is based on the type of
target variable we have. It can be of two types:
• Categorical Variable Decision Tree: a decision tree with a categorical
target variable is called a categorical variable decision tree.
Example: in the student problem above, the target variable was "Student
will play cricket or not", i.e. YES or NO.
Types of Decision Trees
• Continuous Variable Decision Tree: a decision tree with a continuous
target variable is called a continuous variable decision tree.
• Example:- Say we want to predict whether a customer will pay his renewal
premium with an insurance company (yes/no). We know that the income of the
customer is a significant variable, but the insurance company does not have
income details for all customers. Since this is an important variable, we
can build a decision tree to predict customer income based on occupation,
product and various other variables. In this case, we are predicting values
of a continuous variable.
Important Terminology related to Tree
based Algorithms
Terminology
• Root Node: represents the entire population or sample; it gets divided
further into two or more homogeneous sets.
• Splitting: the process of dividing a node into two or more sub-nodes.
• Decision Node: a sub-node that splits into further sub-nodes.
• Leaf / Terminal Node: a node that does not split further.
Terminology
• Pruning: removing sub-nodes of a decision node; it is the opposite of
splitting.
• Branch / Sub-Tree: a sub-section of the entire tree.
• Parent and Child Node: a node that is divided into sub-nodes is called the
parent node of those sub-nodes, and the sub-nodes are its children.
These are the terms commonly used for decision trees. Since every algorithm
has advantages and disadvantages, below are the important factors one
should know.
Advantages

• Easy to Understand: even for people from a non-analytical background. It does not
require any statistical knowledge to read and interpret a tree. Its graphical
representation is very intuitive and users can easily relate it to their hypotheses.
• Useful in Data Exploration: a decision tree is one of the fastest ways to identify
the most significant variables and the relations between two or more variables. With
the help of decision trees, we can create new variables / features that have better
power to predict the target variable. For example, when working on a problem where
information is available in hundreds of variables, a decision tree helps identify
the most significant ones.
• Less Data Cleaning Required: it requires less data cleaning than some other
modeling techniques and is fairly robust to outliers and missing values.
• Data Type Is Not a Constraint: it can handle both numerical and categorical variables.
• Non-Parametric Method: a decision tree is a non-parametric method, meaning it makes
no assumptions about the space distribution or the classifier structure.
Disadvantages

• Overfitting: overfitting is one of the most practical difficulties for decision tree
models. It is addressed by setting constraints on model parameters and by pruning
(discussed below).
• Not Ideal for Continuous Variables: while working with continuous numerical
variables, a decision tree loses information when it categorizes the variable into
different bins.
Regression Trees vs Classification Trees
• The terminal nodes (or leaves) lie at the bottom of the decision tree:
decision trees are typically drawn upside down, with the leaves at the
bottom and the root at the top.
Regression Trees vs Classification Trees
• Regression trees are used when the dependent variable is continuous;
classification trees are used when the dependent variable is categorical.
• In a regression tree, the value obtained at a terminal node in the
training data is the mean response of the observations falling in that
region. Thus, if an unseen observation falls in that region, its prediction
is that mean value.
• In a classification tree, the value (class) obtained at a terminal node
in the training data is the mode of the observations falling in that
region. Thus, if an unseen observation falls in that region, its prediction
is that mode value.
• Both types of tree divide the predictor space (the independent variables)
into distinct, non-overlapping regions. For the sake of simplicity, you can
think of these regions as high-dimensional boxes.
Regression Trees vs Classification Trees
• Both types of tree follow a top-down greedy approach known as recursive binary
splitting. It is 'top-down' because it begins at the top of the tree, when all the
observations are in a single region, and successively splits the predictor space
into two new branches further down the tree. It is 'greedy' because the algorithm
cares only about the current split (it looks for the best variable available now),
not about future splits that might lead to a better tree.
• This splitting process continues until a user-defined stopping criterion is
reached. For example, we can tell the algorithm to stop once the number of
observations per node becomes less than 50.
• In both cases, the splitting process results in fully grown trees until the
stopping criterion is reached. But a fully grown tree is likely to overfit the data,
leading to poor accuracy on unseen data. This brings in 'pruning', one of the
techniques used to tackle overfitting. We will learn more about it in a following
section.
How does a tree based algorithm decide where to split?

• The decision to make strategic splits heavily affects a tree's accuracy, and the
decision criteria differ for classification and regression trees.
• Decision trees use multiple algorithms to decide how to split a node into two or
more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting
sub-nodes; in other words, the purity of the node increases with respect to the
target variable. A decision tree tries splits on all available variables and then
selects the split that results in the most homogeneous sub-nodes.
• The choice of algorithm also depends on the type of target variable. Let's look at
the four most commonly used algorithms in decision trees:
• Gini
• Chi-Square
• Entropy (Information Gain)
• Reduction in Variance
Gini
• Gini says that if we select two items from a population at random, they
must be of the same class; the probability of this is 1 if the population
is pure.
• It works with a categorical target variable ("Success" or "Failure").
• It performs only binary splits.
• The higher the value of Gini, the higher the homogeneity.
• CART (Classification and Regression Tree) uses the Gini method to create
binary splits.
Steps to Calculate Gini for a split
• Calculate Gini for each sub-node using the formula: sum of the squares of
the probabilities of success and failure (p^2 + q^2).
• Calculate Gini for the split as the weighted Gini score of the nodes of
that split.
• Referring to the example above, where we want to segregate the students
based on the target variable (playing cricket or not), we split the
population using two input variables, Gender and Class, and identify which
split produces the more homogeneous sub-nodes using Gini.
• Split on Gender:
• Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
• Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
• Weighted Gini for the split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
• Similarly, for the split on Class:
• Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
• Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
• Weighted Gini for the split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
• Above, you can see that the Gini score for the split on Gender is higher
than for the split on Class, hence the node split will take place on Gender
(see the sketch below).
• You will often come across the term 'Gini Impurity', which is obtained by
subtracting the Gini value from 1. Mathematically:
• Gini Impurity = 1 − Gini
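The same arithmetic can be checked with a short Python sketch; the counts and probabilities are those from the example above.

```python
# Reproduce the slide's Gini arithmetic for the Gender and Class splits.
def gini(p_success: float) -> float:
    """Gini score as defined on the slide: p^2 + q^2 (note: not 1 - sum(p^2))."""
    q = 1.0 - p_success
    return p_success ** 2 + q ** 2

def weighted_gini(groups):
    """groups: list of (group_size, p_success) pairs, one per sub-node."""
    total = sum(size for size, _ in groups)
    return sum(size / total * gini(p) for size, p in groups)

# Split on Gender: Female (10 students, 20% play), Male (20 students, 65% play)
print(round(weighted_gini([(10, 0.2), (20, 0.65)]), 2))   # -> 0.59
# Split on Class: IX (14 students, 43% play), X (16 students, 56% play)
print(round(weighted_gini([(14, 0.43), (16, 0.56)]), 2))  # -> 0.51
```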
Chi-Square

• Chi-square is an algorithm to find the statistical significance of the differences
between sub-nodes and the parent node. We measure it by the sum of squares of the
standardized differences between the observed and expected frequencies of the
target variable.
• It works with a categorical target variable ("Success" or "Failure").
• It can perform two or more splits.
• The higher the value of Chi-square, the higher the statistical significance of the
differences between a sub-node and the parent node.
• The Chi-square of each node is calculated using the formula:
• Chi-square = ((Actual − Expected)^2 / Expected)^(1/2)
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Steps to Calculate Chi-square for a split:

• Calculate the Chi-square for each individual node by computing the
deviation for both Success and Failure.
• Calculate the Chi-square of the split as the sum of the Chi-square values
for Success and Failure of each node of the split.
Example: let's work with the example we used above to calculate Gini.
Split on Gender:

• First we populate the Female node with the actual values for "Play Cricket" and
"Not Play Cricket"; here these are 2 and 8 respectively.
• Calculate the expected values for "Play Cricket" and "Not Play Cricket". Here both
are 5, because the parent node has a probability of 50% and we apply the same
probability to the Female count (10).
• Calculate the deviations using Actual − Expected: for "Play Cricket" it is
2 − 5 = −3 and for "Not Play Cricket" it is 8 − 5 = 3.
• Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket"
using the formula ((Actual − Expected)^2 / Expected)^(1/2).
• Follow similar steps to calculate the Chi-square values for the Male node.
• Now add all the Chi-square values to obtain the Chi-square for the split on Gender.
Split on Class:
• Perform similar steps of calculation for the split on Class.
Chi-square also identifies the Gender split as more significant than the Class
split (see the sketch below).
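A short Python sketch of the Gender-split arithmetic described above; the expected counts assume the parent node's 50/50 play / not-play ratio.

```python
# Reproduce the slide's chi-square arithmetic for the Gender split.
from math import sqrt

def node_chi(actual_play, actual_not, total):
    """Chi-square contribution of one node, per the slide's formula."""
    expected = total / 2  # parent probability of 50% applied to the node size
    chi_play = sqrt((actual_play - expected) ** 2 / expected)
    chi_not = sqrt((actual_not - expected) ** 2 / expected)
    return chi_play + chi_not

# Female node: 2 play, 8 do not (10 students); Male node: 13 play, 7 do not (20).
chi_gender = node_chi(2, 8, 10) + node_chi(13, 7, 20)
print(round(chi_gender, 2))  # ~4.58 for the split on Gender
```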
Information Gain
• Imagine three nodes A, B and C, where C contains almost identical values, B is
somewhat mixed and A is the most mixed. Which node can be described most easily?
• C, because it requires the least information, as all its values are similar. B
requires more information to describe, and A requires the most. In other words,
C is a pure node, B is less impure and A is the most impure.
• We can therefore conclude that a less impure node requires less information to
describe it, and a more impure node requires more. Information theory defines this
degree of disorganization in a system as Entropy. If the sample is completely
homogeneous, the entropy is zero; if the sample is equally divided (50% / 50%),
the entropy is one.
• Entropy can be calculated using the formula:
• Entropy = −p log2(p) − q log2(q)
• Here p and q are the probabilities of success and failure respectively in that
node. Entropy is also used with a categorical target variable. We choose the split
that has the lowest entropy compared to the parent node and the other candidate
splits; the lower the entropy, the better.
Steps to calculate entropy for a split:

• Calculate the entropy of the parent node.
• Calculate the entropy of each individual node of the split, then compute
the weighted average over all sub-nodes in the split.
Entropy
• Entropy for the parent node = −(15/30) log2(15/30) − (15/30) log2(15/30) = 1.
A value of 1 shows that it is an impure node.
• Entropy for the Female node = −(2/10) log2(2/10) − (8/10) log2(8/10) = 0.72, and
for the Male node = −(13/20) log2(13/20) − (7/20) log2(7/20) = 0.93.
• Entropy for the split on Gender = weighted entropy of the sub-nodes =
(10/30)*0.72 + (20/30)*0.93 = 0.86.
• Entropy for the Class IX node = −(6/14) log2(6/14) − (8/14) log2(8/14) = 0.99, and
for the Class X node = −(9/16) log2(9/16) − (7/16) log2(7/16) = 0.99.
• Entropy for the split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99.
Entropy
• Above, you can see that the entropy for the split on Gender is the lowest,
so the tree will split on Gender. Information gain is the parent node's
entropy minus the weighted entropy of the split; since the parent entropy
here is 1, information gain = 1 − Entropy of the split (see the sketch
below).
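The entropy arithmetic above can be reproduced with a few lines of Python.

```python
# Reproduce the slide's entropy arithmetic.
from math import log2

def entropy(p):
    """Entropy of a node with success probability p (0 if the node is pure)."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -p * log2(p) - q * log2(q)

parent = entropy(15 / 30)                                             # 1.0
gender = (10 / 30) * entropy(2 / 10) + (20 / 30) * entropy(13 / 20)   # ~0.86
klass = (14 / 30) * entropy(6 / 14) + (16 / 30) * entropy(9 / 16)     # ~0.99
print(round(parent, 2), round(gender, 2), round(klass, 2))
```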
Reduction in Variance
• So far we have discussed algorithms for a categorical target variable.
Reduction in variance is an algorithm used for continuous target variables
(regression problems). The split with the lower weighted variance is
selected as the split of the population:
Steps to calculate Variance:
• Calculate the variance for each node.
• Calculate the variance for each split as the weighted average of the node
variances.
• Let's assign the numerical value 1 for playing cricket and 0 for not playing.
• For the root node the mean is (15*1 + 15*0)/30 = 0.5, with 15 ones and 15 zeros.
The variance is ((1−0.5)^2 + … 15 times + (0−0.5)^2 + … 15 times) / 30, which can
be written as (15*(1−0.5)^2 + 15*(0−0.5)^2) / 30 = 0.25.
• Mean of the Female node = (2*1 + 8*0)/10 = 0.2 and variance =
(2*(1−0.2)^2 + 8*(0−0.2)^2) / 10 = 0.16.
• Mean of the Male node = (13*1 + 7*0)/20 = 0.65 and variance =
(13*(1−0.65)^2 + 7*(0−0.65)^2) / 20 = 0.23.
• Variance for the split on Gender = weighted variance of the sub-nodes =
(10/30)*0.16 + (20/30)*0.23 = 0.21.
• Mean of the Class IX node = (6*1 + 8*0)/14 = 0.43 and variance =
(6*(1−0.43)^2 + 8*(0−0.43)^2) / 14 = 0.24.
• Mean of the Class X node = (9*1 + 7*0)/16 = 0.56 and variance =
(9*(1−0.56)^2 + 7*(0−0.56)^2) / 16 = 0.25.
• Variance for the split on Class = (14/30)*0.24 + (16/30)*0.25 = 0.25.
• The variance for the Gender split (0.21) is lower than for the Class split (0.25),
so once again the split will take place on Gender (see the sketch below).
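The same variance arithmetic as a short Python sketch.

```python
# Reproduce the slide's reduction-in-variance arithmetic
# (play cricket = 1, not play = 0).
def node_variance(n_ones, n_zeros):
    """Variance of a node containing n_ones 1s and n_zeros 0s."""
    n = n_ones + n_zeros
    mean = n_ones / n
    return (n_ones * (1 - mean) ** 2 + n_zeros * (0 - mean) ** 2) / n

root = node_variance(15, 15)                                                 # 0.25
gender = (10 / 30) * node_variance(2, 8) + (20 / 30) * node_variance(13, 7)  # ~0.21
klass = (14 / 30) * node_variance(6, 8) + (16 / 30) * node_variance(9, 7)    # ~0.25
print(round(root, 2), round(gender, 2), round(klass, 2))
```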
Key parameters of tree based algorithms

• Overfitting is one of the key challenges faced while using tree based algorithms.
If no limit is set on a decision tree, it can give 100% accuracy on the training
set, because in the worst case it ends up making one leaf per observation. Thus,
preventing overfitting is pivotal while modeling a decision tree, and it can be
done in 2 ways:
• Setting constraints on tree size
• Tree pruning
Setting Constraints on tree based algorithms
Minimum samples for a node split

– Defines the minimum number of samples (or observations) required in a node
for it to be considered for splitting.
– Used to control over-fitting. Higher values prevent the model from learning
relations that might be highly specific to the particular sample selected
for a tree.
– Values that are too high can lead to under-fitting, so this parameter
should be tuned using CV.
Minimum samples for a terminal node
(leaf)
• Defines the minimum samples (or
observations) required in a terminal node or
leaf.
• Used to control over-fitting similar to
min_samples_split.
• Generally lower values should be chosen for
imbalanced class problems because the
regions in which the minority class will be in
majority will be very small.
Maximum depth of tree (vertical depth)

• The maximum depth of a tree.
• Used to control over-fitting, as higher depth allows the model to learn
relations very specific to a particular sample.
• Should be tuned using CV.
Maximum number of terminal nodes
• The maximum number of terminal nodes or
leaves in a tree.
• Can be defined in place of max_depth. Since
binary trees are created, a depth of ‘n’ would
produce a maximum of 2^n leaves.
Maximum features to consider for split
• The number of features to consider while searching for the best split.
These are selected at random.
• As a rule of thumb, the square root of the total number of features works
well, but values up to 30-40% of the total number of features are worth
checking.
• Higher values can lead to over-fitting, but this depends on the case.
All of these constraints are illustrated in the sketch below.
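As a sketch, these constraints map onto scikit-learn's DecisionTreeClassifier parameters; the numeric values below are arbitrary placeholders (not recommendations) and should be tuned with cross-validation.

```python
# Sketch of the constraints described above as scikit-learn parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(
    min_samples_split=50,   # minimum samples a node needs to be considered for splitting
    min_samples_leaf=20,    # minimum samples required in a terminal node (leaf)
    max_depth=5,            # maximum vertical depth of the tree
    max_leaf_nodes=16,      # maximum number of terminal nodes (alternative to max_depth)
    max_features="sqrt",    # features considered per split (square-root rule of thumb)
    random_state=0,
)
print(cross_val_score(tree, X, y, cv=5).mean())
```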
Pruning in tree based algorithms
Imagine you are driving and there are 2 lanes ahead:
• A lane with cars moving at 80 km/h
• A lane with trucks moving at 30 km/h
At this instant, you are the car behind and you have 2 choices:
• Take a left and overtake the other 2 cars quickly
• Keep moving in the present lane
• This is exactly the difference between a normal decision tree and pruning.
A decision tree with constraints won't see the truck further ahead and will
adopt a greedy approach by taking the left. With pruning, in effect we look
a few steps ahead before making the choice.
So we know pruning is better. But how do we implement it in a decision tree?
The idea is simple:
• First grow the decision tree to a large depth.
• Then start at the bottom and remove leaves that give negative returns
compared with the top.
• Suppose a split gives a gain of −10 (a loss of 10) and the next split on
that node gives a gain of 20. A simple decision tree would stop at step 1,
but with pruning we see that the overall gain is +10 and keep both splits
(see the sketch below).
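One concrete way to post-prune in practice is scikit-learn's cost-complexity pruning, a close cousin of the bottom-up idea described above rather than a literal transcription of it. A minimal sketch, using a stand-in dataset:

```python
# Sketch: grow a full tree, then post-prune with cost-complexity pruning,
# choosing the pruning strength that scores best on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Candidate pruning strengths, from no pruning up to pruning everything.
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
    key=lambda t: t.score(X_te, y_te),
)
print(best.get_n_leaves(), best.score(X_te, y_te))
```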
Are tree based algorithms better than linear models?
• If the relationship between the dependent and independent variables is well
approximated by a linear model, linear regression will outperform a tree
based model.
• If there is high non-linearity and a complex relationship between the
dependent and independent variables, a tree model will outperform a
classical regression method.
• If you need to build a model that is easy to explain to people, a decision
tree model will always do better than a linear model; decision tree models
are even simpler to interpret than linear regression.
Ensemble Method
• The literal meaning of the word 'ensemble' is group.
• Like every other model, a tree based
algorithm also suffers from the plague of bias
and variance. Bias means, ‘how much on an
average are the predicted values different
from the actual value.’ Variance means, ‘how
different will the predictions of the model be
at the same point if different samples are
taken from the same population’.
• Normally, as you increase the complexity of
your model, you will see a reduction in
prediction error due to lower bias in the
model. As you continue to make your model
more complex, you end up over-fitting your
model and your model will start suffering from
high variance.
A champion model maintains a balance between these two types of error. This
is known as the bias-variance trade-off, and ensemble learning is one way to
manage it.
Some of the commonly used ensemble
methods include:
• Bagging,
• Boosting and
• Stacking
Bagging
• Bagging is
an ensemble
technique used to
reduce the variance
of our predictions by
combining the result
of multiple classifiers
modeled on different
sub-samples of the
same data set.
The steps followed in bagging are
1. Create Multiple Datasets:
• Sampling is done with replacement on the
original data and new datasets are formed.
• The new data sets can contain a fraction of the columns as well as of the
rows; these fractions are generally hyper-parameters in a bagging model.
• Taking row and column fractions less than 1 helps in building robust
models that are less prone to overfitting.
2. Build Multiple Classifiers:
• Classifiers are built on each data set.
• Generally the same classifier is modeled on
each data set and predictions are made.
3. Combine Classifiers:
• The predictions of all the classifiers are
combined using a mean, median or mode
value depending on the problem at hand.
• The combined values are generally more
robust than a single model.
There are various implementations of bagging models; random forest is one of
them. A generic bagging sketch is shown below.
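A minimal bagging sketch with scikit-learn on synthetic data; BaggingClassifier uses a decision tree as its default base model, and the row and column fractions appear as hyper-parameters.

```python
# Sketch of bagging: many models on sub-samples of rows (with replacement)
# and a fraction of the columns, predictions combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagger = BaggingClassifier(   # the default base estimator is a decision tree
    n_estimators=100,         # number of sub-sampled datasets / classifiers
    max_samples=0.8,          # row fraction drawn (with replacement) per model
    max_features=0.8,         # column fraction used per model
    random_state=0,
).fit(X, y)
print(bagger.predict(X[:5]))
```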
Random Forest
• Random Forest is considered to be
a panacea of all data science problems. On a
funny note, when you can’t think of any
algorithm (irrespective of situation), use
random forest.
• Random Forest is a versatile machine learning method capable of performing
both regression and classification tasks. It also handles dimensionality
reduction, missing values, outliers and other essential steps of data
exploration, and does a fairly good job. It is a type of ensemble learning
method, where a group of weak models combine to form a powerful model.
• In Random Forest, we grow multiple trees as
opposed to a single tree in CART model. To
classify a new object based on attributes, each
tree gives a classification and we say the tree
“votes” for that class. The forest chooses the
classification having the most votes (over all
the trees in the forest) and in case of
regression, it takes the average of outputs by
different trees.
Each tree is planted and grown as follows:

• Assume the number of cases in the training set is N. A sample of these N cases is
taken at random, but with replacement. This sample will be the training set for
growing the tree.
• If there are M input variables, a number m < M is specified such that at each
node, m variables are selected at random out of the M, and the best split on these
m is used to split the node. The value of m is held constant while we grow the forest.
• Each tree is grown to the largest extent possible; there is no pruning.
• New data is predicted by aggregating the predictions of the ntree trees (majority
vote for classification, average for regression), as sketched below.
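A minimal random forest sketch with scikit-learn on synthetic data, mirroring the procedure above (bootstrap samples of the cases, a random subset of features at each node, no pruning).

```python
# Sketch of the growing procedure above using RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees ("ntree")
    max_features="sqrt",   # m features tried at each node, with m < M
    bootstrap=True,        # sample the N cases with replacement for each tree
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]))   # class chosen by majority vote over the trees
```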
Boosting
• The term 'Boosting' refers to a family of algorithms that convert weak
learners into strong learners.
How would you classify an email as SPAM or not? Like everyone else, our initial
approach would be to identify 'spam' and 'not spam' emails using criteria such as
the following. If:
• The email has only one image file (a promotional image), it's SPAM
• The email has only link(s), it's SPAM
• The email body consists of a sentence like "You won a prize money of $ xxxxxx",
it's SPAM
• The email is from an official domain, it's not SPAM
• The email is from a known source, it's not SPAM
• Above, we have defined multiple rules to classify an email as 'spam' or
'not spam'. But do you think these rules individually are strong enough to
successfully classify an email? No.
• Individually, these rules are not powerful enough to classify an email as
'spam' or 'not spam'. Therefore, these rules are called weak learners.
To convert weak learners into a strong learner, we combine the predictions of
the weak learners using methods like:
• An average or weighted average
• Taking the prediction with the higher vote (majority voting)
For example, above we defined 5 weak learners. Out of these 5, 3 vote 'SPAM'
and 2 vote 'Not SPAM'. In this case, by default, we consider the email to be
SPAM because 'SPAM' has the higher vote (3).
How does boosting identify weak rules?
• To find a weak rule, we apply a base learning (ML) algorithm with a
different distribution of the data each time. Each time the base learning
algorithm is applied, it generates a new weak prediction rule. This is an
iterative process: after many iterations, the boosting algorithm combines
these weak rules into a single strong prediction rule. This is how the
ensemble model is built.
How do we choose a different distribution for each round?
• Step 1: The base learner takes all the observations and assigns equal
weight (attention) to each one.
• Step 2: If the first base learning algorithm makes prediction errors, we
pay higher attention to the observations with prediction errors, then apply
the next base learning algorithm.
• Step 3: Repeat Step 2 until the limit on the number of base learners is
reached or a sufficiently high accuracy is achieved.
• Finally, the outputs from the weak learners are combined into a strong
learner, which improves the prediction power of the model. Boosting puts
higher focus on examples that were misclassified or had higher errors under
the preceding weak rules, as in the sketch below.
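A minimal sketch with scikit-learn's AdaBoost, one classic boosting algorithm that re-weights misclassified observations in exactly this way; the data is synthetic.

```python
# Sketch of the re-weighting scheme above using AdaBoost: each round fits a
# weak learner (a decision stump by default), then increases the weight of
# the observations it misclassified before fitting the next one.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(
    n_estimators=200,    # number of boosting rounds / weak rules
    learning_rate=0.5,   # shrinks each weak learner's contribution
    random_state=0,
).fit(X_tr, y_tr)
print(booster.score(X_te, y_te))
```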
Stacking
• Stacking trains multiple models (base learners) and then trains a
meta-model (also called a second-level model) to combine their predictions.
• Typically different base learners are used (e.g., logistic regression,
SVM, neural network) together with a meta-model (e.g., linear regression)
that combines their predictions (see the sketch below).
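A minimal stacking sketch with scikit-learn on synthetic data, using a decision tree and an SVM as base learners and logistic regression as the meta-model (a common choice for classification).

```python
# Sketch of stacking: heterogeneous base learners plus a meta-model that
# learns how to combine their predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),  # the second-level / meta-model
    cv=5,  # base-learner predictions for the meta-model come from cross-validation
).fit(X, y)
print(stack.predict(X[:5]))
```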
Thank You
