Decision Trees
Decision Tree Overview
● The algorithm uses a tree structure to model relationships among the features and the potential outcomes
● It breaks the dataset down into smaller and smaller subsets as the depth of the tree increases
● The result is a flowchart for deciding how to classify a new observation
Decision Tree Overview
● Decision trees are simple and useful for interpretation
● They have a nice graphical representation
● They are typically not competitive with the best supervised learning approaches
● Random forests and boosting are ways of modifying the approach
● These modifications can sometimes result in a dramatic increase in prediction accuracy
● We will focus on basic Classification And Regression Trees (CART)
Terminology in CART
● Root node – the first decision node, starting from the original data
● Internal nodes – points along the tree where we split the data
● Branch – segment of the tree that connects two nodes
● Leaf/Terminal node – holds the final result of splitting the data
Decision Tree Overview
[Figure: an example tree showing the root node, branches, splits at internal nodes, and terminal nodes]
Classification vs. Regression
● Classification
○ Spam/ Not spam
○ Admit to ICU or not
○ Lend money/ deny
○ Intrusion detection
● Regression
○ Predict stock returns
○ Pricing a house or a car
○ Weather predictions (temperature, rainfall, etc.)
○ Economic growth predictions
Decision Tree Overview
● The decision tree algorithm learns (i.e., creates the decision tree from the data set) by optimizing an error function.
● CART is a technique that can be used to solve both regression and classification problems.
● In this lecture we will focus on solving a classification problem; the approach for a regression problem is similar (a sketch follows below).
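As a minimal sketch of what this looks like in practice, here is scikit-learn's DecisionTreeClassifier, an optimized CART implementation; the iris data set is only a stand-in for a generic data set:

```python
# A minimal sketch using scikit-learn, whose DecisionTreeClassifier is an
# optimized CART implementation; the iris data set is only a stand-in.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the error function optimized at each split;
# for a regression problem, DecisionTreeRegressor is used instead.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))  # class predictions for the first five rows
```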
Visualizing a Classification tree
● Let’s look at a simple example
● How do we classify the observations into red or blue?
Steps in a decision tree
[Figure: the feature space is split step by step into rectangular regions labeled RED CLASS and BLUE CLASS]
Steps in a decision tree
● Once the final divisions are made, if a new data point comes in with value (0.2, 0.3), the classification tree predicts that the new observation is Red, as it falls in a region where the majority class is Red (see the sketch below)
[Figure: final partition of the feature space into RED CLASS and BLUE CLASS regions, with the new point falling in a Red-majority region]
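A sketch of this prediction in scikit-learn; the 2-D training points below are made up for illustration (the original figure's data is not available), but the new observation (0.2, 0.3) lands in a Red-majority region as described:

```python
# Made-up red/blue points standing in for the figure's data set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1, 0.2], [0.3, 0.4], [0.2, 0.1],   # Red corner
              [0.8, 0.7], [0.7, 0.9], [0.9, 0.8]])  # Blue corner
y = np.array(["Red", "Red", "Red", "Blue", "Blue", "Blue"])

tree = DecisionTreeClassifier().fit(X, y)

# The new point falls in the region where the majority class is Red.
print(tree.predict([[0.2, 0.3]]))  # -> ['Red']
```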
Representation as a tree
[Figure: the same partition drawn as a binary tree whose terminal nodes are labeled Red and Blue]
How do we know where to split?
● The earlier dataset was easy: the pattern was simple to visualize. What about complicated datasets where the patterns are more complex? How does the algorithm decide where to split?
[Figure: the partitioned feature space with regions labeled RED CLASS and BLUE CLASS]
How do we know where to split?
● We choose the variable and split point that result in the smallest (size-weighted) sum of the Gini indices of the two new regions
Maximum and minimum of Gini Index (2 classes)
● For two classes with proportions p1 and p2 = 1 − p1, the Gini index is G = p1(1 − p1) + p2(1 − p2)
● Minimum (pure region, p1 = 1): G = 1(1 − 1) + 0(1 − 0) = 0
● Maximum (even mix, p1 = p2 = 0.5): G = 0.5(1 − 0.5) + 0.5(1 − 0.5) = 0.5
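A tiny sketch checking these two extremes in Python:

```python
# Two-class Gini index: G = p1*(1 - p1) + p2*(1 - p2).
def gini(p1):
    p2 = 1 - p1
    return p1 * (1 - p1) + p2 * (1 - p2)

print(gini(1.0))  # pure region: 1(1-1) + 0(1-0) = 0.0
print(gini(0.5))  # even mix:    0.5(0.5) + 0.5(0.5) = 0.5
```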
Simple example
How do we know where to split?
[Figure: two candidate splits of the data, producing region pairs with Gini indices G1, G2 and G3, G4]
Let’s build a small tree for a real world problem
[Figure: root node split on Gender into MALE and FEMALE branches, producing Group 1 and Group 2]
Let’s build a small tree for a real world problem
● A computer calculates the Gini index for splits on different independent variables – Occupation, Age, Gender, etc. – and finds the optimal variable on which to split the data (a sketch of this search follows below)
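A sketch of that search, assuming a small made-up data set with Age and Gender columns: for every variable and every observed threshold, compute the size-weighted sum of the two regions' Gini indices and keep the smallest.

```python
import numpy as np

def gini(labels):
    # Gini index of one region: sum over classes of p * (1 - p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def best_split(X, y):
    # Greedy search: try every variable and every observed threshold.
    best = (None, None, np.inf)  # (feature index, threshold, score)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Size-weighted sum of the two regions' Gini indices
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical data: columns are Age and Gender (0/1); labels are 0/1.
X = np.array([[25, 0], [30, 1], [45, 0], [50, 1]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # splits on Age (feature 0) at 30, weighted Gini 0.0
```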
Overfitting in Decision trees
● Splitting non-stop eventually leads to each point being its own region
● Such a decision tree model would perform very poorly on unseen test data
● Our model is overfitting the data!
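A small sketch that makes this visible, using synthetic data with label noise (make_classification's flip_y): the fully grown tree is perfect on the training set but noticeably worse on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no stopping rule
print(full.score(X_tr, y_tr))  # typically 1.0: each point ends in its own pure region
print(full.score(X_te, y_te))  # noticeably lower on unseen test data
```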
How do we know when to stop?
● One strategy is to stop once the decrease in Gini index from each split is smaller than some threshold
● This is not ideal, as we may miss some good splits later by stopping early at a seemingly poor split
● A better strategy is to grow a very large tree, then prune it back to obtain a subtree (a sketch follows below)
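A sketch of this grow-then-prune strategy using scikit-learn's cost-complexity pruning; the synthetic data set and the loop are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

big = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # very large tree
path = big.cost_complexity_pruning_path(X_tr, y_tr)           # candidate alphas

# Larger ccp_alpha prunes more aggressively; compare subtrees on test data.
for alpha in path.ccp_alphas:
    sub = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={sub.get_n_leaves()}  test acc={sub.score(X_te, y_te):.3f}")
```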
Pruning
● Pruning is not available in MS Azure
● How do we decide the optimal size of a tree?
● Answer – tuning (trial and error) and cross-validation
● Build different decision trees with different hyperparameter values, use cross-validation to observe their performance, and pick the best one (see the sketch below)
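A sketch of that tuning loop with scikit-learn's GridSearchCV; the grid values and iris data set are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None],       # candidate tree sizes
                "min_samples_leaf": [1, 5, 10]},
    cv=5,  # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # best tree and its CV accuracy
```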
Hyperparameters for Decision Tree
● Maximum depth – the length of the longest path from the root to a leaf; we can cap how deep the decision tree grows
● Minimum samples per leaf – the minimum number of samples we allow on each leaf
● Minimum samples to split – the minimum number of samples required to split an internal node
● Maximum features – the number of features considered when looking for each split
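These four hyperparameters map directly onto scikit-learn's DecisionTreeClassifier arguments; the values below are illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,           # maximum depth: longest root-to-leaf path
    min_samples_leaf=5,    # minimum number of samples per leaf
    min_samples_split=10,  # minimum samples required to split an internal node
    max_features="sqrt",   # number of features considered at each split
)
```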
Advantages vs. Disadvantages
Advantages
• Graphical representation of results: very easy to explain to people (probably even easier than linear regression)
• Interpretation: the logic of the model is transparent
• Handles missing data
• Handles numeric and categorical variables
• Captures nonlinear effects
Disadvantages
• Prediction accuracy is not as good as more complicated approaches
• Computational issues with large categorical variables
• The final tree is not very stable (a small change in the data can lead to a very different tree)