
Decision Trees

Decision Tree Overview


● Algorithm uses tree structure to model relationships
among the features and the potential outcomes

● It breaks down dataset into smaller subset with


increase in depth of tree

● It’s a flowchart for deciding how to classify a new


observation
Decision Tree Overview
● Decision trees are simple and useful for interpretation
● They have a nice graphical representation
● They are typically not competitive with the best supervised learning approaches
● Random forests and boosting are some ways to modify the approach
● These sometimes result in a dramatic increase in prediction accuracy
● We will focus on basic Classification And Regression Trees (CART)
Terminology in CART
● Root node – the first decision node, starting from the original data

● Internal nodes – points along the tree where we split the data

● Branch – a segment of the tree that connects the nodes

● Leaf/Terminal node – holds the final result of splitting the data

Decision Tree Overview

[Figure: example tree showing a root node, branches, splits, internal nodes, and terminal nodes]

Classification vs Regression
● Classification
○ Spam / not spam
○ Admit to ICU or not
○ Lend money / deny
○ Intrusion detection

● Regression
○ Predict stock returns
○ Pricing a house or a car
○ Weather predictions (temperature, rainfall, etc.)
○ Economic growth predictions
Decision Tree Overview
● The decision tree algorithm learns (i.e., creates the decision tree from the data set) through the optimization of an error function.

● CART is a technique that can be used for solving both regression and classification problems.

● In this lecture, we will focus on how to solve a classification problem. The approach is similar for a regression problem.
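
As a sketch of that parallel, here is how the two cases look in scikit-learn (an assumption for illustration, since the lecture itself uses MS Azure; the data below is made up):

```python
# Minimal sketch assuming scikit-learn as the CART implementation;
# features and targets below are synthetic.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 0], [40, 1], [35, 0], [50, 1]]  # toy features, e.g. age and gender

# Classification: predict a class label (e.g. lend money / deny)
clf = DecisionTreeClassifier().fit(X, ["deny", "lend", "deny", "lend"])
print(clf.predict([[45, 1]]))  # -> a class label

# Regression: predict a continuous value (e.g. a house price)
reg = DecisionTreeRegressor().fit(X, [120.0, 250.0, 180.0, 300.0])
print(reg.predict([[45, 1]]))  # -> a numeric prediction
```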
Visualizing a Classification tree
● Let’s look at a simple example
● How do we classify the observations into red or blue?
Steps in a decision tree

[Figure: the feature space is split step by step into rectangular regions labeled RED CLASS and BLUE CLASS]
Steps in a decision tree
● Once the final divisions are made, if a new data point comes in with value (0.2, 0.3), the classification tree predicts that the new observation is Red, as it falls in the region where the majority is Red.

[Figure: the partitioned feature space, with the new point landing in a majority-Red region]
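
A minimal sketch of that prediction step, assuming scikit-learn and a made-up stand-in for the red/blue scatter data:

```python
# Sketch with a made-up red/blue dataset (assumes scikit-learn);
# the lecture's actual scatter plot is not reproduced here.
from sklearn.tree import DecisionTreeClassifier

X = [[0.1, 0.2], [0.2, 0.4], [0.3, 0.1],  # points in the "Red" region
     [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]  # points in the "Blue" region
y = ["Red", "Red", "Red", "Blue", "Blue", "Blue"]

tree = DecisionTreeClassifier().fit(X, y)

# The new observation (0.2, 0.3) falls in a majority-Red region.
print(tree.predict([[0.2, 0.3]]))  # -> ['Red']
```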
Representation as a tree

[Figure: the same partition drawn as a tree, with leaf nodes predicting Red or Blue]
How do we know where to split?
● The dataset above was easy: the pattern was simple to visualize. What about complicated datasets where the patterns are more complex? How does the algorithm decide where to split?

[Figure: a more complex partition with alternating BLUE CLASS and RED CLASS regions]
How do we know where to split?

● We choose the variable and split point that result in the smallest weighted sum of the Gini indices of the two new regions
Maximum and minimum of Gini Index (2 classes)

For two classes with proportions p and 1 - p, the Gini index is G = p(1 - p) + (1 - p)p = 2p(1 - p).

Minimum (pure node, p = 1): G = 1(1 - 1) + 0(1 - 0) = 0
Maximum (even 50/50 split, p = 0.5): G = 0.5(1 - 0.5) + 0.5(1 - 0.5) = 0.5
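
A quick numeric check of these endpoints (a sketch, with p the proportion of one class in a node):

```python
# Gini index for a two-class node; p is the proportion of class 1.
def gini(p):
    return p * (1 - p) + (1 - p) * p  # = 2 * p * (1 - p)

print(gini(1.0))  # 0.0 -> pure node, the minimum
print(gini(0.0))  # 0.0 -> pure node, the minimum
print(gini(0.5))  # 0.5 -> even 50/50 node, the maximum
```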
Simple example
How do we know where to split?

[Figure: candidate splits producing regions with Gini indices G1, G2, G3, and G4, compared against each other]
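
A sketch of that comparison in code, scoring each candidate split by the weighted sum of the two regions' Gini indices (the one-feature data and thresholds below are made up):

```python
# Exhaustive split search on a single feature (synthetic data);
# mirrors the G1..G4 comparison in the figure above.
def gini(labels):
    p = labels.count("Red") / len(labels)
    return 2 * p * (1 - p)

xs     = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["Red", "Red", "Red", "Blue", "Blue", "Blue"]

best = None
for threshold in [0.15, 0.25, 0.5, 0.75, 0.85]:  # candidate split points
    left  = [l for x, l in zip(xs, labels) if x <= threshold]
    right = [l for x, l in zip(xs, labels) if x > threshold]
    # weighted sum of the two regions' Gini indices
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
    if best is None or score < best[0]:
        best = (score, threshold)

print(best)  # -> (0.0, 0.5): splitting at 0.5 yields two pure regions
```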
Let’s build a small tree for a real world problem

[Figure: a root node split on Gender into MALE and FEMALE nodes, separating the data into Group 1 and Group 2]
Let’s build a small tree for a real world problem

● A computer would calculate the Gini index for splits on different independent variables (Occupation, Age, Gender, etc.) and find the optimal variable on which to split the data
Overfitting in Decision trees

● Splitting non-stop eventually leads to each point being its own region
● Such a decision tree model would perform very poorly on unseen test data
● Our model is overfitting the data!
How do we know when to stop?

● One strategy is to stop once the decrease in Gini index from a split is smaller than some threshold
● This is not ideal, as we may miss some good later splits by stopping early at a poor split
● A better strategy is to grow a very large tree, then prune it to obtain a subtree
Pruning

● Pruning is not available in MS Azure

● How do we decide the optimal size of a tree?

● Answer – tuning (trial and error) and cross validation

● Build different decision trees with different hyperparameter values, use cross validation to observe their performance, and pick the best one (a sketch follows below)
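
Since Azure does not expose pruning, here is a sketch of what it looks like elsewhere: scikit-learn implements cost-complexity pruning through the ccp_alpha parameter (the data below is synthetic):

```python
# Cost-complexity pruning sketch using scikit-learn (not MS Azure);
# synthetic data stands in for a real dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Grow a large tree first, then ask it for its pruning path:
# each ccp_alpha corresponds to one candidate pruned subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each candidate subtree and keep the best alpha.
scores = [
    (cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                     X, y, cv=5).mean(), a)
    for a in path.ccp_alphas
]
best_score, best_alpha = max(scores)
print(best_alpha, best_score)
```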
Hyperparameters for Decision Tree
● Maximum depth – the longest path from the root to a leaf; we can limit the depth of the decision tree

● Minimum number of samples per leaf – we can set a minimum for the number of samples we allow on each leaf

● Minimum samples to split – the minimum number of samples required to split an internal node

● Maximum features – the number of features considered when looking for each split
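
Putting the last two slides together, a tuning sketch with scikit-learn's GridSearchCV over these four hyperparameters (the parameter names match scikit-learn; the grid values and data are illustrative):

```python
# Hyperparameter tuning sketch with cross validation (synthetic data;
# the grid values are illustrative, not recommendations).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "max_depth": [3, 5, 10, None],      # maximum depth of the tree
    "min_samples_leaf": [1, 5, 10],     # minimum samples per leaf
    "min_samples_split": [2, 10, 20],   # minimum samples to split a node
    "max_features": [None, "sqrt", 2],  # features considered per split
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)  # 5-fold cross validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```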
Advantages vs Disadvantages

Advantages
• Graphical representation of results: very easy to explain to people (probably even easier than linear regression)
• Interpretation (the logic of the model is clear)
• Handles missing data
• Handles numeric and categorical variables
• Captures nonlinear effects

Disadvantages
• Prediction accuracy is not as good as more complicated approaches
• Computational issues with large categorical variables
• The final tree is not very stable (a small change in the data leads to a very different tree)
