ISOM3360 Data Mining for Business Analytics
Decision Tree Learning
Instructor: Yi Yang
Department of ISOM
Spring 2023
q Last lecture
q Data preparation
q This Lecture
q Decision tree
2
Data Mining Process
3
Supervised Learning
q Classification is used to predict which class
(discrete value) a data point belongs to
q Fraud detection, customer churn prediction
q Regression is used to predict a continuous value.
q Stock price prediction, housing price prediction
4
Training vs. Testing
1. Learn the model on all data, evaluate on parts of data
2. Split data into two parts, learn the model on one part, and
evaluate on the other part.
Recall that an ML model is used to make predictions on
unseen data.
5
Training vs. Testing
1. Learn the model on all data, evaluate on parts of data
2. Split data into two parts, learn the model on one part, and
evaluate on the other part.
Supervised learning rule of thumb:
Never ever use testing data to learn your model.
6
Supervised Learning data split
q training set—a subset to train a model.
q test set—a subset to test the trained model.
q Large enough to yield statistically meaningful results.
q Representative of the data set as a whole. In other words,
don't pick a test set with different characteristics than the
training set.
7
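A minimal sketch of such a split (assuming Python with scikit-learn and a toy dataset; the slides do not prescribe a tool, ratio, or data):

```python
# Minimal sketch of a train/test split with scikit-learn on a toy dataset
# (the tool, the 70/30 ratio, and the data are illustrative choices, not
# prescribed by the slides).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                  # toy feature matrix: 10 examples, 2 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])      # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # hold out 30% of the data for evaluation
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep class proportions similar in both subsets
)
# Rule of thumb from the slides: learn the model on (X_train, y_train) only;
# use (X_test, y_test) solely to evaluate the trained model.
```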
Decision Tree
q Decision trees are one of the most popular data mining
tools.
q Decision Trees are easy to understand, implement and
use, and computationally cheap.
q “It is probably the machine learning workhorse most
widely used in practice to date.”
q Model comprehensibility is important for communicating
with non-DM-savvy stakeholders.
8
Decision Tree
Employed (categorical)   Balance (continuous)   Age   Default
Yes                      123,000                50    No
No                       51,100                 40    Yes
No                       68,000                 55    No
Yes                      34,000                 46    Yes
Yes                      50,000                 44    No
No                       100,000                50    Yes
Yes                      70,000                 35    ?????
Predicting customers who will default on loan payments
9
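For illustration, a sketch of fitting a decision tree to this table and scoring the unlabeled customer. Caveat: scikit-learn's DecisionTreeClassifier implements CART (binary splits, Gini impurity by default), not the ID3 procedure covered later in this lecture, so its tree may differ from the one built by hand; the 1/0 encoding of Employed is my own choice.

```python
# Sketch: fit a decision tree to the loan-default table above and score the
# unlabeled customer. scikit-learn's DecisionTreeClassifier implements CART,
# not ID3, so the learned tree may differ from the lecture's hand-built one.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Employed": [1, 0, 0, 1, 1, 0],     # Yes=1, No=0 (this encoding is my own choice)
    "Balance":  [123_000, 51_100, 68_000, 34_000, 50_000, 100_000],
    "Age":      [50, 40, 55, 46, 44, 50],
    "Default":  ["No", "Yes", "No", "Yes", "No", "Yes"],
})

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(train[["Employed", "Balance", "Age"]], train["Default"])

# The '?????' row from the table: Employed=Yes, Balance=70,000, Age=35.
new_customer = pd.DataFrame({"Employed": [1], "Balance": [70_000], "Age": [35]})
print(tree.predict(new_customer))   # the tree's prediction for that customer
```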
Decision tree
v An upside-down if-else tree; start with all the training data at the root node.
v Each node has an if-else condition about one feature. (Which feature?)
v The dataset is split into subsets based on the condition.
v The root node contains all training examples.
v A leaf node contains a subset of the training examples.
v (Optional) Numerical features are discretized.
[Figure: example tree with root, internal, and leaf nodes labeled. The root node tests Employed (Yes / No); one branch tests Balance (>=50K / <50K) and then Age (>=45 / <45); each leaf node predicts Class = Default or Class = Not Default.]
10
The essence of Decision Tree
q The essence of supervised learning (prediction) is to find
features that are informative and have high predictive power.
q Decision tree methods iteratively select a feature so that,
after splitting, the resulting subsets become more
pure/homogeneous. In other words, the selected feature
has high predictive power.
q Information gain is one way to measure informativeness.
11
Entropy
q Entropy H(S) is a measure of the amount of
uncertainty/impurity in the dataset S (i.e. entropy
characterizes the dataset S). It measures chaos.
H(S) = - Σ_x p(x) log2 p(x), where p(x) is the proportion of class x
in the data S.
q E.g. a dataset is composed of 16 cases of class
“Positive” and 14 cases of class “Negative”:
Entropy (dataset) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.997
12
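A minimal Python sketch (my own) of this entropy calculation, reproducing the 0.997 value for the 16-Positive / 14-Negative dataset:

```python
# Entropy of a dataset given its per-class counts, as on the slide above.
from math import log2

def entropy(counts):
    """H(S) = -sum_x p(x) * log2 p(x); zero-count classes are skipped (0 log2 0 = 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([16, 14]), 3))  # 0.997 -- matches the slide's example
print(entropy([15, 15]))            # 1.0  -- a 50/50 dataset is maximally impure (two classes)
print(entropy([30, 0]))             # -0.0, i.e. zero -- a pure dataset has no impurity
```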
Entropy Exercise
Tip: 0 log2 0 = 0
13
Let’s play a game. I have someone in mind, and your job is
to guess this person. You can only ask yes/no questions.
This person is a HK celebrity.
Go!
14
Information Gain
q The information gain is based on the decrease in entropy
after a dataset is split on a feature.
IG(S, A) = H(S) - Σ_{t ∈ T} p(t) H(t)
(the subtracted term is the weighted average of subset entropy)
• H(S) – Entropy of set S
• T – The subsets created from splitting set S by feature A
• p(t) – The proportion of subset t to set S
• H(t) – Entropy of subset t
[Figure: dataset S split on feature A ("has credit card??") into Yes and No subsets]
15
Information Gain Example
Entropy before splitting = 0.997
A1: has credit card??
  Split into Yes / No subsets; weighted subset entropy after the split = 0.615
  IG = 0.997 - 0.615 = 0.382
A2: is student??
  Split into Yes / No subsets; weighted subset entropy after the split = 0.779
  IG = 0.997 - 0.779 = 0.218
(Individual subset entropies shown in the figure: 0.837 and 0.722.)
16
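The slide does not list the class counts inside each Yes/No subset, so its exact numbers are not reproduced here; instead, an illustrative sketch applying the same information-gain formula to the six labeled customers from the earlier loan-default table, split on Employed:

```python
# Sketch of IG(S, A) = H(S) - sum over subsets t of p(t) * H(t), applied to the
# six labeled customers from the earlier loan-default table, split on Employed.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Information gain from splitting `labels` by a categorical feature."""
    n = len(labels)
    after = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        after += (len(subset) / n) * entropy(subset)   # weighted average of subset entropy
    return entropy(labels) - after

default  = ["No", "Yes", "No", "Yes", "No", "Yes"]    # class label, one per customer
employed = ["Yes", "No", "No", "Yes", "Yes", "No"]    # candidate split feature

print(round(information_gain(default, employed), 3))  # 0.082 on this tiny table
```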
Information Gain Exercise
[Figure: the same dataset split by two candidate features, A1 (left) and A2 (right), each into Yes / No subsets, shown before and after the split.]
Without calculation, which split (left or right) gives the highest
information gain? Which feature (A1 or A2) do we prefer?
17
Decision Tree: ID3
ID3: Only for classification, only handles
categorical features.
q Step 1: Calculate the information gain of every feature
q Step 2: Split the set S into subsets using the feature for
which the information gain is maximum
q Step 3: Make a decision tree node containing that feature,
divide the dataset by its branches and repeat the same
process on every branch.
q Step 4a: A branch with entropy of 0 is a leaf node.
q Step 4b: A branch with entropy more than 0 needs further
splitting.
18
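A compact, illustrative Python sketch of steps 1-4 for categorical features (my own toy implementation, not course code; the example data at the bottom is hypothetical):

```python
# Illustrative ID3 sketch for categorical features only, following steps 1-4
# above. A teaching toy, not production code.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    n = len(labels)
    after = 0.0
    for value in {row[feature] for row in rows}:
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        after += (len(subset) / n) * entropy(subset)   # weighted subset entropy
    return entropy(labels) - after                     # Step 1: IG of this feature

def id3(rows, labels, features):
    # Step 4a: a branch with entropy 0 (all labels identical) becomes a leaf;
    # if no features are left, fall back to the majority class.
    if entropy(labels) == 0 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: split on the feature with maximum information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    # Step 3: make a node for that feature and repeat on every branch.
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        # Step 4b: branches with entropy > 0 are split further (recursion).
        tree[best][value] = id3([rows[i] for i in keep],
                                [labels[i] for i in keep],
                                [f for f in features if f != best])
    return tree

# Tiny usage example on hypothetical categorical data (not from the slides).
rows = [
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Rain",  "Wind": "Weak"},
    {"Outlook": "Rain",  "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "No"]
print(id3(rows, labels, ["Outlook", "Wind"]))
```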
Outlook: Sunny, Overcast, Rain
Temp: Hot, Mild, Cool
Humidity: High, Normal
Wind: Weak, Strong
Decision: Yes (9), No (5)
19
q On board demonstration
20
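As a starting point for the demonstration: with the class counts from the previous slide (9 Yes, 5 No, 14 examples in total), the entropy of the whole dataset before any split is
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940,
and the information gain of each candidate root feature (Outlook, Temp, Humidity, Wind) is measured against this baseline.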
Recap: The essence of Decision Tree
q The essence of supervised learning (prediction) is to find
features that are informative and have high predictive power.
q Decision tree methods select a feature so that, after
splitting, the resulting subsets become more homogeneous. In
other words, the selected feature has high predictive power.
q Information gain is one way to measure informativeness.
21