Lecture 14
Decision Trees
KASHIF JAVED
EED, UET, Lahore
Readings:
▪ https://people.eecs.berkeley.edu/~jrs/189/
▪ Chapter 3 of Tom Mitchell, “Machine Learning”, McGraw Hill, 1997
▪ Luis G. Serrano, “Grokking Machine Learning”, 2021
Human Reasoning
• Decision tree learning very much resembles human reasoning.
• Consider the scenario: We want to decide whether we should wear a jacket today.
• The decision process looks like:
  ▪ Look outside and check if it’s raining
    − If it’s raining
      • then wear a jacket
    − If it’s not
      • then we may check the temperature
      • If it is hot, then don’t wear a jacket
      • If it is cold, then wear a jacket
Human Reasoning
• The decision process can be represented as a tree.
• The decisions are made by traversing the tree from top to bottom.
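As a minimal sketch, this jacket decision can be written directly as nested conditionals and traversed top to bottom; the function name, the argument names, and the 25 °C “hot” threshold are illustrative assumptions, not part of the slides.

```python
def wear_jacket(is_raining: bool, temperature_c: float) -> bool:
    """Traverse the jacket decision tree from the root down to a leaf."""
    if is_raining:               # root test: look outside
        return True              # raining -> wear a jacket
    # not raining -> check the temperature
    if temperature_c >= 25:      # "hot" cutoff; this threshold is an assumption
        return False             # hot -> don't wear a jacket
    return True                  # cold -> wear a jacket

print(wear_jacket(is_raining=False, temperature_c=30.0))  # False
print(wear_jacket(is_raining=True, temperature_c=30.0))   # True
```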
DT Terminology
Decision Trees
• Nonlinear method for classification and regression.
• Uses a tree with 2 node types:
  – internal nodes test feature values (usually just one) & branch accordingly
  – leaf nodes specify the class ℎ(𝑥)
Decision Trees
Deciding whether to go out for a picnic.
Decision Trees
• Cuts x-space into rectangular cells
• Works well with both categorical (e.g., outlook) and quantitative features (e.g., humidity)
• Interpretable result (inference)
• Decision boundary can be arbitrarily complicated
Decision Trees
Comparison of linear classifiers vs. decision trees on two examples: a linearly separable dataset and a non-linearly separable dataset.
Decision Trees
• Consider classification first.
• Greedy, top-down learning heuristic:
• This algorithm is more or less obvious and has been rediscovered many times. It’s naturally recursive.
• I’ll show how it works for classification first; later I’ll talk about how it works for regression.
Greedy Algorithms
• At every point, the algorithm makes the best possible available move.
• Greedy algorithms tend to work well, but there is no guarantee that making the best possible move at each timestep gets you to the best overall outcome.
• The algorithm never backtracks to reconsider earlier choices.
The Basic Algorithm
• Evaluate each attribute using a statistical test to determine how well it alone classifies the training examples
• Select the best attribute and use it as the test at the root node of the tree
• Create a descendant for each possible value of this attribute
• Sort the training examples to the appropriate descendant node
• Repeat the entire process using the training examples associated with each descendant node
Decision Trees
• Let 𝑆 ⊆ {1, 2, . . . , 𝑛} be a set of sample point indices
• Top-level call: 𝑆 = {1, 2, . . . , 𝑛}
Decision Trees
GrowTree(S)
  if (𝑦𝑖 = C for all i ∈ S and some class C) then {
    return new leaf(C) [We say the leaves are pure]
  } else {
    choose best splitting feature j and splitting value 𝛽 (*)
    𝑆𝑙 = {𝑖 ∈ 𝑆 ∶ 𝑋𝑖𝑗 < 𝛽} [Or you could use ≤ and >]
    𝑆𝑟 = {𝑖 ∈ 𝑆 ∶ 𝑋𝑖𝑗 ≥ 𝛽}
    return new node(j, 𝛽, GrowTree(𝑆𝑙), GrowTree(𝑆𝑟))
  }
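A possible Python sketch of GrowTree, assuming the data live in NumPy arrays X (n × d) and y (labels), and that a choose_split(S, X, y) helper, the step marked (*) and discussed below, returns the best feature j and splitting value β. The Leaf/Node classes and function names are assumptions for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Leaf:
    label: object                 # the class C stored at a pure leaf

@dataclass
class Node:
    feature: int                  # splitting feature j
    beta: float                   # splitting value
    left: object                  # subtree for X[i, j] <  beta
    right: object                 # subtree for X[i, j] >= beta

def grow_tree(S: np.ndarray, X: np.ndarray, y: np.ndarray,
              choose_split: Callable) -> object:
    """S holds sample indices; the top-level call uses S = np.arange(n)."""
    classes = np.unique(y[S])
    if classes.size == 1:                         # all labels equal: pure leaf
        return Leaf(label=classes[0])
    j, beta = choose_split(S, X, y)               # step (*): best j and beta
    Sl = S[X[S, j] < beta]                        # left child indices
    Sr = S[X[S, j] >= beta]                       # right child indices
    if Sl.size == 0 or Sr.size == 0:              # no useful split: majority leaf
        values, counts = np.unique(y[S], return_counts=True)
        return Leaf(label=values[np.argmax(counts)])
    return Node(j, beta,
                grow_tree(Sl, X, y, choose_split),
                grow_tree(Sr, X, y, choose_split))

# Top-level call (sketch): tree = grow_tree(np.arange(len(y)), X, y, choose_split)
```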
How to Choose Best Split?
Is this a good attribute to split on? Which one should we pick?
Which attribute/split made more progress in helping us classify points correctly?
How to Choose Best Split?
• Try all splits.
• For a set 𝑆, let 𝐽(𝑆) be the cost of 𝑆.
• Choose the split that minimizes 𝐽(𝑆𝑙) + 𝐽(𝑆𝑟); or the split that minimizes the weighted average
  (|𝑆𝑙| 𝐽(𝑆𝑙) + |𝑆𝑟| 𝐽(𝑆𝑟)) / (|𝑆𝑙| + |𝑆𝑟|)
• We use vertical brackets | · | to denote set cardinality.
Decision Trees
• How to choose cost 𝐽(𝑆)?
• I’m going to start by suggesting a mediocre cost function, so you can see why it’s mediocre.
• Idea 1 (bad): Label 𝑆 with the class 𝐶 that labels the most points in 𝑆.
  𝐽(𝑆) ← # of points in 𝑆 not in class 𝐶
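A small sketch tying the weighted-average criterion to this Idea 1 cost; the function names are assumptions, and the cost function J is passed in as a parameter so the same helper also works with the entropy-based cost introduced shortly.

```python
from collections import Counter
from typing import Callable, Sequence

def misclass_cost(labels: Sequence) -> int:
    """Idea 1: number of points not in the majority class of the subset."""
    return len(labels) - max(Counter(labels).values())

def weighted_cost(y_left: Sequence, y_right: Sequence,
                  J: Callable[[Sequence], float]) -> float:
    """Weighted average (|S_l| J(S_l) + |S_r| J(S_r)) / (|S_l| + |S_r|)."""
    n_l, n_r = len(y_left), len(y_right)
    return (n_l * J(y_left) + n_r * J(y_right)) / (n_l + n_r)

# Example: left child has labels C, C, C, D; right child has D, D.
print(weighted_cost(['C', 'C', 'C', 'D'], ['D', 'D'], misclass_cost))  # 0.666...
```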
Decision Trees
Problem: 𝐽(𝑆𝑙) + 𝐽(𝑆𝑟) = 10 for both splits, but the left split is much better. The weighted average prefers the right split! There are many different splits that all have the same total cost. We want a cost function that better distinguishes between them.
How to Choose Best Split?
A better split is the one that splits the data into purer subsets.
How to Choose Best Split?
A perfect attribute would ideally divide the examples into subsets that are all positive or all negative.
Decision Trees
• Idea 2 (good): Measure the entropy. [An idea from information theory.]
• Let 𝑌 be a random class variable and suppose 𝑃(𝑌 = 𝐶) = 𝑝𝐶
• The surprise of 𝑌 being class 𝐶 is −log₂ 𝑝𝐶. [Always nonnegative.]
  – event w/ prob. 1 gives us zero surprise
  – event w/ prob. 0 gives us infinite surprise!
Decision Trees
• The entropy of an index set 𝑆 is the average surprise. (It characterizes the (im)purity of an arbitrary collection of examples.)
  𝐻(𝑆) = − Σ𝐶 𝑝𝐶 log₂ 𝑝𝐶
  𝑝𝐶 = |{𝑖 ∈ 𝑆 : 𝑦𝑖 = 𝐶}| / |𝑆|
• 𝑝𝐶 is the proportion of points in 𝑆 that are in class 𝐶
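A direct translation of this definition into Python; the labels are assumed to be hashable values in a sequence, and the function name is an assumption for illustration.

```python
import math
from collections import Counter
from typing import Sequence

def entropy(labels: Sequence) -> float:
    """H(S) = -sum over classes C of p_C * log2(p_C)."""
    n = len(labels)
    counts = Counter(labels)                 # class frequencies in S
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy(['C', 'C', 'D', 'D']))         # 1.0 (half C, half D)
```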
Decision Trees
• If all points in 𝑆 belong to the same class? 𝐻(𝑆) = −1 log₂ 1 = 0.
• Half class 𝐶, half class 𝐷? 𝐻(𝑆) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
• 𝑛 points, all different classes? 𝐻(𝑆) = −log₂(1/𝑛) = log₂ 𝑛
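These three special cases can be checked directly from the definition, for example:

```python
import math

# All points in one class: p_C = 1.
print(-1 * math.log2(1))                                   # -0.0 (i.e., zero)

# Half class C, half class D: p_C = p_D = 0.5.
print(-0.5 * math.log2(0.5) - 0.5 * math.log2(0.5))        # 1.0

# n points, all in different classes: H(S) = -log2(1/n) = log2 n.
n = 8
print(-math.log2(1 / n), math.log2(n))                     # 3.0 3.0
```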
Decision Trees
Plot of the entropy 𝐻(𝑝𝐶) when there are only two classes. The probability of the second class is 𝑝𝐷 = 1 − 𝑝𝐶, so the entropy can be plotted as a function of 𝑝𝐶 alone.
Decision Trees
If you have > 2 classes, you would need a multidimensional chart to plot the entropy, but the entropy is still strictly concave.
Decision Trees
• Weighted avg entropy after the split is
  𝐻𝑎𝑓𝑡𝑒𝑟 = (|𝑆𝑙| 𝐻(𝑆𝑙) + |𝑆𝑟| 𝐻(𝑆𝑟)) / (|𝑆𝑙| + |𝑆𝑟|)
• Gives us the remaining uncertainty after getting info on an attribute
• Choose the attribute/split that minimizes 𝐻𝑎𝑓𝑡𝑒𝑟
Information Gain
• Information gain – expected reduction in entropy caused by partitioning the examples according to an attribute
• Choose the split that maximizes information gain: 𝐻(𝑆) − 𝐻𝑎𝑓𝑡𝑒𝑟
• Same as minimizing 𝐻𝑎𝑓𝑡𝑒𝑟
• Information gain can never be negative
• Information gain – measures how well a given attribute separates the training examples according to their target classification
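A sketch that puts the last two slides together, computing 𝐻𝑎𝑓𝑡𝑒𝑟 for a candidate split and the resulting information gain; the entropy helper from the earlier sketch is repeated so this block stands alone, and the example split is made up for illustration.

```python
import math
from collections import Counter
from typing import Sequence

def entropy(labels: Sequence) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def h_after(y_left: Sequence, y_right: Sequence) -> float:
    """Weighted average entropy of the two children."""
    n_l, n_r = len(y_left), len(y_right)
    return (n_l * entropy(y_left) + n_r * entropy(y_right)) / (n_l + n_r)

def info_gain(y_parent: Sequence, y_left: Sequence, y_right: Sequence) -> float:
    """H(S) - H_after: maximizing this is the same as minimizing H_after."""
    return entropy(y_parent) - h_after(y_left, y_right)

parent = ['C'] * 5 + ['D'] * 5
print(info_gain(parent, ['C'] * 5, ['D'] * 5))           # 1.0 (perfect split)
print(info_gain(parent, ['C', 'C', 'D', 'D', 'D'],
                ['C', 'C', 'C', 'D', 'D']))              # ~0.03 (poor split)
```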
Example
Information Gain
• Info gain is always positive, except that it is zero
  ▪ when one child is empty, or
  ▪ when for all 𝐶, 𝑃(𝑦𝑖 = 𝐶 | 𝑖 ∈ 𝑆𝑙) = 𝑃(𝑦𝑖 = 𝐶 | 𝑖 ∈ 𝑆𝑟).
Another Example
Calculate Entropy
Calculate Info Gain
Select the root node
Create branches below the root for each of its possible values
Repeat the process for each nonterminal descendant node
Decision Trees
• Suppose we pick two points on the entropy curve
• One represents the left child and the other represents the right child
• The parent also has entropy on the curve
Decision Trees
• If you unite the two sets into one parent set, the parent set’s value 𝑝𝐶 is the weighted average of the children’s 𝑝𝐶’s.
• Therefore, the point directly above that point on the curve represents the parent’s entropy.
Decision Trees
• Now draw a line segment connecting them.
• Because the entropy curve is strictly concave, the interior of the line segment is strictly below the curve.
• Any point on that segment represents a weighted average of the two entropies for suitable weights.
Decision Trees
• The information gain is the vertical distance between them.
• So the information gain is positive unless the two child sets both have exactly the same 𝑝𝐶 and lie at the same point on the curve.
Decision Trees
• Now, contrast the entropy curve against a naïve curve: a plot of the % misclassified.
• If we draw a line segment connecting two points on the curve, the segment might lie entirely on the curve.
Decision Trees
• The problem is that many different splits will get the same weighted average cost.
• This test doesn’t distinguish the quality of different splits well.
Alternative Measures for Selecting Attributes
• A natural bias in the information gain measure is that it favors attributes with many values over those with few values.
• Consider Day as an attribute:
  ▪ It has a very large number of possible values.
  ▪ It will have the highest IG value, as it alone perfectly predicts the target attribute over the training data.
  ▪ It will be selected for the root node and will lead to a (quite broad) tree of depth one, which perfectly classifies the training data.
Alternative Measures for Selecting Attributes
• We need to penalize attributes such as Day.
• Split information is sensitive to how broadly and uniformly the attribute splits the data:
  𝑆𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛 = − Σ𝑖=1..𝑐 (|𝑆𝑖|/|𝑆|) log₂(|𝑆𝑖|/|𝑆|)
• where 𝑆1 through 𝑆𝑐 are the 𝑐 subsets of examples resulting from partitioning 𝑆 by the 𝑐-valued attribute
Gain Ratio
• The Gain Ratio measure is defined in terms of the Gain measure and Split Information, as follows:
  𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜 = 𝐼𝑛𝑓𝑜 𝐺𝑎𝑖𝑛 / 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛
• The Split Information term discourages the selection of attributes with many uniformly distributed values.
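A sketch of both quantities for a c-valued categorical attribute; the attribute values are assumed to be given as one value per training example, and the names are illustrative. Note that the ratio is undefined when Split Information is zero (an attribute with a single value).

```python
import math
from collections import Counter
from typing import Sequence

def split_information(attribute_values: Sequence) -> float:
    """-sum over the c subsets S_i of (|S_i|/|S|) * log2(|S_i|/|S|)."""
    n = len(attribute_values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(attribute_values).values())

def gain_ratio(info_gain: float, attribute_values: Sequence) -> float:
    """Gain Ratio = Info Gain / Split Information."""
    return info_gain / split_information(attribute_values)

# An attribute like Day, unique for every one of 14 examples, has split
# information log2(14) ~ 3.81, so its gain ratio is heavily penalized.
days = [f'D{i}' for i in range(1, 15)]
print(split_information(days))          # 3.807...
```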
Incorporating Continuous-Valued Attributes
• More on choosing a split:
  – For a binary feature 𝑥𝑖: children are 𝑥𝑖 = 0 & 𝑥𝑖 = 1
  – If 𝑥𝑖 has 3+ discrete values: the split depends on the application
    ▪ Sometimes it makes sense to use multiway splits; sometimes binary splits.
  – If 𝑥𝑖 is quantitative (continuous): sort the 𝑥𝑖 values in 𝑆; try splitting between each pair of unequal consecutive values
    ▪ We can radix sort the points in linear time, and if 𝑛 is huge we should.
Incorporating Continuous-Valued Attributes
• Clever bit: As you scan the sorted list from left to right, you can update the entropy in 𝑂(1) time per point!
• This is important for obtaining a fast tree-building time.
• Draw a row of 𝐶’s and 𝑋’s; show how we update the # of 𝐶’s and # of 𝑋’s in each of 𝑆𝑙 and 𝑆𝑟 as we scan from left to right.
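A sketch of that scan for a two-class problem with classes labeled 'C' and 'X' as on the slides: sort the points by the feature once, then move the split point from left to right, shifting one point at a time from 𝑆𝑟 to 𝑆𝑙; only the four class counts are updated per point (O(1)), and the weighted entropy is recomputed from them. The function and variable names are assumptions.

```python
import math

def entropy2(a: int, b: int) -> float:
    """Entropy of a subset containing a points of one class and b of the other."""
    n = a + b
    h = 0.0
    for c in (a, b):
        if c > 0:
            h -= (c / n) * math.log2(c / n)
    return h

def best_threshold(values, labels):
    """Return (beta, H_after) for the best split of one quantitative feature."""
    pairs = sorted(zip(values, labels))          # sort once by feature value
    n = len(pairs)
    cl = xl = 0                                  # counts of 'C' and 'X' in S_l
    cr = sum(1 for _, y in pairs if y == 'C')    # counts of 'C' and 'X' in S_r
    xr = n - cr
    best_beta, best_h = None, float('inf')
    for i in range(n - 1):
        # move point i from S_r to S_l: an O(1) update of the four counts
        if pairs[i][1] == 'C':
            cl, cr = cl + 1, cr - 1
        else:
            xl, xr = xl + 1, xr - 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # only split between unequal values
        h_after = ((i + 1) * entropy2(cl, xl)
                   + (n - i - 1) * entropy2(cr, xr)) / n
        if h_after < best_h:
            best_beta = (pairs[i][0] + pairs[i + 1][0]) / 2
            best_h = h_after
    return best_beta, best_h

print(best_threshold([1, 2, 3, 4], ['C', 'C', 'X', 'X']))   # (2.5, 0.0)
```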
Incorporating Continuous-Valued Attributes
We need these 4 numbers (the counts of each class in 𝑆𝑙 and 𝑆𝑟) to compute the entropy.
Time Complexity
• Algs & running times:
• Classify test point: Walk down the tree until a leaf. Return its label. Worst-case time is 𝑂(𝑡𝑟𝑒𝑒 𝑑𝑒𝑝𝑡ℎ).
  ▪ For binary features, that’s ≤ 𝑑. (Quantitative features may go deeper.) Usually (not always) ≤ 𝑂(log 𝑛).
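A sketch of that walk, assuming the Leaf/Node representation from the GrowTree sketch earlier (repeated here so the block is self-contained); the depth-1 example tree is made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    label: object

@dataclass
class Node:
    feature: int        # index j of the feature tested at this node
    beta: float         # go left if x[j] < beta, otherwise go right
    left: object
    right: object

def classify(tree, x):
    """Walk down the tree until a leaf and return its label: O(tree depth)."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] < tree.beta else tree.right
    return tree.label

tree = Node(0, 2.5, Leaf('C'), Leaf('X'))    # depth-1 tree testing feature 0
print(classify(tree, [1.0]))                 # 'C'
print(classify(tree, [4.0]))                 # 'X'
```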
Time Complexity
• Training:
  ▪ For binary features, try 𝑂(𝑑) splits at each node.
  ▪ For quantitative features, try 𝑂(𝑛′𝑑) splits; 𝑛′ = number of points in the node.
  ▪ Either way ⇒ 𝑂(𝑛′𝑑) time at this node.
  ▪ Each point participates in 𝑂(𝑑𝑒𝑝𝑡ℎ) nodes and costs 𝑂(𝑑) time in each node. Running time ≤ 𝑂(𝑛𝑑 · 𝑑𝑒𝑝𝑡ℎ).