
KASHIF JAVED

EED, UET, Lahore

Lecture 14
Decision Trees

Readings:
▪ https://people.eecs.berkeley.edu/~jrs/189/
▪ Chapter 3 of Tom Mitchell, “Machine Learning”, McGraw Hill, 1997
▪ Luis G. Serrano, “Grokking Machine Learning”, 2021
Human Reasoning
• Decision tree learning very much resembles human reasoning.
• Consider the scenario: We want to decide whether we should wear a
jacket today.
• The decision process looks like:
▪ Look outside and check if it’s raining
− If it’s raining
• then wear a jacket
− If it’s not
• then we may check the temperature
• If it is hot, then don’t wear a jacket
• If it is cold, then wear a jacket
Human Reasoning
• The decision process can be represented as a tree.

• The decisions are made by traversing the tree from top to bottom.
DT Terminology

Decision Trees
• Nonlinear method for classification and regression.
• Uses a tree with 2 node types:
– internal nodes test feature values (usually just one) & branch accordingly
– leaf nodes specify class ℎ(𝑥)

Decision Trees

Deciding whether to go out for a picnic


Decision Trees
• Cuts x-space into rectangular cells
• Works well with both categorical (e.g., outlook) and quantitative features
(e.g., humidity)
• Interpretable result (inference)
• Decision boundary can be arbitrarily complicated

Decision Trees
Linearly separable dataset vs. non-linearly separable dataset: comparison of linear classifiers vs. decision trees on two examples.
Decision Trees
• Consider classification first.

• Greedy, top-down learning heuristic:

• This algorithm is more or less obvious and has been rediscovered many
times. It’s naturally recursive.

• I’ll show how it works for classification first; later I’ll talk about how it works
for regression.
Greedy Algorithms
• At every point, the algorithm makes the best possible available move

• Greedy algorithms tend to work well, but there is no guarantee that making the best
possible move at each step gets you to the best overall outcome

• The algorithm never backtracks to reconsider earlier choices

The Basic Algorithm
• Evaluate each attribute using a statistical test to determine how well it
alone classifies the training examples

• Select the best attribute and use it as the test at the root node of the tree

• Create a descendant for each possible value of this attribute

• Sort the training examples to the appropriate descendant node

• Repeat the entire process using the training examples associated with
each descendant node
Decision Trees
• Let 𝑆 ⊆ {1, 2, . . . , 𝑛} be a set of sample point indices

• Top-level call: 𝑆 = {1, 2, . . . , 𝑛}

Decision Trees
GrowTree(S )
if (𝑦𝑖 = C for all i ∈ S and some class C) then {
return new leaf(C) [We say the leaves are pure]
} else {
choose best splitting feature j and splitting value 𝛽 (*)
𝑆𝑙 = {𝑖 ∈ 𝑆 ∶ 𝑋𝑖𝑗 < 𝛽} [Or you could use ≤ and >]
𝑆𝑟 = {𝑖 ∈ 𝑆 ∶ 𝑋𝑖𝑗 ≥ 𝛽}
return new node( j, 𝛽, GrowTree(𝑆𝑙 ), GrowTree(𝑆𝑟 ))
}
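
A minimal Python sketch of this recursion, assuming 𝑋 is an 𝑛×𝑑 array indexable as X[i, j] (e.g., NumPy), 𝑦 holds the class labels, and 𝑆 is a list of row indices. The split rule is passed in as a function, since the criterion (*) is developed over the next slides:

from collections import Counter

def grow_tree(X, y, S, choose_split):
    # choose_split(X, y, S) -> (j, beta): splitting feature and splitting value.
    classes = {y[i] for i in S}
    if len(classes) == 1:                 # the leaf is pure
        return ('leaf', classes.pop())
    j, beta = choose_split(X, y, S)
    Sl = [i for i in S if X[i, j] < beta]
    Sr = [i for i in S if X[i, j] >= beta]
    if not Sl or not Sr:                  # no split separates these points: majority leaf
        return ('leaf', Counter(y[i] for i in S).most_common(1)[0][0])
    return ('node', j, beta,
            grow_tree(X, y, Sl, choose_split),
            grow_tree(X, y, Sr, choose_split))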

How to Choose Best Split?

Is this a good attribute to split on? Which one should we pick?

Which attribute/split made more progress in helping us classify points correctly?
How to Choose Best Split?
• Try all splits.

• For a set 𝑆 , let 𝐽(𝑆) be the cost of 𝑆.

• Choose the split that minimizes 𝐽(𝑆𝑙) + 𝐽(𝑆𝑟); or the split that minimizes the
weighted average

      ( |𝑆𝑙| 𝐽(𝑆𝑙) + |𝑆𝑟| 𝐽(𝑆𝑟) ) / ( |𝑆𝑙| + |𝑆𝑟| )

• The vertical brackets | · | denote set cardinality.
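
A small sketch of this selection rule, where 𝐽 is any cost function on an index set and candidates is an assumed list of (𝑆𝑙, 𝑆𝑟) pairs obtained by trying all splits:

def weighted_cost(Sl, Sr, J):
    # Size-weighted average of the two children's costs.
    return (len(Sl) * J(Sl) + len(Sr) * J(Sr)) / (len(Sl) + len(Sr))

def pick_best_split(candidates, J):
    # Return the (Sl, Sr) pair with the lowest weighted average cost.
    return min(candidates, key=lambda split: weighted_cost(split[0], split[1], J))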
Decision Trees
• How to choose cost 𝐽(𝑆)?

• I’m going to start by suggesting a mediocre cost function, so you can see
why it’s mediocre.

• Idea 1 (bad): Label 𝑆 with the class 𝐶 that labels the most points in 𝑆.
𝐽(𝑆) ← # of points in 𝑆 not in class 𝐶.

Decision Trees

Problem: 𝐽(𝑆𝑙) + 𝐽(𝑆𝑟) = 10 for both splits, but the left split is much better. The weighted
average prefers the right split! There are many different splits that all have the same total
cost. We want a cost function that better distinguishes between them.
How to Choose Best Split?

A better split – the one that splits the data into purer subsets.
How to Choose Best Split?

A perfect attribute would ideally divide the examples into subsets that are all positive or all negative.
Decision Trees
• Idea 2 (good): Measure the entropy. [An idea from information theory.]
• Let 𝑌 be a random class variable and suppose 𝑃(𝑌 = 𝐶) = 𝑝𝐶
• The surprise of 𝑌 being class 𝐶 is −log₂ 𝑝𝐶. [Always nonnegative.]
– event w/prob. 1 gives us zero surprise
– event w/prob. 0 gives us infinite surprise!

Decision Trees
• The entropy of an index set 𝑆 is the average surprise. (It characterizes the
(im)purity of an arbitrary collection of examples.)

      𝐻(𝑆) = − Σ_𝐶 𝑝𝐶 log₂ 𝑝𝐶,   where   𝑝𝐶 = |{𝑖 ∈ 𝑆 : 𝑦𝑖 = 𝐶}| / |𝑆|

• 𝑝𝐶 is the proportion of points in 𝑆 that are in class 𝐶.
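
The same definition in Python; y is assumed to be a list (or array) of class labels and S an index set, and the small label list at the end is a made-up check:

from collections import Counter
from math import log2

def entropy(y, S):
    # H(S) = - sum over classes C of p_C * log2(p_C), with p_C the class proportion in S.
    counts = Counter(y[i] for i in S)
    total = len(S)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = ['C', 'C', 'D', 'D']             # hypothetical labels: half class C, half class D
print(entropy(labels, [0, 1, 2, 3]))      # 1.0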
Decision Trees
• If all points in 𝑆 belong to the same class? 𝐻(𝑆) = −1 log₂ 1 = 0.

• Half class 𝐶, half class 𝐷? 𝐻(𝑆) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1.

• 𝑛 points, all in different classes? 𝐻(𝑆) = −𝑛 · (1/𝑛) log₂(1/𝑛) = log₂ 𝑛.

Decision Trees

Plot of the entropy 𝐻(𝑝𝐶) when there are only two classes. The probability of the second
class is 𝑝𝐷 = 1 − 𝑝𝐶, so we can plot the entropy with just one dependent variable.
Decision Trees

If you have > 2 classes, you would need a multidimensional chart to plot the entropy,
but the entropy is still strictly concave.
Decision Trees
• The weighted average entropy after the split is

      𝐻𝑎𝑓𝑡𝑒𝑟 = ( |𝑆𝑙| 𝐻(𝑆𝑙) + |𝑆𝑟| 𝐻(𝑆𝑟) ) / ( |𝑆𝑙| + |𝑆𝑟| )

• This gives us the remaining uncertainty after getting information on an attribute.

• Choose the attribute/split that minimizes 𝐻𝑎𝑓𝑡𝑒𝑟.
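
A sketch of 𝐻𝑎𝑓𝑡𝑒𝑟, reusing the entropy helper from the earlier sketch:

def H_after(y, Sl, Sr):
    # Size-weighted average entropy of the two children Sl and Sr.
    n = len(Sl) + len(Sr)
    return (len(Sl) * entropy(y, Sl) + len(Sr) * entropy(y, Sr)) / n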
Information Gain
• Information gain – expected reduction in entropy caused by partitioning the
examples according to an attribute
• Choose split that maximizes information gain: 𝐻(𝑆) − 𝐻𝑎𝑓𝑡𝑒𝑟
• Same as minimizing 𝐻𝑎𝑓𝑡𝑒𝑟
• Information gain can never be negative
• Information gain – measures how well a given attribute separates the
training examples according to their target classification
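
In code, information gain is just the drop in entropy; this sketch reuses the entropy and H_after helpers above, and the example split of the hypothetical labels is perfectly pure, so the gain equals the parent entropy of 1:

def information_gain(y, S, Sl, Sr):
    # Expected reduction in entropy from splitting S into Sl and Sr.
    return entropy(y, S) - H_after(y, Sl, Sr)

print(information_gain(labels, [0, 1, 2, 3], [0, 1], [2, 3]))   # 1.0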
Example

Information Gain
• Info gain is always positive, except it is zero

▪ when one child is empty, or

▪ when, for all 𝐶, 𝑃(𝑦𝑖 = 𝐶 | 𝑖 ∈ 𝑆𝑙) = 𝑃(𝑦𝑖 = 𝐶 | 𝑖 ∈ 𝑆𝑟).

Another Example

Calculate Entropy

Calculate Info Gain

Select the root node

Create Branches below the root for each
of its possible values

Repeat the process for each nonterminal
descendant node

Decision Trees
• Suppose we pick two points on the entropy curve

• One represents the left child and the other represents the right child

• The parent also has entropy on the curve

Decision Trees
• If you unite the two sets into one
parent set, the parent set’s value
𝑝𝐶 is the weighted average of the
children’s 𝑝𝐶 ’s.

• Therefore, the point directly above that point on the curve represents the parent's entropy.

Figure: the entropy curve, with 𝑝𝐶 in 𝑆𝑙, 𝑝𝐶 in 𝑆𝑟, and 𝑝𝐶 in 𝑆 (the weighted average) marked.
Decision Trees
• Now draw a line segment connecting them.

• Because the entropy curve is strictly concave, the interior of the line segment is
strictly below the curve.

• Any point on that segment represents a weighted average of the two entropies for
suitable weights.

Figure: the chord joining the two children's points on the entropy curve; 𝑝𝐶 in 𝑆 is the
weighted average of 𝑝𝐶 in 𝑆𝑙 and 𝑝𝐶 in 𝑆𝑟.
Decision Trees
• The information gain is the vertical distance between them.

• So the information gain is positive unless the two child sets both have exactly the
same 𝑝𝐶 and lie at the same point on the curve.
Decision Trees
• Now, contrast the entropy curve against a naïve curve – a plot of the % misclassified.

• If we draw a line segment connecting two points on that curve, the segment might lie
entirely on the curve.
Decision Trees
• The problem is that many different splits will get the same weighted average cost.

• This test doesn't distinguish the quality of different splits well.
Alternative Measures for Selecting
Attributes
• A natural bias in the information gain measure is that it favors attributes with
many values over those with few values

• Consider Day as an attribute

▪ Has a very large number of possible values


▪ Will have the highest IG value as it alone perfectly predicts the target attribute
over the training data
▪ Will be selected for the root node and will lead to a (quite broad) tree of depth
one, which perfectly classifies the training data
Alternative Measures for Selecting
Attributes
• We need to penalize attributes such as Day.
• Split information is sensitive to how broadly and uniformly the attribute
splits the data:

      𝑆𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛 = − Σ_{𝑖=1..𝑐} ( |𝑆𝑖| / |𝑆| ) log₂ ( |𝑆𝑖| / |𝑆| )

• where 𝑆1 through 𝑆𝑐 are the 𝑐 subsets of examples resulting from
partitioning 𝑆 by the 𝑐-valued attribute A.
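
A sketch of SplitInformation in Python, assuming x is the column of attribute values (e.g., the Day value of every example) and S the index set being split:

from collections import Counter
from math import log2

def split_information(x, S):
    # Entropy of S with respect to the attribute's own values, not the class labels.
    counts = Counter(x[i] for i in S)
    total = len(S)
    return -sum((c / total) * log2(c / total) for c in counts.values())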
Gain Ratio
• The Gain Ratio measure is defined in terms of the Gain measure and Split
Information, as follows:

      𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜 = 𝐼𝑛𝑓𝑜 𝐺𝑎𝑖𝑛 / 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛

• The Split Information term discourages the selection of attributes with many
uniformly distributed values.
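
A sketch of Gain Ratio that partitions 𝑆 on the attribute values in x and reuses the entropy and split_information helpers above; the guard avoids dividing by zero when the attribute takes a single value on 𝑆:

def gain_ratio(y, x, S):
    groups = {}
    for i in S:                            # partition S into S_1 .. S_c by attribute value
        groups.setdefault(x[i], []).append(i)
    n = len(S)
    h_after = sum(len(Si) / n * entropy(y, Si) for Si in groups.values())
    info_gain = entropy(y, S) - h_after
    si = split_information(x, S)
    return info_gain / si if si > 0 else 0.0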
Incorporating Continuous-Valued
Attributes
• More on choosing a split:
– For binary feature 𝑥𝑖 : children are 𝑥𝑖 = 0 & 𝑥𝑖 = 1
– If 𝑥𝑖 has 3+ discrete values: split depends on application
▪ Sometimes it makes sense to use multiway splits; sometimes binary splits.

– If 𝑥𝑖 is quantitative (continuous): sort the 𝑥𝑖 values in 𝑆; try splitting between
each pair of unequal consecutive values
▪ We can radix sort the points in linear time, and if 𝑛 is huge we should.
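
A sketch of the quantitative case, assuming 𝑋 is a NumPy array as before; candidate thresholds are placed midway between unequal consecutive values (the midpoint is an arbitrary choice), and the resulting choose_split can be plugged into the grow_tree sketch from earlier:

def candidate_thresholds(X, S, j):
    vals = sorted({X[i, j] for i in S})
    # One candidate between each pair of unequal consecutive values.
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

def choose_split(X, y, S):
    # Try every (feature, threshold) pair; keep the one minimizing H_after.
    best = None
    for j in range(X.shape[1]):
        for beta in candidate_thresholds(X, S, j):
            Sl = [i for i in S if X[i, j] < beta]
            Sr = [i for i in S if X[i, j] >= beta]
            h = H_after(y, Sl, Sr)
            if best is None or h < best[0]:
                best = (h, j, beta)
    return best[1], best[2]               # assumes at least one candidate exists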
Incorporating Continuous-Valued
Attributes
• Clever bit: As you scan the sorted list from left to right, you can update the entropy
in 𝑂(1) time per point!

• This is important for obtaining a fast tree-building time.

• Draw a row of 𝐶’s and 𝑋’s; show how we update the # of 𝐶’s and # of 𝑋’s in
each of 𝑆𝑙 and 𝑆𝑟 as we scan from left to right.
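
A sketch of that scan for two classes, 𝐶 and 𝑋, assuming vals and labels list the feature values and class labels of the node's points sorted by value; only the four running counts are updated at each step, so each candidate split costs 𝑂(1):

from math import log2

def two_class_entropy(a, b):
    # Entropy of a set containing a points of one class and b of the other.
    n = a + b
    return -sum((c / n) * log2(c / n) for c in (a, b) if 0 < c < n)

def best_threshold_scan(vals, labels):
    # Move points one at a time from the right child to the left child.
    n = len(vals)
    left = {'C': 0, 'X': 0}
    right = {'C': labels.count('C'), 'X': labels.count('X')}
    best = None
    for k in range(n - 1):
        left[labels[k]] += 1
        right[labels[k]] -= 1
        if vals[k] == vals[k + 1]:
            continue                       # only split between unequal consecutive values
        h_after = ((k + 1) * two_class_entropy(left['C'], left['X'])
                   + (n - k - 1) * two_class_entropy(right['C'], right['X'])) / n
        if best is None or h_after < best[0]:
            best = (h_after, (vals[k] + vals[k + 1]) / 2)
    return best                            # (lowest H_after, threshold), or None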

Incorporating Continuous-Valued
Attributes

Figure: we need these 4 numbers – the counts of 𝐶's and 𝑋's in each of 𝑆𝑙 and 𝑆𝑟 – to compute the entropy.
Time Complexity
• Algs & running times:
• Classify test point: Walk down tree until leaf. Return its label.
Worst-case time is 𝑂(𝑡𝑟𝑒𝑒 𝑑𝑒𝑝𝑡ℎ).

▪ For binary features, that’s ≤ 𝑑. (Quantitative features may go deeper.)
Usually (not always) ≤ 𝑂(log 𝑛).
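
A classification walk over the tuple-based tree produced by the grow_tree sketch earlier; the loop does one comparison per level, hence the 𝑂(tree depth) cost:

def classify(tree, x):
    # Walk down from the root until a leaf is reached, then return its label.
    while tree[0] == 'node':
        _, j, beta, left, right = tree
        tree = left if x[j] < beta else right
    return tree[1]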

Time Complexity
• Training:
▪ For binary features, try 𝑂(𝑑) splits at each node.
▪ For quantitative features, try 𝑂(𝑛′𝑑) splits; 𝑛′ = number of points in the node
▪ Either way ⇒ 𝑂(𝑛′𝑑) time at this node
▪ Each point participates in 𝑂(𝑑𝑒𝑝𝑡ℎ) nodes and costs 𝑂(𝑑) time in each node.
Running time ≤ 𝑂(𝑛𝑑 · 𝑑𝑒𝑝𝑡ℎ)
