Chapter Three
Decision Tree
Copyright 2012 Pearson Education, Inc.
Overview
Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences (in Disjunctive Normal Form) written in propositional logic.
At a basic level, machine learning is about
predicting the future based on the past.
For instance, you might wish to predict how much a user, Alice, will like a movie that she hasn't seen, based on her ratings of movies that she has seen. This means making informed guesses about some unobserved property of some object, based on observed properties of that object.
Imagine you only ever do four things at the weekend: go shopping, watch a movie, play tennis, or just stay in. What you do depends on three things: the weather (windy, rainy or sunny); how much money you have (rich or poor); and whether your parents are visiting. You say to yourself: if my parents are visiting, we'll go to the cinema. If they're not visiting and it's sunny, then I'll play tennis, but if it's windy and I'm rich, then I'll go shopping. If they're not visiting, it's windy and I'm poor, then I will go to the cinema. If they're not visiting and it's rainy, then I'll stay in.
To remember all this, you draw a flowchart which will enable you to read off your decision. We call such diagrams decision trees. A suitable decision tree for the weekend decision choices would be as follows:
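Rendered as text, one such tree is:

Parents visiting?
  Yes -> Cinema
  No  -> Weather?
           Sunny -> Play tennis
           Windy -> Money?
                      Rich -> Shopping
                      Poor -> Cinema
           Rainy -> Stay in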
We can see why such diagrams are called trees, because, while they
are admittedly upside down, they start from a root and have branches
leading to leaves (the tips of the graph at the bottom). Note that the
leaves are always decisions, and a particular decision might be at the
end of multiple branches (for example, we could choose to go to the
cinema for two different reasons).
According to our decision tree diagram, on Saturday morning, when we wake up, all we need to do is check (a) the weather, (b) how much money we have, and (c) whether our parents' car is parked in the drive. The decision tree will then enable us to make our decision.
Suppose, for example, that the parents haven't turned up and the sun
is shining. Then this path through our decision tree will tell us what to
do:
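In text form, that path is:

Parents visiting? = No  ->  Weather? = Sunny  ->  Play tennis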
Hence we run off to play tennis because our
decision tree told us to. Note that the decision tree
covers all eventualities. That is, there are no
values that the weather, the parents turning up or
the money situation could take which aren't
catered for in the decision tree. Note that, in this
lecture, we will be looking at how to automatically
generate decision trees from examples, not at how
to turn thought processes into decision trees.
The basic idea
In the decision tree above, it is significant
that the "parents visiting" node came at the
top of the tree. We don't know exactly the
reason for this, as we didn't see the
example weekends from which the tree
was produced.
However, it is likely that the number of weekends the
parents visited was relatively high, and every
weekend they did visit, there was a trip to the cinema.
Suppose, for example, the parents have visited every
fortnight for a year, and on each occasion the family
visited the cinema. This means that there is no
evidence in favour of doing anything other than
watching a film when the parents visit. Given that we
are learning rules from examples, this means that if
the parents visit, the decision is already made.
Hence we can put this at the top of the
decision tree, and disregard all the
examples where the parents visited when
constructing the rest of the tree. Not
having to worry about a set of examples
will make the construction job easier.
This kind of thinking underlies the ID3 algorithm for learning decision trees, which we will describe more formally below.
The Basic DTL Algorithm
Top-down, greedy search through the space of
possible decision trees (ID3 and C4.5)
Root: best attribute for classification
Which attribute is the best classifier?
answer based on information gain
Entropy
Putting together a decision tree is all a matter of
choosing which attribute to test at each node in
the tree.
We shall define a measure called information
gain which will be used to decide which attribute
to test at each node.
Information gain is itself calculated using a
measure called entropy.
Given a binary categorisation, C, and a set
of examples, S, for which the proportion of
examples categorised as positive by C is
p+ and the proportion of examples
categorised as negative by C is p-, then the entropy of S is:
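Entropy(S) = - p+ log2 p+ - p- log2 p-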
Information Gain
We now return to the problem of trying to
determine the best attribute to choose for a
particular node in a tree.
The following measure calculates a
numerical value for a given attribute, A,
with respect to a set of examples, S. Note
that the values of attribute A will range over
a set of possibilities which we call
Values(A),
and that, for a particular value from that
set, v, we write Sv for the set of examples
which have value v for attribute A.
The information gain of attribute A, relative
to a collection of examples, S, is calculated
as:
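Gain(S,A) = E(S) - Σ(v ∈ Values(A)) (|Sv| / |S|) * E(Sv)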
Decision Tree Learning
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
[See: Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997]
Decision Tree Learning
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
[See: Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997]
Decision Tree Learning
ID3
Building a Decision Tree
1. First test all attributes and select the one that would function as the best root;
2. Break up the training set into subsets based on the branches of the root node;
3. Test the remaining attributes to see which ones fit best underneath the branches of the root node;
4. Continue this process for all other branches until
   a. all examples of a subset are of one type,
   b. there are no examples left (return majority classification of the parent), or
   c. there are no more attributes left (default value should be majority classification).
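A minimal Python sketch of these steps (function and variable names here are illustrative; the information-gain helper gain is sketched after the entropy and gain formulas in the next section):

from collections import Counter

def id3(examples, attributes, target, parent_majority=None):
    # examples: list of dicts mapping attribute names to values; target: the class attribute.
    # b. no examples left: return the majority classification of the parent
    if not examples:
        return parent_majority
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # a. all examples of this subset are of one type: return that type
    if len(set(labels)) == 1:
        return labels[0]
    # c. no attributes left: default to the majority classification
    if not attributes:
        return majority
    # 1./3. select the attribute with the highest information gain as the (sub)root
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    # 2./4. break the training set into subsets, one per branch, and recurse
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target, majority)
    return tree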
Decision Tree Learning
Determining which attribute is best (Entropy & Gain)
Entropy (E) is the minimum number of bits needed in order to classify an arbitrary example as yes or no:

E(S) = - Σ(i=1..c) pi log2 pi

where S is a set of training examples,
c is the number of classes, and
pi is the proportion of the training set that is of class i.
For our entropy equation, 0 log2 0 = 0.
The information gain G(S,A), where A is an attribute, is:

G(S,A) = E(S) - Σ(v ∈ Values(A)) (|Sv| / |S|) * E(Sv)
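These two formulas translate directly into a small Python sketch (the function names entropy and gain are illustrative, not from the slides):

import math
from collections import Counter

def entropy(examples, target):
    # E(S) = - sum over classes of p_i * log2(p_i), where p_i is the class proportion.
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    # Absent classes contribute 0 log2 0 = 0, so they simply do not appear in the sum.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain(examples, attribute, target):
    # G(S,A) = E(S) - sum over values v of A of (|Sv| / |S|) * E(Sv).
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder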
Decision Tree Learning
Let's Try an Example!
PlayTennis = {no, no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no}
The target function, named PlayTennis, contains two classes:
C1 = yes
C2 = no
E([C1, C2]) represents that there are C1 positive training elements and C2 negative elements.
Therefore the entropy for the training data, E(S), can be represented as E([9+,5-]) because, of the 14 training examples, 9 of them are yes and 5 of them are no.
Decision Tree Learning:
A Simple Example
Let's start off by calculating the entropy of the training set.
E(S) = - Σ(i=1..n) pi log2 pi = - (9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
E(S) = E([9+,5-]) = (-9/14 log2 9/14) + (-5/14 log2 5/14)
= 0.94
Gain(S, Outlook) = ?
Gain(S, Temperature) = ?
Gain(S, Humidity) = ?
Gain(S, Wind) = ?
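As a quick check, the entropy and gain sketches above reproduce these numbers (up to rounding) when run on the 14 examples from the table; data is an illustrative encoding of that table:

# The 14 training examples from the PlayTennis table, one dict per day.
data = [
    dict(Outlook=o, Temperature=t, Humidity=h, Wind=w, PlayTennis=p)
    for o, t, h, w, p in [
        ("Sunny", "Hot", "High", "Weak", "No"),
        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"),
        ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),
        ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Strong", "No"),
    ]
]

print(round(entropy(data, "PlayTennis"), 3))          # about 0.940
for a in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(a, round(gain(data, a, "PlayTennis"), 3))   # about 0.247, 0.029, 0.152, 0.048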
Decision Tree Learning:
A Simple Example
Next we will need to calculate the information gain G(S,A)
for each attribute A where A is taken from the set
{Outlook, Temperature, Humidity, Wind}.
Decision Tree Learning:
A Simple Example
The information gain for Outlook is:
Outlook:
  sunny:    2 yes, 3 no
  overcast: 4 yes, 0 no
  rain:     3 yes, 2 no
G(S,Outlook) = E(S) - [5/14 * E(Outlook=sunny) + 4/14 * E(Outlook=overcast) + 5/14 * E(Outlook=rain)]
G(S,Outlook) = E([9+,5-]) - [5/14*E([2+,3-]) + 4/14*E([4+,0-]) + 5/14*E([3+,2-])]
G(S,Outlook) = 0.94 - [5/14*0.971 + 4/14*0.0 + 5/14*0.971]
G(S,Outlook) = 0.246
Decision Tree Learning:
A Simple Example
G(S,Temperature) = 0.94 - [4/14*E(Temperature=hot) + 6/14*E(Temperature=mild) + 4/14*E(Temperature=cool)]
G(S,Temperature) = 0.94 - [4/14*E([2+,2-]) + 6/14*E([4+,2-]) + 4/14*E([3+,1-])]
G(S,Temperature) = 0.94 - [4/14*1.0 + 6/14*0.918 + 4/14*0.811]
G(S,Temperature) = 0.029
Decision Tree Learning:
A Simple Example
G(S,Humidity) = 0.94 - [7/14*E(Humidity=high) + 7/14*E(Humidity=normal)]
G(S,Humidity) = 0.94 - [7/14*E([3+,4-]) + 7/14*E([6+,1-])]
G(S,Humidity) = 0.94 - [7/14*0.985 + 7/14*0.592]
G(S,Humidity) = 0.1515
Decision Tree Learning:
A Simple Example
G(S,Wind) = 0.94 - [8/14*0.811 + 6/14*1.00]
G(S,Wind) = 0.048
Decision Tree Learning:
A Simple Example
Outlook is our winner!
Decision Tree Learning:
A Simple Example
Now that we have discovered the root of our decision tree, we must recursively find the nodes that should go below Sunny, Overcast, and Rain.
Decision Tree Learning:
A Simple Example
G(Outlook=Rain, Humidity) = 0.971 - [2/5*E(Outlook=Rain ^ Humidity=high) + 3/5*E(Outlook=Rain ^ Humidity=normal)]
G(Outlook=Rain, Humidity) = 0.02
G(Outlook=Rain, Wind) = 0.971 - [3/5*0 + 2/5*0]
G(Outlook=Rain, Wind) = 0.971
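Using the same sketches on the Outlook = Rain subset (again with the illustrative data encoding from earlier):

# Restrict to the five Outlook = Rain examples and compare the remaining attributes.
rain = [ex for ex in data if ex["Outlook"] == "Rain"]
for a in ["Temperature", "Humidity", "Wind"]:
    print(a, round(gain(rain, a, "PlayTennis"), 3))
# Wind wins with a gain of about 0.971, so Wind becomes the node under the Rain branch.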
Decision Tree Learning:
A Simple Example
Now our decision tree looks like:
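Sketched in text, with the Rain branch now testing Wind (the Sunny branch is completed the same way and, as the DNF rules above indicate, ends up testing Humidity):

Outlook
  Sunny    -> Humidity
                Normal -> Yes
                High   -> No
  Overcast -> Yes
  Rain     -> Wind
                Weak   -> Yes
                Strong -> No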