DECISION TREE
• Data that can be separated by a single decision boundary is easier to classify using logistic
regression.
DECISION TREE
• Complex datasets cannot be classified using a single decision boundary.
• We need to split the dataset again and again to create multiple decision boundaries.
• A decision tree helps us build those boundaries.
DECISION TREE
• An example decision tree.
DECISION TREE
• Humans can supply rules to logical reasoning programs.
• Another way is to give the machines themselves the ability to construct rules.
• The machine is given raw data and is expected to form rules (i.e. a model or concept) about the process from which the data was generated.
DECISION TREES
One of the most widely used learning methods
It induces concepts from examples
The learning is supervised: i.e. the classes or categories of the
data instances are known
It represents concepts as decision trees, a representation that
allows us to determine the classification of an object by
testing its values for certain properties
DECISION TREES
We may think of each property of an instance as
contributing a certain amount of information to its
classification.
For example, if our goal is to determine the species of an
animal, the discovery that it lays eggs contributes a certain
amount of information to that goal
Definition
A decision tree is a classifier in the form of a tree structure:
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome (value) of the split on an attribute
– Path: a conjunction of tests that leads to a final decision
Decision trees classify instances or examples by starting
at the root of the tree and moving through it until a leaf
node is reached.
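A minimal sketch in Python of this idea (the class and function names are illustrative, not from the slides): a tree of decision nodes and leaves, and classification by following the branch matching each test from the root to a leaf. The tree shown is the sunburn tree built later in these slides.

class Leaf:
    def __init__(self, label):
        self.label = label          # value of the target attribute

class DecisionNode:
    def __init__(self, attribute, branches):
        self.attribute = attribute  # attribute tested at this node
        self.branches = branches    # dict: attribute value -> child node

def classify(node, instance):
    """Start at the root and follow the branch matching each test until a leaf."""
    while isinstance(node, DecisionNode):
        node = node.branches[instance[node.attribute]]
    return node.label

# The sunburn tree derived later in these slides: test Hair first, then Lotion for blondes.
tree = DecisionNode("Hair", {
    "Blonde": DecisionNode("Lotion", {"No": Leaf("Sunburned"), "Yes": Leaf("None")}),
    "Red": Leaf("Sunburned"),
    "Brown": Leaf("None"),
})

print(classify(tree, {"Hair": "Blonde", "Lotion": "No"}))  # -> Sunburned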
Why decision tree?
Decision trees are powerful and popular
tools for classification and prediction.
Decision trees represent rules, which can
be understood by humans and used in
knowledge systems such as databases.
Key requirements
Attribute-value description: an object or case must
be expressible in terms of a fixed collection of
properties or attributes (e.g., hot, mild, cold).
Predefined classes (target values): the target
function has discrete output values (boolean or
multiclass).
Sufficient data: enough training cases should
be provided to learn the model.
Example – Average Disorder
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Katie Blonde Short Light Yes None
Average Disorder
Average Disorder = Σ_b (N_b / N_t) × ( Σ_c − (N_bc / N_b) log2 (N_bc / N_b) )
where N_b is the number of samples in branch b,
N_t is the total number of samples across all branches,
and N_bc is the number of samples of class c in branch b.
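As a sketch of this formula in Python (the function and variable names are mine, not from the slides; the data is the sunburn table above), the average disorder of every attribute can be computed and compared with the worked numbers that follow:

from math import log2

# Sunburn training data from the table above: (Hair, Height, Weight, Lotion, Result)
data = [
    ("Blonde", "Average", "Light",   "No",  "Sunburned"),  # Sarah
    ("Blonde", "Tall",    "Average", "Yes", "None"),       # Dana
    ("Brown",  "Short",   "Average", "Yes", "None"),       # Alex
    ("Blonde", "Short",   "Average", "No",  "Sunburned"),  # Annie
    ("Red",    "Average", "Heavy",   "No",  "Sunburned"),  # Emily
    ("Brown",  "Tall",    "Heavy",   "No",  "None"),       # Pete
    ("Brown",  "Average", "Heavy",   "No",  "None"),       # John
    ("Blonde", "Short",   "Light",   "Yes", "None"),       # Katie
]
attributes = ["Hair", "Height", "Weight", "Lotion"]

def average_disorder(rows, attr_index):
    """Sum over branches b of (N_b / N_t) times the entropy of the class labels in branch b."""
    total = len(rows)
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[-1])  # group class labels by attribute value
    disorder = 0.0
    for labels in branches.values():
        entropy = 0.0
        for c in set(labels):
            p = labels.count(c) / len(labels)
            entropy -= p * log2(p)
        disorder += len(labels) / total * entropy
    return disorder

for i, name in enumerate(attributes):
    print(name, round(average_disorder(data, i), 4))
# Hair 0.5, Height 0.6887, Weight 0.9387, Lotion 0.6069 (matching the slides up to rounding)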
Example (cont.)
Attribute Name   Attribute Values   Attribute Occurrences
Hair             Blonde             4
                 Brown              3
                 Red                1
Example (cont.)
Blonde = 4/8 (-2/4 log2 2/4 -2/4 log2 2/4)
= 4/8 (0.5 + 0.5)
= 0.5
Brown = 3/8 (-3/3 log2 3/3)
= 3/8 (-1 log2 1)
=0
Red = 1/8 (-1 log2 1)
=0
Example (cont.)
Average Disorder (Hair) = Blonde + Brown + Red
                        = 0.5 + 0 + 0
                        = 0.5
Example (cont.)
Similarly, the average disorder for the other
attributes can be calculated; it turns
out to be
Average Disorder (Hair) = 0.5
Average Disorder (Height) = 0.6886
Average Disorder (Weight) = 0.9386
Average Disorder (Lotion) = 0.6067
Example (cont.)
Hair gives the most homogeneous split (the minimum
average disorder), so Hair is the first test. The tree will be:
Hair
├── Blonde
├── Red
└── Brown
Example (cont.)
With red and brown hair color the training set is already
completely classified, so the only remaining problem is
the blonde branch.
Attribute Name                Attribute Values   Attribute Occurrences
Height (with hair = blonde)   Tall               1
                              Average            1
                              Short              2
Example (cont.)
Tall = 1/4 (-1 log2 1)
=0
Average = 1/4 (-1 log2 1)
=0
Short = 2/4 (-1/2 log2 1/2 -1/2 log2 1/2)
= 2/4 (0.5 + 0.5)
= 0.5
Average Disorder (Height with “hair = blonde”) = 0
+ 0 + 0.5 = 0.5
Example (cont.)
Similarly, for the other attributes restricted to hair =
blonde, the average disorder is:
Average Disorder (Height with “hair = blonde”) = 0.5
Average Disorder (Weight with “hair = blonde”) = 1
Average Disorder (Lotion with “hair = blonde”) = 0
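These numbers can be reproduced with a short, self-contained Python sketch (the helper name and the column layout are mine; the four blonde rows come from the table above):

from math import log2

# Blonde rows only, as (Height, Weight, Lotion, Result).
blonde = [("Average", "Light",   "No",  "Sunburned"),   # Sarah
          ("Tall",    "Average", "Yes", "None"),        # Dana
          ("Short",   "Average", "No",  "Sunburned"),   # Annie
          ("Short",   "Light",   "Yes", "None")]        # Katie

def disorder(rows, col):
    """Size-weighted entropy of the class labels in each branch of the split on column col."""
    branches = {}
    for r in rows:
        branches.setdefault(r[col], []).append(r[-1])
    result = 0.0
    for labels in branches.values():
        result += len(labels) / len(rows) * -sum(
            labels.count(c) / len(labels) * log2(labels.count(c) / len(labels))
            for c in set(labels))
    return result

for col, name in enumerate(["Height", "Weight", "Lotion"]):
    print(name, round(disorder(blonde, col), 4))   # Height 0.5, Weight 1.0, Lotion 0.0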
Example (cont.)
Lotion has the minimum average disorder,
so it will be the next test.
Now the tree will become:
Hair
├── Blonde ── Lotion Used
│               ├── No
│               └── Yes
├── Red
└── Brown
Example (cont.)
Hair
├── Blonde ── Lotion Used
│               ├── No  → Sunburned (Sarah, Annie)
│               └── Yes → None (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → None (Alex, Pete, John)
Entropy
A measure of the (im)purity, or homogeneity, of a set of examples.
Given a set S of positive and negative examples of
some target concept (a 2-class problem), the entropy
of set S relative to this binary classification is
E(S) = -p(P) log2 p(P) - p(N) log2 p(N)
Entropy
Suppose S has 25 examples, 15 positive and 10
negative [15+, 10-]. Then the entropy of S relative to
this classification is
E(S) = -(15/25) log2 (15/25) - (10/25) log2 (10/25) ≈ 0.971
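A quick numerical check of this value with plain Python (only the numbers come from the slide):

from math import log2
p_pos, p_neg = 15/25, 10/25
print(-p_pos * log2(p_pos) - p_neg * log2(p_neg))   # ≈ 0.971 bits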
Some Intuitions
The entropy is 0 if the outcome is "certain".
The entropy is maximum if we have no knowledge of the system
(i.e. any outcome is equally possible).
(Figure: entropy of a 2-class problem with regard to the proportion of one of the two groups.)
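These intuitions can be checked directly; a small sketch (the helper name is mine) evaluating the 2-class entropy at a certain outcome and at the 50/50 point:

from math import log2

def entropy2(p):
    """Entropy of a 2-class set in which one class has proportion p."""
    if p in (0.0, 1.0):
        return 0.0                    # the outcome is certain
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy2(1.0))   # 0.0 -> certain outcome, zero entropy
print(entropy2(0.5))   # 1.0 -> equally likely outcomes, maximum entropy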
Information Gain
Information gain measures the expected reduction in
entropy, or uncertainty.
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
Values(A) is the set of all possible values for attribute A, and
S_v is the subset of S for which attribute A has value v, i.e. S_v = {s in S | A(s) = v}.
The first term in the equation for Gain is just the entropy of the
original collection S.
The second term is the expected value of the entropy after S is
partitioned using attribute A.
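A sketch of this computation for the sunburn data (the helper name and the inlined class labels are mine; the second term is exactly the average disorder computed earlier):

from math import log2

def entropy(labels):
    return -sum(labels.count(c) / len(labels) * log2(labels.count(c) / len(labels))
                for c in set(labels))

# Class labels of the full sunburn set S, and the subsets S_v induced by Hair.
S = ["Sunburned"] * 3 + ["None"] * 5
hair_subsets = {"Blonde": ["Sunburned", "None", "Sunburned", "None"],
                "Red": ["Sunburned"],
                "Brown": ["None", "None", "None"]}

gain = entropy(S) - sum(len(Sv) / len(S) * entropy(Sv) for Sv in hair_subsets.values())
print(round(gain, 4))   # Entropy(S) ≈ 0.9544, expected entropy after the split = 0.5, Gain ≈ 0.4544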
Information Gain
It is simply the expected reduction in
entropy caused by partitioning the
examples according to this attribute.
It is the number of bits saved when
encoding the target value of an arbitrary
member of S, by knowing the value of
attribute A.
Example – Entropy/ I.G.
Gini Index - Example
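The worked Gini example was presented graphically on the original slides. As a hedged sketch of the standard computation (Gini impurity of a set is 1 − Σ_c p_c², and a split is scored by the size-weighted Gini of its branches; this definition is standard, not taken from the slides), applied to the Hair split of the sunburn data:

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

# Class labels in each branch of the Hair split of the sunburn data.
branches = {"Blonde": ["Sunburned", "None", "Sunburned", "None"],
            "Red": ["Sunburned"],
            "Brown": ["None", "None", "None"]}

total = sum(len(b) for b in branches.values())
weighted = sum(len(b) / total * gini(b) for b in branches.values())
print(round(weighted, 4))   # 0.25 - only the mixed Blonde branch (4/8 * 0.5) contributes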
From Decision Trees to Rules
DECISION TREES
From Decision Trees to Rules
Hair
├── Blonde ── Lotion Used
│               ├── No  → Sunburned (Sarah, Annie)
│               └── Yes → None (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → None (Alex, Pete, John)
IDENTIFICATION TREES
From Decision Trees to Rules
Step 3: Make rules from the identification tree
For our example we have:
If the person’s hair is blonde
and the person uses lotion
then the person is not sunburned
If the person’s hair is blonde
and the person uses no lotion
then the person is sunburned
IDENTIFICATION TREES
From Decision Trees to Rules
Step 3: Make rules from the identification tree
For our example we have:
If the person’s hair is red
then the person is sunburned
If the person’s hair is brown
then the person is not sunburned
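Putting the four rules together, they could be written as a single function (a sketch; the function form and names are mine, the rules themselves come from the tree):

def diagnose(hair, lotion):
    """Rules read off the identification tree, before pruning."""
    if hair == "Blonde" and lotion == "Yes":
        return "not sunburned"
    if hair == "Blonde" and lotion == "No":
        return "sunburned"
    if hair == "Red":
        return "sunburned"
    if hair == "Brown":
        return "not sunburned"

print(diagnose("Blonde", "No"))   # sunburned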
IDENTIFICATION TREES
From Decision Trees to Rules
Step 4: Optimize the rules (prune the antecedents)
To simplify a rule, you ask whether any of the antecedents
can be eliminated without changing what the rule does on the
samples
Example:
If hair is blonde and person uses lotion then no sunburn
If we eliminate the 1st antecedent, and check the rule over the
whole database, we find that there are no misclassifications
Hence we can drop this antecedent as unnecessary
IDENTIFICATION TREES
From Decision Trees to Rules
Step 4: Optimize the rules (prune the antecedents)
Example:
If hair is blonde and person uses lotion then no sunburn
If we eliminate the 2nd antecedent, the resulting
shortened rule is not consistent with the data, so this
antecedent cannot be eliminated
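A sketch of this consistency check over the eight training cases (the tuples restate the Hair, Lotion, and Result columns of the table; the code itself is mine):

# (hair, lotion, result) for the eight samples; the other attributes are not used by these rules.
cases = [("Blonde", "No", "Sunburned"), ("Blonde", "Yes", "None"),
         ("Brown", "Yes", "None"), ("Blonde", "No", "Sunburned"),
         ("Red", "No", "Sunburned"), ("Brown", "No", "None"),
         ("Brown", "No", "None"), ("Blonde", "Yes", "None")]

# Drop the 1st antecedent: "if lotion is used then no sunburn" - still consistent with the data.
print(all(result == "None" for hair, lotion, result in cases if lotion == "Yes"))    # True

# Drop the 2nd antecedent: "if hair is blonde then no sunburn" - misclassifies Sarah and Annie.
print(all(result == "None" for hair, lotion, result in cases if hair == "Blonde"))   # False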
IDENTIFICATION TREES
From Decision Trees to Rules
Step 4: Optimize the rules (eliminate unnecessary rules)
Rules leading to one label can be replaced by a default rule:
If no other rule applies
then label x
However, this makes a very strong assumption: that all of the
uncovered concept space belongs to label x
This may lead to misclassification of unknown instances
Strengths
can generate understandable rules
perform classification without much computation
can handle continuous and categorical variables
provide a clear indication of which fields are most
important for prediction or classification
Weaknesses
Not suitable for predicting continuous attributes.
Perform poorly with many classes and small data.
Computationally expensive to train.
At each node, each candidate splitting field must be sorted
before its best split can be found.
In some algorithms, combinations of fields are used and a
search must be made for optimal combining weights.
Pruning algorithms can also be expensive since many
candidate sub-trees must be formed and compared.
Do not handle non-rectangular regions well.