
DECISION TREE

• Data that can be separated by a single decision boundary is easier to classify using logistic regression.

DECISION TREE

• Complex datasets cannot be classified using a single decision boundary.
• We need to split the dataset again and again to create multiple decision boundaries.
• A decision tree helps us build them.

DECISION TREE

• An example decision tree.

DECISION TREE

• Humans can supply rules to logical reasoning programs.

• Another way is to impart the ability to construct rules to the machines themselves.

• The machine is given raw data and is supposed to form rules (i.e. a model or concept) about the process from which the data is generated.

DECISION TREES

Among the most widely used learning methods

A method that induces concepts from examples

The learning is supervised, i.e. the classes or categories of the data instances are known

Concepts are represented as decision trees, a representation that allows us to determine the classification of an object by testing its values for certain properties

DECISION TREES

We may think of each property of an instance as contributing a certain amount of information to its classification.

For example, if our goal is to determine the species of an animal, the discovery that it lays eggs contributes a certain amount of information to that goal.

Definition

 A decision tree is a classifier in the form of a tree structure
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome (split value) of the tested attribute
– Path: a conjunction of tests that leads to the final decision

 Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
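A minimal sketch of this structure in Python (the Node class and classify function are illustrative names, not from the lecture):

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None                             # set on decision nodes
    children: Dict[str, "Node"] = field(default_factory=dict)   # arc: attribute value -> child node
    label: Optional[str] = None                                 # set on leaf nodes (target value)

def classify(node: Node, instance: dict) -> str:
    # Start at the root and follow the arc matching the instance's value
    # for the tested attribute until a leaf node is reached.
    while node.label is None:
        node = node.children[instance[node.attribute]]
    return node.label
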
Why decision tree?

 Decision trees are powerful and popular tools for classification and prediction.
 Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases.
Key requirements

 Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
 Predefined classes (target values): the target function has discrete output values (boolean or multiclass).
 Sufficient data: enough training cases should be provided to learn the model.
Example – Average Disorder
Name   Hair    Height   Weight   Lotion   Result
Sarah  Blonde  Average  Light    No       Sunburned
Dana   Blonde  Tall     Average  Yes      None
Alex   Brown   Short    Average  Yes      None
Annie  Blonde  Short    Average  No       Sunburned
Emily  Red     Average  Heavy    No       Sunburned
Pete   Brown   Tall     Heavy    No       None
John   Brown   Average  Heavy    No       None
Katie  Blonde  Short    Light    Yes      None

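To make the calculations on the following slides easy to reproduce, the table can be written as plain Python data (the DATA name is my own; the values are copied from the table above):

DATA = [
    {"Name": "Sarah", "Hair": "Blonde", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "Result": "Sunburned"},
    {"Name": "Dana",  "Hair": "Blonde", "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Name": "Alex",  "Hair": "Brown",  "Height": "Short",   "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Name": "Annie", "Hair": "Blonde", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "Result": "Sunburned"},
    {"Name": "Emily", "Hair": "Red",    "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "Sunburned"},
    {"Name": "Pete",  "Hair": "Brown",  "Height": "Tall",    "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Name": "John",  "Hair": "Brown",  "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Name": "Katie", "Hair": "Blonde", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "Result": "None"},
]
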
Average Disorder
 Average Disorder = Σ_b (Nb / Nt) × Σ_c ( -(Nbc / Nb) log2 (Nbc / Nb) )
 where Nb is the number of samples in branch b,
 Nt is the total number of samples in all branches,
 Nbc is the number of samples in branch b that belong to class c

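A minimal sketch of this formula in Python, using the DATA list defined after the table (the function name average_disorder is my own):

import math

def average_disorder(data, attribute, target="Result"):
    # Weighted sum over every branch b (one branch per value of `attribute`)
    # of the disorder (entropy) of the target labels inside that branch.
    total = len(data)                                   # Nt
    disorder = 0.0
    for value in set(row[attribute] for row in data):
        branch = [row for row in data if row[attribute] == value]
        nb = len(branch)                                # Nb
        branch_disorder = 0.0
        for label in set(row[target] for row in branch):
            nbc = sum(1 for row in branch if row[target] == label)   # Nbc
            branch_disorder -= (nbc / nb) * math.log2(nbc / nb)
        disorder += (nb / total) * branch_disorder
    return disorder
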
Example (cont.)

Attribute   Value    Occurrences
Hair        Blonde   4
            Brown    3
            Red      1

Example (cont.)
 Blonde: (4/8) × ( -(2/4) log2(2/4) - (2/4) log2(2/4) ) = (4/8)(0.5 + 0.5) = 0.5
 Brown:  (3/8) × ( -(3/3) log2(3/3) ) = (3/8)(-1 × log2 1) = 0
 Red:    (1/8) × ( -(1/1) log2(1/1) ) = 0

Example (cont.)
Average Disorder (Hair) = Blonde + Brown + Red = 0.5 + 0 + 0 = 0.5

Example (cont.)
 Similarly, the average disorder of the other attributes can be calculated; it turns out to be:
 Average Disorder (Hair) = 0.5
 Average Disorder (Height) = 0.6886
 Average Disorder (Weight) = 0.9386
 Average Disorder (Lotion) = 0.6067

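As a quick check, the average_disorder sketch given under the Average Disorder formula reproduces these numbers on the DATA table (the small differences are rounding in the slide):

for attr in ("Hair", "Height", "Weight", "Lotion"):
    print(attr, round(average_disorder(DATA, attr), 4))
# Hair 0.5, Height 0.6887, Weight 0.9387, Lotion 0.6068
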
Example (cont.)
 The most homogeneous attribute (lowest average disorder) is Hair, so put Hair as the first test. The tree will be:

Hair
├── Blonde
├── Red
└── Brown

Example (cont.)
 With red and brown hair, all training samples are completely classified. The only remaining problem is the blonde branch.

Attribute                     Value     Occurrences
Height (with Hair = Blonde)   Tall      1
                              Average   1
                              Short     2

Example (cont.)
 Tall:    (1/4) × ( -(1/1) log2(1/1) ) = 0
 Average: (1/4) × ( -(1/1) log2(1/1) ) = 0
 Short:   (2/4) × ( -(1/2) log2(1/2) - (1/2) log2(1/2) ) = (2/4)(0.5 + 0.5) = 0.5
 Average Disorder (Height with “Hair = Blonde”) = 0 + 0 + 0.5 = 0.5

Example (cont.)
 Similarly, for the other attributes restricted to Hair = Blonde, the average disorder is:
 Average Disorder (Height with “Hair = Blonde”) = 0.5
 Average Disorder (Weight with “Hair = Blonde”) = 1
 Average Disorder (Lotion with “Hair = Blonde”) = 0

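The same sketch, restricted to the blonde rows, reproduces these values:

blonde = [row for row in DATA if row["Hair"] == "Blonde"]
for attr in ("Height", "Weight", "Lotion"):
    print(attr, average_disorder(blonde, attr))
# Height 0.5, Weight 1.0, Lotion 0.0  ->  Lotion is the best next test
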
Example (cont.)
 Here Lotion has the minimum average disorder, so it will be the next test. Now the tree becomes:

Hair
├── Blonde → Lotion Used
│             ├── No
│             └── Yes
├── Red
└── Brown

Example (cont.)
Hair
├── Blonde → Lotion Used
│             ├── No  → Sunburned (Sarah, Annie)
│             └── Yes → Not sunburned (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → Not sunburned (Alex, Pete, John)

Entropy
 A measure of the homogeneity of a set of examples: entropy is 0 when the set is perfectly homogeneous and maximal when the classes are evenly mixed.

 Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of S relative to this binary classification is

E(S) = - p(P) log2 p(P) - p(N) log2 p(N)


Entropy

 Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-]. Then the entropy of S relative to this classification is

E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25)
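Worked out numerically (a small sketch using only the standard math module):

import math

p_pos, p_neg = 15 / 25, 10 / 25
print(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg))   # ≈ 0.971
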


Some Intuitions
 The entropy is 0 if the outcome is “certain”.
 The entropy is maximum if we have no knowledge of the system (i.e. any outcome is equally possible).

[Figure: entropy of a 2-class problem as a function of the proportion of one of the two groups]
Information Gain

 Information gain measures the expected reduction in entropy, or uncertainty:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) × Entropy(Sv)

 Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v: Sv = {s ∈ S | A(s) = v}.
 The first term in the equation for Gain is just the entropy of the original collection S.
 The second term is the expected value of the entropy after S is partitioned using attribute A.
Information Gain

 It is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.
 It is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
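A minimal sketch that combines an entropy function with the earlier average_disorder function to compute the gain of each attribute on the sunburn DATA (function names are my own):

import math

def entropy(data, target="Result"):
    # Entropy of the target labels in `data`.
    total = len(data)
    result = 0.0
    for label in set(row[target] for row in data):
        p = sum(1 for row in data if row[target] == label) / total
        result -= p * math.log2(p)
    return result

for attr in ("Hair", "Height", "Weight", "Lotion"):
    print(attr, round(entropy(DATA) - average_disorder(DATA, attr), 3))
# Hair has the largest gain (≈ 0.45), the same attribute the
# average-disorder criterion selected first.
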
Example – Entropy/ I.G.

[Slides 30–60: a worked entropy / information-gain example, shown as images only in this copy.]
Gini Index - Example

[Slides 61–83: a worked Gini-index example, shown as images only in this copy.]
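As a reference while the Gini slides are image-only, a minimal sketch of the standard Gini impurity and of a Gini-based split measure, applied to the same sunburn DATA, might look like this (names and structure are my own, not taken from the slides):

def gini(data, target="Result"):
    # Gini impurity: 1 minus the sum of squared class proportions.
    total = len(data)
    return 1.0 - sum(
        (sum(1 for row in data if row[target] == label) / total) ** 2
        for label in set(row[target] for row in data)
    )

def gini_index(data, attribute):
    # Weighted Gini impurity of the branches produced by splitting on
    # `attribute` (lower is better), analogous to average disorder above.
    total = len(data)
    values = set(row[attribute] for row in data)
    return sum(
        (len(branch) / total) * gini(branch)
        for branch in ([row for row in data if row[attribute] == v] for v in values)
    )

print(min(("Hair", "Height", "Weight", "Lotion"), key=lambda a: gini_index(DATA, a)))
# Hair -- the same first split the entropy-based criterion chose
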
From Decision Trees to Rules

DECISION TREES

From Decision Trees to Rules

Hair
├── Blonde → Lotion Used
│             ├── No  → Sunburned (Sarah, Annie)
│             └── Yes → Not sunburned (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → Not sunburned (Alex, Pete, John)
IDENTIFICATION TREES

From Decision Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is blonde
   and the person uses lotion
then the person is not sunburned

If the person’s hair is blonde
   and the person uses no lotion
then the person is sunburned

IDENTIFICATION TREES

From Decision Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is red
then the person is sunburned

If the person’s hair is brown
then the person is not sunburned

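The four rules, written as a small Python sketch over the DATA table (the function name is my own):

def classify_person(person):
    # Rules read off the identification tree, before pruning.
    if person["Hair"] == "Blonde" and person["Lotion"] == "Yes":
        return "None"        # blonde and uses lotion -> not sunburned
    if person["Hair"] == "Blonde" and person["Lotion"] == "No":
        return "Sunburned"   # blonde and no lotion   -> sunburned
    if person["Hair"] == "Red":
        return "Sunburned"
    if person["Hair"] == "Brown":
        return "None"

# Every training case is reproduced:
assert all(classify_person(row) == row["Result"] for row in DATA)
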
IDENTIFICATION TREES

From Decision Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

To simplify a rule, you ask whether any of the antecedents can be eliminated without changing what the rule does on the samples.

IDENTIFICATION TREES

From Decision Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 1st antecedent and check the rule over the whole database, we find that there are no misclassifications; hence we can drop this antecedent as unnecessary.

IDENTIFICATION TREES

From Decision Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 2nd antecedent, the resulting shortened rule is not consistent with the data; hence this antecedent cannot be eliminated.

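A minimal sketch of this consistency check over the training table (the helper name and the lambda rules are my own):

def rule_is_consistent(fires, conclusion, data):
    # A shortened rule is kept only if, on every sample it fires on,
    # its conclusion matches the recorded Result.
    return all(row["Result"] == conclusion for row in data if fires(row))

# "person uses lotion -> no sunburn" (1st antecedent dropped):
print(rule_is_consistent(lambda r: r["Lotion"] == "Yes", "None", DATA))   # True
# "hair is blonde -> no sunburn" (2nd antecedent dropped instead):
print(rule_is_consistent(lambda r: r["Hair"] == "Blonde", "None", DATA))  # False
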
IDENTIFICATION TREES

From Decision Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

 Rules leading to one label can be replaced by a default rule:
   If no other rule applies
   then label x

 However, this makes a very strong assumption: that all of the uncovered concept space belongs to label x.

 This may lead to the misclassification of unknown instances.

Strengths

 Can generate understandable rules.
 Perform classification without much computation.
 Can handle both continuous and categorical variables.
 Provide a clear indication of which fields are most important for prediction or classification.
Weaknesses
 Not suitable for predicting a continuous attribute.
 Perform poorly with many classes and small data sets.
 Computationally expensive to train:
 At each node, each candidate splitting field must be sorted before its best split can be found.
 In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.
 Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
 Do not handle non-rectangular regions well.
