Classification

Decision Trees

EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024

What is a decision tree

• Decision trees are a classification methodology (a classifier)

• The classification process is modeled as :
• a set of hierarchical decisions on the feature variables (attributes)
• arranged in a tree-like structure

What is a decision tree

• A decision tree is a tree composed of nodes, branches and leaves.


• Each node contains a test (condition) on one or more attributes.
• ⇒ the split criterion
• At each node, the condition is chosen to split the training examples into classes that are
as distinct as possible
• Each branch of a node represents an outcome of the test
• Example : Color=red, Color=blue, Color=green (3 branches)

• A leaf node represents a class label (class value).

• A new instance is classified by following a matching path from the root to a leaf node.

What is a decision tree

• Example : credit approval
• With attributes : marital status, gender, age, has children
• New customers to classify :
• (Single, Male, 25, No, ?)
• (Married, Female, 35, Yes, ?)

• The decision tree (flattened diagram) :
• Marital status ?
• Single → Age ? ( > 30 → Yes ; < 30 → No )
• Married → Yes
• Divorced → Has children ? ( yes → No ; no → Yes )

Classification
Decision Tree Induction

Building a Decision Tree


• Top-down tree construction : recursive divide-and-conquer
• At start, all training examples are at the root node
• Select one attribute for the root node and create the corresponding branches
• Split instances into subsets, one subset for each branch
• Repeat recursively for each branch
• Stop the growth of the tree based on a stopping criterion and create a leaf
• A simple stopping criterion : if all instances on a branch have the same class (overfitting !)

• Problem :
• How to choose the best splitting attribute to test at a node ?
• A minimal sketch of this recursive construction is given below.
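To make the recursion concrete, here is a minimal, illustrative Python sketch (not the course's own pseudocode). It assumes nominal attributes, rows represented as dictionaries, and a pluggable `choose_attribute` function standing in for the split criterion introduced on the next slides.

```python
# Minimal sketch of top-down decision-tree induction (recursive divide-and-conquer).
# Assumes nominal attributes and rows given as dicts; `choose_attribute` is a stand-in
# for a split criterion such as Information Gain.
from collections import Counter

def build_tree(rows, attributes, target, choose_attribute):
    labels = [r[target] for r in rows]
    # Stopping criterion: pure node or no attribute left to test -> create a leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]     # leaf = (majority) class label
    best = choose_attribute(rows, attributes, target)   # pick the best splitting attribute
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):            # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target, choose_attribute)
    return node

def classify(tree, row):
    # Follow the matching path from the root down to a leaf.
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree
```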
Building a Decision Tree

• Algorithms :
• CART : Breiman et al., 1984
• ID3 : Quinlan, 1986
• Supports only nominal attributes
• C4.5 and C5.0 : Quinlan, 1993
• Successor of ID3
• Supports nominal as well as numeric attributes

Split Criteria

• At each node, the available attributes are evaluated on the basis of how well they
separate the classes of the training examples.
• A goodness function is used for this purpose
• ⇒ the split criterion
• Typical goodness functions :
• Information Gain (ID3/C4.5)
• Information Gain Ratio
• Gini Index (CART)
• …

Choosing the Splitting Attribute
• Example : the weather dataset
• Which attribute to select first ?

Day Outlook Temp Humidity Windy Play
x1 Sunny Hot High False No
x2 Sunny Hot High True No
x3 Overcast Hot High False Yes
x4 Rainy Mild High False Yes
x5 Rainy Cool Normal False Yes
x6 Rainy Cool Normal True No
x7 Overcast Cool Normal True Yes
x8 Sunny Mild High False No
x9 Sunny Cool Normal False Yes
x10 Rainy Mild Normal False Yes
x11 Sunny Mild Normal True Yes
x12 Overcast Mild High True Yes
x13 Overcast Hot Normal False Yes
x14 Rainy Mild High True No

Entropy and Information gain

• Entropy :
• Theory developed by Claude Shannon (1916-2001)
• Information as a quantity measured in bits
• Given a probability distribution, the information required to predict an event
is the distribution’s entropy
• Entropy formula :
• Let $X$ be a discrete random variable with possible values $x_1, x_2, \dots, x_n$
• The entropy $H$ of $X$ is given by :
• $H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i) = - \sum_{i=1}^{n} p_i \log_2 p_i$
Entropy and Information gain

• Entropy of a dataset :
• Let $(X, y)$ be a set of $n$ labeled examples
• Each data point $x \in X$ is labeled by a class $y \in C$
• Where $C = \{c_1, \dots, c_k\}$ is a finite set of $k$ predefined classes
• $p_1, p_2, \dots, p_k$ is the class distribution (proportion of examples of each class $c_i$)

• The entropy $H$ of $X$ is given by : $H(X) = - \sum_{i=1}^{k} p_i \log_2 p_i$

• Property :
• $0 \le H(X) \le \log_2 k$ (so $0 \le H(X) \le 1$ for two classes)
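As an illustration, a small Python helper (an addition of this write-up, not code from the course) that computes the entropy of a list of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H(X) of a list of class labels, in bits (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Weather dataset: 9 "Yes" and 5 "No" examples.
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))   # -> 0.94
```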

Entropy and Information gain

• Binary classification : $k = 2$
• Two classes (yes/no, 1/0, true/false, +/-, …)
• $H(X) = - p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$

Entropy and Information gain

• Example : the weather dataset (14 examples, 9 Yes and 5 No)

• Entropy of the weather dataset :
• $p_{yes} = \frac{9}{14}$, $p_{no} = \frac{5}{14}$
• $H(X) = - \frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$ bits

Entropy and Information gain

• Information Gain of an attribute =
• information before the split − information after the split

• $Gain(X, a_j) = H(X) - \sum_{v \in values(a_j)} \frac{|X_{a_j = v}|}{|X|} \times H(X_{a_j = v})$

• Notes :
• Information gain increases with the average purity of the subsets that an
attribute produces
• How to choose the splitting attribute ?
• Choose the one that results in the greatest information gain

Example : weather data

• $Gain(X, Outlook) = ?$
• $H(X) = 0.940$ bits
• $H(X_{Outlook = Sunny}) = - \frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971$ bits
• $H(X_{Outlook = Overcast}) = - 1 \log_2 1 - 0 \log_2 0 = 0$ bits
• $H(X_{Outlook = Rainy}) = - \frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} = 0.971$ bits
• ⇒
• $Gain(X, Outlook) = 0.940 - \frac{5}{14} \times 0.971 - \frac{4}{14} \times 0 - \frac{5}{14} \times 0.971 = 0.246$ bits
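The same computation can be scripted. The sketch below is illustrative (the Outlook and Play lists are transcribed from the weather table above) and reproduces Gain(X, Outlook) ≈ 0.246:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Gain = entropy before the split minus the weighted entropy of the subsets.
    n = len(rows)
    gain = entropy([r[target] for r in rows])
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
        "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
rows = [{"Outlook": o, "Play": p} for o, p in zip(outlook, play)]
print(round(information_gain(rows, "Outlook", "Play"), 3))   # ~0.247 (0.246 with the slide's rounding)
```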

Example : weather data

• $H(X) = 0.940$ bits
• $Gain(X, Outlook) = 0.246$ bits
• What does that mean ?
• If we split the training dataset according to the Outlook attribute, we gain
0.246 bits of information (insight into the data) !

Example : Weather data

• Compute the Information Gain for :
• Humidity, Windy, Temperature

Attribute Gain
Outlook 0.246
Humidity 0.151
Windy 0.048
Temperature 0.029

• The best attribute is Outlook
• It is the one to put at the root of the decision tree

Building the Decision Tree

• Construct the decision tree recursively
• ⇒ continue splitting within each branch
• Example, within the Outlook = Sunny branch :
• $Gain(Temperature) = 0.571$, $Gain(Windy) = 0.020$, $Gain(Humidity) = 0.971$

Building the Decision Tree

• Stop splitting when the data can't be split any further
• Note : not all leaves need to be pure; sometimes identical instances have
different classes !
• ⇒ this yields the final decision tree

Interpretation and use

• How to classify a new instance ?
• Example : (sunny, cool, normal, true) → follow the path outlook = sunny, humidity = normal ⇒ play = yes
• Attribute relevance :
• The root attribute (outlook) is the most relevant
• Temperature doesn't appear in the tree
• If outlook = sunny, then humidity is relevant
• …

Interpretation and use

• Rule induction :
• Example : one rule per leaf
• outlook = overcast ⇒ play = yes
• outlook = sunny and humidity = high ⇒ play = no
• …

Decision trees
High branching problem
Highly branching attributes

• High branching problem :
• attributes with a large number of values
• extreme case : an ID attribute
• Note : subsets are more likely to be pure if there is a large number of values
• ⇒ information gain is biased towards choosing attributes with a large number of values
• ⇒ this may result in overfitting
• i.e. selection of an attribute that is non-optimal for prediction

Highly branching attributes

• Example : the weather data with a Day attribute (Day = 1, 2, …, 14)
• $Gain(X, Day) = H(X) = 0.940$ bits
• ⇒ the Information Gain is maximal for the Day attribute
• Splitting on Day gives 14 pure, single-example branches :
• Day ? → 1 : No, 2 : No, …, 7 : Yes, …, 13 : Yes, 14 : No

Information Gain Ratio

• A modification of the information gain that reduces its bias towards highly
branching attributes
• Gain ratio takes the number and size of branches into account when
choosing an attribute
• It corrects the information gain by taking the intrinsic information of a split
into account
• ⇒ how much information do we need to tell which branch an instance belongs to ?

Information Gain Ratio

• Gain ratio (Quinlan, 1986) normalizes the information gain by the split information :

• $GainRatio(X, a_j) = \frac{Gain(X, a_j)}{SplitInfo(X, a_j)}$

• Where :
• $Gain(X, a_j) = H(X) - \sum_{v \in values(a_j)} \frac{|X_{a_j = v}|}{|X|} \times H(X_{a_j = v})$

• $SplitInfo(X, a_j) = - \sum_{v \in values(a_j)} \frac{|X_{a_j = v}|}{|X|} \times \log_2 \frac{|X_{a_j = v}|}{|X|}$

• ⇒ the importance of an attribute decreases as its split information gets larger
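A small sketch of the gain ratio in Python (names are illustrative; `gain_fn` stands for an information-gain function like the one shown earlier):

```python
import math
from collections import Counter

def split_info(rows, attribute):
    # Intrinsic information of the split: entropy of the branch-size distribution.
    n = len(rows)
    sizes = Counter(r[attribute] for r in rows)
    return -sum((s / n) * math.log2(s / n) for s in sizes.values())

def gain_ratio(rows, attribute, target, gain_fn):
    # gain_fn(rows, attribute, target) returns the information gain of the attribute.
    si = split_info(rows, attribute)
    return gain_fn(rows, attribute, target) / si if si > 0 else 0.0
```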
Information Gain Ratio

• Example :
• Day attribute :
• $Gain(X, Day) = H(X) = 0.940$ bits
• $SplitInfo(X, Day) = - 14 \times \frac{1}{14} \log_2 \frac{1}{14} = 3.807$ bits
• $GainRatio(X, Day) = \frac{0.940}{3.807} = 0.246$

• Outlook attribute :
• $Gain(X, Outlook) = 0.246$ bits
• $SplitInfo(X, Outlook) = - \frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577$ bits
• $GainRatio(X, Outlook) = \frac{0.246}{1.577} = 0.156$

Information Gain Ratio

Attribute Gain Split Info Gain Ratio
Outlook 0.246 1.577 0.156
Humidity 0.151 1.000 0.151
Windy 0.048 0.985 0.049
Temperature 0.029 1.557 0.019

• Choose the attribute that has the best Gain Ratio
• Note that the Day ID attribute is manually eliminated !

Information Gain Ratio : some issues !

• The Outlook attribute still has a good gain ratio
• But the Day attribute has a greater gain ratio (0.246 vs 0.156)
• Standard fix : manually eliminate identifier attributes to prevent splitting on them
• Problem with gain ratio : it may overcompensate
• It may choose an attribute just because its intrinsic information is very low
• Standard fix :
• First, only consider attributes with greater-than-average information gain
• Then, compare them on gain ratio

Decision trees
Gini Index
Gini Index

• Gini index of a dataset $(X, y)$ :
• Each data point $x \in X$ is labeled by a class $y \in C$
• Where $C = \{c_1, \dots, c_k\}$ is a finite set of $k$ predefined classes
• $p_i$ is the proportion of examples of class $c_i$

• The Gini Index is :
• $Gini(X) = 1 - \sum_{i=1}^{k} p_i^2$


Gini Split
• Gini index after splitting based on attribute $a_j$ :
• It is the weighted average of the Gini Index values of each subset $X_{a_j = v}$

• $GiniSplit(X, a_j) = \sum_{v \in values(a_j)} \frac{|X_{a_j = v}|}{|X|} \times Gini(X_{a_j = v})$

• The attribute with the smallest GiniSplit is chosen to split the data at a given node.
• The CART algorithm uses the Gini Index as its split criterion.
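A corresponding Python sketch (illustrative, not CART's actual implementation):

```python
from collections import Counter

def gini(labels):
    # Gini(X) = 1 - sum_i p_i^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, attribute, target):
    # Weighted average of the Gini index of each subset produced by the split.
    n = len(rows)
    total = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        total += len(subset) / n * gini(subset)
    return total   # the attribute with the smallest GiniSplit is preferred
```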
Gini Index vs Entropy

• The two metrics achieve a similar goal
• ⇒ they measure the discriminative power of a particular feature (attribute)
• [Figure : variation of the Gini index and the entropy criterion as the fraction of the
first class varies from 0 to 1; both criteria reach their maximum at a 50/50 class mix]

Decision trees
C4.5 Algorithm
How to deal with numeric attributes?

C4.5

• C4.5 innovations (Quinlan) :
• permits numeric attributes
• deals sensibly with missing values
• pruning to deal with noisy data
• C4.5 is one of the best-known and most widely used learning algorithms
• Last research version : C4.8, implemented in Weka as J4.8 (Java)
• Commercial successor : C5.0

Numeric attributes

• Simple and standard method : binary splits
• Example : temperature < 30
• But, unlike nominal attributes, a numeric attribute has many possible split points
• The solution is a straightforward extension :
• Evaluate the Information Gain (or another measure) for every possible split point of
the attribute
• Choose the "best" split point
• The Information Gain of the best split point is the info gain of the attribute
• This is computationally more demanding

Numeric attributes

• Efficient computation of the Information Gain :
• Sort the instances of the dataset $(X, y)$ by the values of the numeric attribute
• Evaluate the entropy only at breakpoints between points of different classes
• Breakpoints between values of the same class cannot be optimal (Fayyad & Irani, 1992)
• Split points can be placed between values or directly at values

Numeric attributes : Gain computation
• Example : Temperature attribute

Temp 30 26 28 21 18 16 15 23 20 24 24 23 27 22
Play No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No

• Sort the examples by Temp :

Temp 15 16 18 20 21 22 23 23 24 24 26 27 28 30
Play Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

• Evaluate the info gain at each breakpoint, e.g. s = 21.5 :
• $H(X, Temp \le 21.5) = - \frac{4}{5} \log_2 \frac{4}{5} - \frac{1}{5} \log_2 \frac{1}{5} = 0.722$ bits
• $H(X, Temp > 21.5) = - \frac{5}{9} \log_2 \frac{5}{9} - \frac{4}{9} \log_2 \frac{4}{9} = 0.991$ bits
• $Gain(X, Temp, s = 21.5) = 0.940 - \frac{5}{14} \times 0.722 - \frac{9}{14} \times 0.991 = 0.045$ bits

Split Point Gain
s = 15.5 0.047
s = 17 0.010
s = 21.5 0.045
s = 25 0.024
s = 26.5 0.0002
s = 29 0.113

• Finally, choose the breakpoint that gives the best gain (here s = 29).
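A sketch of this procedure in Python (illustrative names; the data is transcribed from the table above). It sorts by the numeric attribute, evaluates only the breakpoints between differently labeled neighbours, and returns the best threshold with its gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    # Sort by the numeric attribute; candidate thresholds lie between consecutive
    # distinct values whose classes differ (Fayyad & Irani, 1992).
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_threshold, best_gain = None, -1.0
    for i in range(1, n):
        same_value = pairs[i - 1][0] == pairs[i][0]
        same_class = pairs[i - 1][1] == pairs[i][1]
        if same_value or same_class:
            continue                                  # such breakpoints cannot be optimal
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

temp = [30, 26, 28, 21, 18, 16, 15, 23, 20, 24, 24, 23, 27, 22]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
        "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(best_numeric_split(temp, play))   # -> (29.0, ~0.113), matching the table above
```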

Decision trees
Stopping, Overfitting and Pruning
Stopping Criterion

• As the size of the decision tree increases, it may overfit the training data
• ⇒ it may generalize poorly to unseen test instances
• The stopping criterion is generally related to the pruning strategy

Pruning
• Goal : prevent overfitting to noise in the data
• Method : remove overgrown subtrees that do not improve the expected accuracy
on new data.
• Example : The contact lenses data
Pruning

• Two strategies for pruning the decision tree :
• Postpruning : take a fully grown decision tree and discard unreliable parts
• Prepruning : stop growing a branch when the information becomes unreliable

• Postpruning is preferred in practice
• prepruning can "stop too early"
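For practical use, libraries implement both strategies. The example below is illustrative only (not the course's code) and assumes scikit-learn is installed: `max_depth` / `min_samples_leaf` act as prepruning, while `ccp_alpha` enables cost-complexity postpruning.

```python
# Illustrative only: pre- vs post-pruning with scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growing branches early.
pre = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5)
pre.fit(X_train, y_train)

# Postpruning: grow the tree, then prune subtrees via cost-complexity pruning.
post = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)
post.fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```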

Decision trees
Missing Values
Missing values

• Missing values may be estimated during the data cleaning phase
• Classification itself can be used to estimate missing values
• Challenge : more than one attribute can have missing values

Missing values

• C4.5 handles missing values within the algorithm (denoted "?")
• Simple idea :
• Treat "missing" as a separate value of the attribute
• This may not be appropriate if values are missing for different reasons
• Example : pregnant = missing for a male patient should be treated
differently (no) than for an adult female patient (unknown)

Summary

To Sum Up
• Algorithms for top-down induction of decision trees
• ID3, C4.5, CART
• Measures for choosing the Splitting Attribute
• Information Gain : biased towards attributes with a large number of values
• ID3 and C4.5
• Gain Ratio takes number and size of branches into account
• Gini Index
• CART
• There are many other attribute selection criteria, but they make almost no difference in the accuracy of the result !
• ID3 vs C4.5 :
• ID3 processes only nominal attributes
• C4.5 processes nominal as well as numeric attributes, and deals with missing values and noisy data.
