Classification
Decision Trees
EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024
What is a decision tree
• Decision trees are a classification method (a classifier)
• The classification process is modeled as :
  • a set of hierarchical decisions on the feature variables (attributes),
  • arranged in a tree-like structure
What is a decision tree
• A decision tree is a tree composed of nodes, branches and leaves.
• Each node contains a test (condition) on one or more attributes.
  • ⇒ the split criterion
• At each node, the condition is chosen so that it separates the training examples into distinct classes
as well as possible
• Each branch of a node represents an outcome of the test
• Example : Color=red, Color=blue, Color=green (3 branches)
• A leaf node represents a class label (class value).
• A new instance is classified by following a matching path to a leaf node.
What is a decision tree
• Example : credit approval
• With attributes : Marital status, gender, age, has children
• New customers :
  • (Single, Male, 25, No, ?)
  • (Married, Female, 35, Yes, ?)
• [Decision tree figure] Root : Marital status
  • Single → test Age : ≥ 30 → Yes, < 30 → No
  • Married → Yes
  • Divorced → test Has Children : yes → No, no → Yes
Classification
Decision Tree Induction
Building a Decision Tree
• Top-down tree construction : recursive divide-and-conquer (see the code sketch after this slide)
• At start, all training examples are at the root node
• Select one attribute for the root node and create the corresponding branches
• Split the instances into subsets, one subset for each branch
• Repeat recursively for each branch
• Stop the growth of the tree based on a stopping criterion and create a leaf
  • A simple stopping criterion : all instances on a branch have the same class (beware of overfitting !)
• Problem :
• How to choose the best splitting attribute to test in a node?
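Below is a minimal Python sketch of this top-down, divide-and-conquer procedure (illustrative only, not the exact ID3/C4.5 pseudocode). The `choose_best_attribute` helper is a placeholder for the split criteria introduced in the following slides, and examples are assumed to be dictionaries mapping attribute names to values.

```python
from collections import Counter

def choose_best_attribute(rows, labels, attributes):
    # Placeholder: a real implementation scores each attribute with a split
    # criterion (information gain, gain ratio, Gini index, ... see below).
    return attributes[0]

def build_tree(rows, labels, attributes):
    """Top-down, recursive divide-and-conquer tree induction."""
    # Stopping criterion: all examples share one class, or no attribute is left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    best = choose_best_attribute(rows, labels, attributes)
    node = {"attribute": best, "branches": {}}
    for value in {row[best] for row in rows}:          # one branch per attribute value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx],                    # subset of examples for this branch
            [labels[i] for i in idx],
            [a for a in attributes if a != best],      # remaining attributes
        )
    return node
```

The returned tree is a nested dictionary; a leaf is simply a class label (the majority class of the remaining examples).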
Building a Decision Tree
• Algorithms:
• CART : Breiman et al. 1984
• ID3 : Quinlan 1986
  • Supports only nominal attributes
• C4.5 and C5.0 : Quinlan 1993
  • Successor of ID3
  • Supports both nominal and numeric attributes
Split Criteria
• At each node, the available attributes are evaluated on the basis of how well they
separate the classes of the training examples.
• A goodness function is used for this purpose
  • ⇒ the split criterion
• Typical goodness functions:
• Information Gain (ID3/C4.5)
• Information Gain Ratio
• Gini Index (CART)
• …
Choosing the Splitting Attribute
• Example : weather dataset
• Which attribute to select first ?
Day Outlook Temp Humidity Windy Play
x1 Sunny Hot High False No
x2 Sunny Hot High True No
x3 Overcast Hot High False Yes
x4 Rainy Mild High False Yes
x5 Rainy Cool Normal False Yes
x6 Rainy Cool Normal True No
x7 Overcast Cool Normal True Yes
x8 Sunny Mild High False No
x9 Sunny Cool Normal False Yes
x10 Rainy Mild Normal False Yes
x11 Sunny Mild Normal True Yes
x12 Overcast Mild High True Yes
x13 Overcast Hot Normal False Yes
x14 Rainy Mild High True No
Entropy and Information gain
• Entropy :
• Theory developed by Claude Shannon (1916-2001)
• Information as a quantity measured in bits
• Given a probability distribution, the information required to predict an event
is the distribution’s entropy
• Entropy formula :
  • Let $X$ be a discrete random variable with possible values $x_1, x_2, \dots, x_n$
  • The entropy $H$ of $X$ is given by :
  • $H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Entropy and Information gain
• Entropy of a dataset :
• Let $(X, y)$ be a set of $n$ labeled examples
  • Each data point $x \in X$ is labeled by a class $y \in C$
  • Where $C = \{c_1, \dots, c_k\}$ is a finite set of $k$ predefined classes
  • $p_1, p_2, \dots, p_k$ is the class distribution (proportions of examples of each class $c_i$)
• The entropy $H$ of $X$ is given by : $H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i$
• Property :
  • $0 \le H(X) \le \log_2 k$ (so at most 1 bit for binary classification, $k = 2$)
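A small Python helper for this entropy (a sketch that computes $H$ from a list of class labels, using base-2 logarithms):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum_i p_i * log2(p_i), computed from a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```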
Entropy and Information gain
• Binary classification k = 2
• Two classes (yes/no, 1/0, true/false, +/-, …)
• $H(X) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$
Entropy and Information gain
• Example : entropy of the weather dataset (14 examples : 9 Yes, 5 No)
  • $p_{yes} = \frac{9}{14}$, $p_{no} = \frac{5}{14}$
  • $H(X) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$ bits
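The same value can be checked with the `entropy` helper sketched above (the label list below is the Play column of the weather data):

```python
# Play labels of the 14 weather examples (9 Yes, 5 No), reusing entropy() above
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(round(entropy(play), 3))   # -> 0.94
```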
Entropy and Information gain
• Information Gain of an attribute =
• Information before split – Information after split
• $Gain(X, a_j) = H(X) - \sum_{v \in values(a_j)} \frac{card(X_{a_j = v})}{card(X)} \times H(X_{a_j = v})$
• where $X_{a_j = v}$ is the subset of examples for which attribute $a_j$ takes the value $v$
• Notes :
• Information gain increases with the average purity of the subsets that an
attribute produces
• How to choose the splitting attribute ?
• Choose the one that results in greatest information gain
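A possible Python implementation of this gain, reusing the `entropy` helper sketched earlier (examples are assumed to be dictionaries mapping attribute names to values):

```python
def information_gain(rows, labels, attr):
    """Gain(X, a) = H(X) - sum over values v of |X_{a=v}|/|X| * H(X_{a=v})."""
    n = len(labels)
    subsets = {}                                   # labels of each subset X_{a=v}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - remainder
```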
Example : weather data
• $Gain(X, Outlook) =\ ?$
• $H(X) = 0.940$ bits
• $H(X_{Outlook = Sunny}) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971$ bits
• $H(X_{Outlook = Overcast}) = -1 \log_2 1 - 0 \log_2 0 = 0$ bits
• $H(X_{Outlook = Rainy}) = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} = 0.971$ bits
• ⇒ $Gain(X, Outlook) = 0.940 - \frac{5}{14} \times 0.971 - \frac{4}{14} \times 0 - \frac{5}{14} \times 0.971 = 0.246$ bits
Example : weather data
• $H(X) = 0.940$ bits
• $Gain(X, Outlook) = 0.246$ bits
• What does that mean ?
  • If we split the training dataset according to the Outlook attribute, we gain 0.246 bits of
information (insight into the data) !
Example : Weather data
• Compute Information Gain for :
• Humidity, Windy, Temperature
Attribute Gain
Outlook 0.246
Humidity 0.151
Windy 0.048
Temperature 0.029
• The best attribute is Outlook
• The one to put as root of the decision tree
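These values can be reproduced with the helpers sketched above; the tuple encoding of the weather data below is my own, not taken from the slides:

```python
ATTRS = ["Outlook", "Temp", "Humidity", "Windy"]
DATA = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"),      ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),  ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),  ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),  ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),   ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),("Rainy", "Mild", "High", True, "No"),
]
rows = [dict(zip(ATTRS, d[:4])) for d in DATA]
labels = [d[4] for d in DATA]
for a in ATTRS:
    print(a, round(information_gain(rows, labels, a), 3))
# prints approximately: Outlook 0.247, Temp 0.029, Humidity 0.152, Windy 0.048
# (tiny rounding differences from the table above are expected)
```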
Building the decision Tree
• Construct the decision tree recursively
• ⇒ continue splitting each branch
• For the Outlook = Sunny branch :
  • $Gain(temperature) = 0.571$ bits, $Gain(windy) = 0.020$ bits, $Gain(humidity) = 0.971$ bits
Building the decision Tree
• Stop splitting when data can’t be split any more
• Note: not all leaves need to be pure; sometimes identical instances have
different classes !
• ⇒ the final decision tree
Interpretation and use
• How to classify new instance ?
• (sunny, cool, normal, true)
• Attribute relevance :
  • The root attribute (Outlook) is the most relevant
  • Temperature doesn’t appear in the tree
  • If Outlook = Sunny, then Humidity becomes relevant
• …
Interpretation and use
• Rule induction :
• Example : one rule per leaf
• outlook = overcast ⇒ play = yes
• outlook = sunny and humidity = high ⇒ play = no
• …
Decision trees
High branching problem
Highly branching attributes
• High branching problem:
• attributes with a large number of values
• extreme case: ID attribute
• Note : Subsets are more likely to be pure if there is a large number of
values
• ⇒ Information gain is biased towards choosing attributes with a large
number of values
• ⇒ This may result in overfitting
  • i.e. selection of an attribute that is non-optimal for prediction
Highly branching attributes
• Example : weather data with an added Day attribute (Day = 1, 2, …, 14)
• $Gain(X, Day) = H(X) = 0.940$ bits
  • each Day value identifies a single example, so every subset is pure
• ⇒ Information gain is maximal for the Day attribute
• [Figure] Splitting on Day gives 14 branches (1, 2, …, 14), each ending in a single-example
leaf (No, No, …, Yes, …, Yes, No)
Information Gain Ratio
• A modification of the information gain that reduces its bias towards highly
branching attributes
• Gain ratio takes the number and size of branches into account when
choosing an attribute
• It corrects the information gain by taking the intrinsic information of a split
into account
• ⇒ how much info do we need to tell which branch an instance belongs to ?
Information Gain Ratio
• Gain ratio (Quinlan 1986) normalizes information gain by :
• $GainRatio(X, a_j) = \frac{Gain(X, a_j)}{SplitInfo(X, a_j)}$
• Where :
  • $Gain(X, a_j) = H(X) - \sum_{v \in values(a_j)} \frac{card(X_{a_j = v})}{card(X)} \times H(X_{a_j = v})$
  • $SplitInfo(X, a_j) = -\sum_{v \in values(a_j)} \frac{card(X_{a_j = v})}{card(X)} \times \log_2 \frac{card(X_{a_j = v})}{card(X)}$
• ⇒ The importance of an attribute decreases as its split information gets larger
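A sketch of both quantities in Python, reusing `entropy` and `information_gain` from the earlier snippets:

```python
from collections import Counter
from math import log2

def split_info(rows, attr):
    """SplitInfo(X, a): entropy of the partition induced by attribute a."""
    n = len(rows)
    sizes = Counter(row[attr] for row in rows).values()
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(rows, labels, attr):
    si = split_info(rows, attr)
    return information_gain(rows, labels, attr) / si if si > 0 else 0.0

# On the weather data encoded earlier: gain_ratio(rows, labels, "Outlook") ≈ 0.156
```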
Information Gain Ratio
• Example :
• Day attribute :
  • $Gain(X, Day) = H(X) = 0.940$ bits
  • $SplitInfo(X, Day) = -14 \times \frac{1}{14} \log_2 \frac{1}{14} = 3.807$ bits
  • $GainRatio(X, Day) = \frac{0.940}{3.807} = 0.247$
• Outlook attribute :
  • $Gain(X, Outlook) = 0.246$ bits
  • $SplitInfo(X, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577$ bits
  • $GainRatio(X, Outlook) = \frac{0.246}{1.577} = 0.156$
Information Gain Ratio
Attribute Gain Split Info Gain Ratio
Outlook 0.246 1.577 0.156
Humidity 0.151 1 0.151
Windy 0.048 0.985 0.049
Temperature 0.029 1.557 0.019
• Choose attribute that has the best Gain Ratio
• Note that Day ID is manually eliminated !!
Information Gain Ratio : some issues !
• The Outlook attribute still has a good gain ratio
• But the Day attribute has an even greater gain ratio
• Standard fix: manually eliminate identifiers to prevent splitting on that type of
attribute
• Problem with gain ratio: it may overcompensate
• May choose an attribute just because its intrinsic information is very low
• Standard fix:
• First, only consider attributes with greater than average information gain
• Then, compare them on gain ratio
Decision trees
Gini Index
Gini Index
• Gini index of a dataset $(X, y)$ :
  • Each data point $x \in X$ is labeled by a class $y \in C$
  • Where $C = \{c_1, \dots, c_k\}$ is a finite set of $k$ predefined classes
  • $p_i$ is the proportion of examples of class $c_i$
• The Gini index is :
  • $Gini(X) = 1 - \sum_{i=1}^{k} p_i^2$
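As a sketch, the Gini index of a list of class labels in Python:

```python
from collections import Counter

def gini(labels):
    """Gini(X) = 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# For the weather data (9 Yes, 5 No): 1 - (9/14)**2 - (5/14)**2 ≈ 0.459
```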
Gini Split
• Gini index after splitting based on attribute $a_j$ :
  • It is the weighted average of the Gini index values of the subsets $X_{a_j = v}$
  • $GiniSplit(X, a_j) = \sum_{v \in values(a_j)} \frac{card(X_{a_j = v})}{card(X)} \times Gini(X_{a_j = v})$
GDHI J
• The attribute with the smallest GiniSplit is chosen to split the data at
a given node.
• The CART algorithm uses the Gini Index as the split criterion
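And a corresponding `GiniSplit` sketch, reusing the `gini` helper above (rows as attribute-name → value dictionaries):

```python
def gini_split(rows, labels, attr):
    """Weighted average Gini of the subsets X_{a=v} produced by splitting on attr."""
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    return sum(len(ys) / n * gini(ys) for ys in subsets.values())

# The attribute minimizing gini_split(...) is chosen at each node (CART-style).
```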
Gini Index vs Entropy
• The two metrics achieve a similar goal
• ⇒ they measure the discriminative power of a particular feature (attribute)
• [Figure : variation of the two criteria (Gini index and entropy) with the class distribution
of a two-class problem; both are 0 for a pure node and maximal at a 50/50 split]
Decision trees
C4.5 Algorithm
How to deal with numeric attributes?
C4.5
• C4.5 innovations (Quinlan):
• permit numeric attributes
• deal sensibly with missing values
• pruning to deal with noisy data
• C4.5 is one of the best-known and most widely used learning algorithms
• Last research version: C4.8, implemented in Weka as J4.8 (Java)
• Commercial successor: C5.0
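For a quick hands-on illustration, scikit-learn can be used; note that its `DecisionTreeClassifier` implements a CART-style tree with binary splits (not C4.5), and that the ordinal encoding of the nominal attributes below is my own choice for the example:

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Weather data: (Outlook, Humidity, Windy) -> Play
X = [("Sunny", "High", "False"), ("Sunny", "High", "True"), ("Overcast", "High", "False"),
     ("Rainy", "High", "False"), ("Rainy", "Normal", "False"), ("Rainy", "Normal", "True"),
     ("Overcast", "Normal", "True"), ("Sunny", "High", "False"), ("Sunny", "Normal", "False"),
     ("Rainy", "Normal", "False"), ("Sunny", "Normal", "True"), ("Overcast", "High", "True"),
     ("Overcast", "Normal", "False"), ("Rainy", "High", "True")]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

X_enc = OrdinalEncoder().fit_transform(X)                          # nominal -> numeric codes
clf = DecisionTreeClassifier(criterion="entropy").fit(X_enc, y)    # entropy ~ information gain
print(export_text(clf, feature_names=["Outlook", "Humidity", "Windy"]))
```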
Numeric attributes
• Simple and standard method : binary splits
• Example : temperature < 30
• But, unlike nominal attributes, a numeric attribute has many possible
splitting points
• Solution is a straightforward extension :
• Evaluate Information Gain (or other measure) for every possible split point of
the attribute
• Choose “best” split point
• Information Gain for best split point is info gain for attribute
• Computationally more demanding
Numeric attributes
• Efficient computation of Information Gain :
• Sort the instances of the dataset $(X, y)$ by the values of the numeric attribute
• Evaluate entropy only between points of different classes
• Breakpoints between values of the same class cannot be optimal (Fayyad & Irani, 1992)
• Split points can be placed between values or directly at values
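A sketch of this search in Python, reusing the `entropy` helper from earlier. For simplicity it evaluates every midpoint between consecutive distinct values rather than only class-change boundaries (the Fayyad & Irani optimization):

```python
def best_numeric_split(values, labels):
    """Return (threshold, gain) of the best binary split value <= t / value > t."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate midpoint
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

On the Temperature values of the next slide, this scan should give 0.045 at s = 21.5 and 0.113 at s = 29, matching the split-point table.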
Numeric attributes : Gain computation
• Example : Temperature attribute
Temp 30 26 28 21 18 16 15 23 20 24 24 23 27 22
Play No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No
• Sort the examples :
Temp 15 16 18 20 21 22 23 23 24 24 26 27 28 30
Play Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
• Evaluate the info gain at each breakpoint (example : s = 21.5) :
  • $H(X_{Temp \le 21.5}) = -\frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5} = 0.722$ bits
  • $H(X_{Temp \ge 21.5}) = -\frac{5}{9} \log_2 \frac{5}{9} - \frac{4}{9} \log_2 \frac{4}{9} = 0.991$ bits
  • $Gain(X, Temp, s = 21.5) = 0.940 - \frac{5}{14} \times 0.722 - \frac{9}{14} \times 0.991 = 0.045$ bits

Split Point   Gain
s = 15.5      0.047
s = 17        0.010
s = 21.5      0.045
s = 25        0.024
s = 26.5      0.0002
s = 29        0.113

• Finally, choose the breakpoint that gives the best gain.
Decision trees
Stopping, Overfitting and Pruning
Stopping Criterion
• When the size of the decision tree increases, it may overfit
• ⇒ it may generalize poorly to unseen test instances
• The stopping criterion is generally related to the pruning strategy
Pruning
• Goal : prevent overfitting to noise in the data
• Method : remove overgrown subtrees that do not improve the expected accuracy
on new data.
• Example : The contact lenses data
Pruning
• Two strategies for pruning the decision tree:
• Postpruning : take a fully-grown decision tree and discard unreliable parts
• Prepruning : stop growing a branch when information becomes unreliable
• Postpruning is preferred in practice
• prepruning can “stop too early”
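As an illustration of postpruning, the sketch below uses scikit-learn's cost-complexity pruning (a CART-style mechanism, not C4.5's error-based pruning); the dataset and the `ccp_alpha` value are arbitrary choices for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # unpruned tree
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)  # pruned tree

print("full  :", full.get_n_leaves(), "leaves, test accuracy", round(full.score(X_te, y_te), 3))
print("pruned:", pruned.get_n_leaves(), "leaves, test accuracy", round(pruned.score(X_te, y_te), 3))
```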
Decision trees
Missing Values
Missing values
• Missing values may be estimated during the data cleaning phase
• Classification can be used for missing values estimation
• Challenge : more than one attribute can have missing values
Missing values
• C4.5 handles missing values directly in the algorithm (denoted “?”)
• Simple idea:
• Treat missing as a separate value of the attribute
• This may not be good if values are missing due to different reasons
• Example : attribute pregnant=missing for a male patient should be treated
differently (no) than for an adult female patient (unknown)
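A minimal pandas sketch of the “separate value” idea (the column and the placeholder value "?" are illustrative):

```python
import pandas as pd

outlook = pd.Series(["Sunny", None, "Rainy", "Overcast", None], name="Outlook")
outlook = outlook.fillna("?")          # treat missing as its own category "?"
print(outlook.value_counts())
```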
Summary
To Sum Up
• Algorithms for top-down induction of decision trees
• ID3, C4.5, CART
• Measures for choosing the Splitting Attribute
• Information Gain : biased towards attributes with a large number of values
• ID3 and C4.5
• Gain Ratio takes number and size of branches into account
• Gini Index
• CART
• There are many other attribute selection criteria, but they make almost no difference in the accuracy of the result !
• ID3 vs C4.5
• ID3 processes only nominal attributes
• C4.5 processes nominal as well as numeric attributes, and deals with missing values and noisy data.