Decision Trees
[Figure: a classifier maps an input with attribute values X1 = x1, ..., XM = xM to a class prediction Y = y; the classifier is learned from training data.]
Decision Tree Example
• Three variables:
– Hair = {blond, dark}
– Height = {tall, short}
– Country = {Gromland, Polvia}
Training data:
(B,T,P)
(B,T,P)
(B,S,G)
(D,S,G)
(D,T,G)
(B,S,G)

[Tree: the root node contains P:2 G:4 and splits on Hair. The Hair = B branch contains P:2 G:2 and splits on Height: Height = T gives P:2 G:0 and Height = S gives P:0 G:2. The Hair = D branch contains P:0 G:2.]
At each level of the tree, we split the data according to the value of one of the attributes. After enough splits, only one class is represented in the node: this is a terminal leaf of the tree, and we call that class the output class for that node. For example, the Hair = D node contains P:0 G:2, so ‘G’ is the output for this node.

[Tree repeated: root P:2 G:4; Hair = B → P:2 G:2, then Height = T → P:2 G:0 and Height = S → P:0 G:2; Hair = D → P:0 G:2.]
A new input is classified by following the tree all the way down to a leaf and reporting the output class of that leaf. For example:
(B,T) is classified as P
(D,S) is classified as G
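To make the classification procedure concrete, here is a small sketch (my own Python, not from the slides) that hard-codes the Hair/Height tree above as nested dictionaries and follows it down to a leaf; the attribute/value encoding (‘B’/‘D’, ‘T’/‘S’) mirrors the training data.

```python
# Sketch of the Hair/Height example tree as nested dictionaries.
tree = {
    "attribute": "Hair",
    "children": {
        "B": {
            "attribute": "Height",
            "children": {
                "T": {"leaf": "P"},   # P:2 G:0 -> pure leaf, output P
                "S": {"leaf": "G"},   # P:0 G:2 -> pure leaf, output G
            },
        },
        "D": {"leaf": "G"},           # P:0 G:2 -> pure leaf, output G
    },
}

def classify(node, x):
    """Follow the tree down to a leaf and return that leaf's output class."""
    if "leaf" in node:
        return node["leaf"]
    value = x[node["attribute"]]
    return classify(node["children"][value], x)

print(classify(tree, {"Hair": "B", "Height": "T"}))  # -> 'P'
print(classify(tree, {"Hair": "D", "Height": "S"}))  # -> 'G'
```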
General Case (Discrete Attributes)
• We have R observations from training data
– Each observation has M attributes X1,..,XM
– Each Xi can take N distinct discrete values
– Each observation has a class attribute Y with C distinct (discrete) values
– Problem: Construct a sequence of tests on the attributes such that, given a new input (x1,..,xM), the class attribute y is correctly predicted

[Table: the training data consists of rows Data 1, Data 2, ..., Data R, each with attribute values x1,..,xM and class y; a new input with attributes x’1,..,x’M has unknown class ‘???’. X = attributes of training data (R×M), Y = class of training data (R).]
General Decision Tree (Discrete Attributes)

[Tree: the root tests X1, with one branch per possible value of X1 (from the first possible value to the nth). Some branches end directly in a leaf (e.g. output class Y = y1); others test further attributes (e.g. “Xj = ith possible value for Xj?”) until a leaf with an output class (e.g. Y = yc) is reached.]
Decision Tree Example

[Scatter plot: 11 training points from two classes, partitioned by the lines X1 = 0.5 and X2 = 0.5.]

[Tree: the root node contains 7 examples of one class and 4 of the other (:7 :4) and tests X1 < 0.5. One child (:3 :4) tests X2 < 0.5, giving pure children :3 :0 and :0 :4; the other child (:4 :0) is already pure.]
A new input is classified by following the tree all the way down to a leaf and reporting the output of that leaf. For example, (0.2,0.8) reaches the leaf for X1 < 0.5 and X2 ≥ 0.5, and (0.8,0.2) reaches the pure leaf for X1 ≥ 0.5; each is assigned the class of its leaf (shown as a symbol in the original figure).
General Case (Continuous Attributes)
• We have R observations from training data
– Each observation has M attributes X1,..,XM
– Each Xi can now take continuous (real) values
– Each observation has a class attribute Y with C distinct (discrete) values
– Problem: Construct a sequence of tests of the form Xi < ti ? on the attributes such that, given a new input (x1,..,xM), the class attribute y is correctly predicted

[Table: the training data consists of rows Data 1, Data 2, ..., Data R, each with attribute values x1,..,xM and class y; a new input with attributes x’1,..,x’M has unknown class ‘???’. X = attributes of training data (R×M), Y = class of training data (R).]
General Decision Tree (Continuous Attributes)

[Tree: each internal node tests one attribute against a threshold (“X1 < t1?”, “Xj < tj?”, ...); each leaf outputs a class (Y = y1, ..., Y = yc).]
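As an illustrative sketch (the node format is an assumption of mine, not something defined in the slides), such a tree can be stored as nested records where internal nodes hold a test Xj < tj and leaves hold an output class; classification just walks the tests down to a leaf. The leaf labels and branch orientation below are placeholders for the earlier X1 < 0.5 / X2 < 0.5 example, since the class symbols live only in the original figure.

```python
# Internal node: {"attribute": j, "threshold": t, "left": ..., "right": ...}
# Leaf node:     {"leaf": y}
def predict(node, x):
    """Apply the tests Xj < tj down the tree and return the leaf's output class."""
    while "leaf" not in node:
        j, t = node["attribute"], node["threshold"]
        node = node["left"] if x[j] < t else node["right"]
    return node["leaf"]

# Placeholder version of the earlier example: attribute 0 is X1, attribute 1 is X2.
tree = {"attribute": 0, "threshold": 0.5,
        "left": {"attribute": 1, "threshold": 0.5,
                 "left": {"leaf": "leaf_1"},      # the :3 :0 leaf in the figure
                 "right": {"leaf": "leaf_2"}},    # the :0 :4 leaf in the figure
        "right": {"leaf": "leaf_3"}}              # the :4 :0 leaf in the figure

print(predict(tree, (0.2, 0.8)))  # -> 'leaf_2'
print(predict(tree, (0.8, 0.2)))  # -> 'leaf_3'
```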
Basic Questions
• How to choose the attribute/value to split
on at each level of the tree?
• When to stop splitting? When should a
node be declared a leaf?
• If a leaf node is impure, how should the
class label be assigned?
• If the tree is too large, how can it be
pruned?
How to choose the attribute/value to split on
at each level of the tree?
• Two classes (red circles/green crosses)
• Two attributes: X1 and X2
• 11 points in training data
• Idea: construct a decision tree such that the leaf nodes correctly predict the class of all the training examples
How to choose the attribute/value to split on
at each level of the tree?
[Figure: two candidate splits of the same data, one labeled “Good” and one labeled “Bad”.]
In the “Good” split, one node is “pure” because there is only one class left (no ambiguity in the class label) and the other node is almost “pure” (little ambiguity in the class label). In the “Bad” split, the nodes contain a mixture of classes and do not disambiguate between the classes.

We want to find the most compact, smallest-size tree (Occam’s razor) that classifies the training data correctly → we want to find the split choices that get us to pure nodes the fastest.
Digression: Information Content
• Suppose that we are dealing with data which can come from
four possible values (A, B, C, D)
• Each class may appear with some probability
• Suppose P(A) = P(B) = P(C) = P(D) = 1/4
• What is the average number of bits necessary to encode each
class?
• In this case: average = 2 = 2xP(A)+2xP(B)+2xP(C)+2xP(D)
– Encoding: A → 00, B → 01, C → 10, D → 11

[Histogram: frequency of occurrence vs. class number; the four classes are equally frequent.] The distribution is not very informative → impure.
Information Content
• Suppose now P(A) = 1/2 P(B) = 1/4 P(C) = 1/8 P(D) = 1/8
• What is the average number of bits necessary to encode each
class?
• In this case, the classes can be encoded by using 1.75 bits on
average
• Encoding: A → 0, B → 10, C → 110, D → 111
• Average = 1×P(A) + 2×P(B) + 3×P(C) + 3×P(D) = 1.75

[Histogram: frequency of occurrence vs. class number; class A dominates.] The distribution is more informative → higher purity.
Entropy
• In general, the average number of bits
necessary to encode n values is the
entropy:
H = − Σ_{i=1}^{n} Pi log2 Pi

• Pi = probability of occurrence of value i
– High entropy → all the classes are (nearly) equally likely
– Low entropy → a few classes are likely; most of the classes are rarely observed
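A minimal sketch (not part of the slides) of computing this entropy from a list of class labels, reproducing the 2-bit and 1.75-bit examples above:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i Pi log2 Pi, where Pi is the proportion of class i in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Uniform distribution over 4 classes -> 2 bits (the A/B/C/D example above).
print(entropy(["A", "B", "C", "D"]))                 # 2.0
# Skewed distribution (1/2, 1/4, 1/8, 1/8) -> 1.75 bits.
print(entropy(["A"] * 4 + ["B"] * 2 + ["C", "D"]))   # 1.75
```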
Entropy
[Figure: two histograms of frequency of occurrence vs. class number. A nearly flat distribution has high entropy; a sharply peaked distribution has low entropy. The entropy captures the degree of “purity” of the distribution.]
Example Entropy Calculation
Node (1): NA = 1, NB = 6
pA = NA/(NA+NB) = 1/7, pB = NB/(NA+NB) = 6/7
H1 = −pA log2 pA − pB log2 pB = 0.59

Node (2): NA = 3, NB = 2
pA = NA/(NA+NB) = 3/5, pB = NB/(NA+NB) = 2/5
H2 = −pA log2 pA − pB log2 pB = 0.97

H1 < H2 => node (2) is less pure than node (1)
(Here NA and NB are the frequencies of occurrence of classes A and B in each node, pA and pB the corresponding proportions, and H1, H2 the entropies of nodes (1) and (2).)
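A quick arithmetic check of the two node entropies, as a small Python sketch:

```python
import math

def H(p_a, p_b):
    """Two-class entropy: H = -pA log2 pA - pB log2 pB."""
    return -p_a * math.log2(p_a) - p_b * math.log2(p_b)

print(round(H(1/7, 6/7), 2))  # node (1): 0.59
print(round(H(3/5, 2/5), 2))  # node (2): 0.97
```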
Conditional Entropy
Entropy before splitting: H
After splitting, a fraction PL of the data goes to the left node, which has entropy HL, and a fraction PR of the data goes to the right node, which has entropy HR.
The average entropy after splitting is: HL × PL + HR × PR
(PL is the probability that a random input is directed to the left node; HL is the entropy of the left node, and similarly for PR and HR.) This average, HL × PL + HR × PR, is called the “conditional entropy”.
Information Gain
[Figure: a parent node with entropy H splits into a left child receiving a fraction PL of the data (entropy HL) and a right child receiving a fraction PR (entropy HR).]

We want nodes as pure as possible → we want to reduce the entropy as much as possible → we want to maximize the difference between the entropy of the parent node and the expected entropy of the children.

Maximize:
IG = H − (HL × PL + HR × PR)
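A small sketch (my own helper functions, assuming the class labels are given as Python lists) of the conditional entropy and the resulting information gain for a binary split:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(left_labels, right_labels):
    """HL*PL + HR*PR: average entropy after the split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * entropy(left_labels) \
         + (len(right_labels) / n) * entropy(right_labels)

def information_gain(labels, left_labels, right_labels):
    """IG = H - (HL*PL + HR*PR)."""
    return entropy(labels) - conditional_entropy(left_labels, right_labels)

# Example: a split that isolates one class completely gains a full bit.
print(information_gain(["A"] * 3 + ["B"] * 3, ["A"] * 3, ["B"] * 3))  # 1.0
```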
Notations
• Entropy: H(Y) = Entropy of the distribution
of classes at a node
• Conditional Entropy:
– Discrete: H(Y|Xj) = Entropy after splitting with
respect to variable j
– Continuous: H(Y|Xj,t) = Entropy after splitting
with respect to variable j with threshold t
• Information gain:
– Discrete: IG(Y|Xj) = H(Y) − H(Y|Xj) = reduction in entropy obtained by splitting with respect to variable j
– Continuous: IG(Y|Xj,t) = H(Y) − H(Y|Xj,t) = reduction in entropy obtained by splitting with respect to variable j with threshold t
Information Gain
Information Gain (IG) = amount by which the ambiguity is decreased by splitting the node. Maximize:
IG = H − (HL × PL + HR × PR)
Example: two candidate splits of the same node (H = 0.99, 11 training points).

Split 1: PL = 4/11, PR = 7/11, HL = 0, HR = 0.58
IG = H − (HL × 4/11 + HR × 7/11) = 0.62

Split 2: PL = 5/11, PR = 6/11, HL = 0.97, HR = 0.92
IG = H − (HL × 5/11 + HR × 6/11) = 0.052

Choose the first split because the information gain is greater than with the other split.
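Plugging the slide’s numbers into IG = H − (HL × PL + HR × PR) as a quick check (the rounded child entropies give ≈ 0.047 for the second split; the slide’s 0.052 comes from unrounded values):

```python
# Split 1: pure left child (HL = 0) with 4 points, right child with 7 points.
ig_1 = 0.99 - (0.0 * 4/11 + 0.58 * 7/11)   # ~0.62
# Split 2: both children stay impure (HL = 0.97, HR = 0.92).
ig_2 = 0.99 - (0.97 * 5/11 + 0.92 * 6/11)  # ~0.047 (0.052 with unrounded entropies)
print(round(ig_1, 2), round(ig_2, 3))
```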
More Complete Example
[Scatter plot: 20 training examples from class A and 20 training examples from class B; the attributes are the X1 and X2 coordinates.]

[Plot: IG as a function of the X1 split value.]
Best split value (max Information Gain) for the X1 attribute: 0.24, with IG = 0.138
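A sketch (my own code, which scans the observed attribute values as candidate thresholds; this is an assumption about how the IG-versus-split-value curve is generated) of finding the best split value for one continuous attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i Pi log2 Pi over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold t, IG) maximizing IG(Y|X,t) for one continuous attribute."""
    H = entropy(labels)
    n = len(labels)
    best_t, best_ig = None, -1.0
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        if not left or not right:          # skip degenerate splits
            continue
        ig = H - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

# Toy usage with one attribute: the best threshold separates the classes.
print(best_split([0.1, 0.2, 0.3, 0.8, 0.9], ["A", "A", "A", "B", "B"]))  # (0.8, ~0.97)
```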
[Plot: IG as a function of the X2 split value.]
Best split value (max Information Gain) for the X2 attribute: 0.234, with IG = 0.202

Best X1 split: 0.24, IG = 0.138
Best X2 split: 0.234, IG = 0.202
→ Split on X2 with threshold 0.234

[Scatter plot: the data partitioned by the line X2 = 0.234.]
Best X1 split: 0.24, IG = 0.138
Best X2 split: 0.234, IG = 0.202
→ Split on X2 with 0.234

One of the resulting nodes contains only data from a single class, so there is no point in splitting it further → return it as a leaf node with output ‘A’. The other node is not pure, so we need to split it further.
[Plot: IG as a function of the X1 split value for the remaining impure node.]
Best split value (max Information Gain) for the X1 attribute: 0.22, with IG ≈ 0.182
[Plot: IG as a function of the X2 split value.]
Best split value (max Information Gain) for the X2 attribute: 0.75, with IG ≈ 0.353

Best X1 split: 0.22, IG = 0.182
Best X2 split: 0.75, IG = 0.353
→ Split on X2 with threshold 0.75

[Scatter plot: the data partitioned by the lines X2 = 0.234 and X2 = 0.75.]
Best X1 split: 0.22, IG = 0.182
Best X2 split: 0.75, IG = 0.353
→ Split on X2 with 0.75

Again, one node now contains only data from a single class, so there is no point in splitting it further → return it as a leaf node with output ‘A’. The remaining impure region is then split on X1 with threshold 0.5.

Final decision tree:
[Figure: the final tree and the corresponding partition of the (X1, X2) plane; four leaves output class A and one outputs class B.] Each of the leaf nodes is pure → it contains data from only one class.
Final decision tree:
Given a new input (X1, X2), follow the tree down to a leaf and return the corresponding output class for this leaf.
Example: (X1, X2) = (0.5, 0.5)
[Figure: the final tree and the partition of the plane, with the example point marked.]
Basic Questions
• How to choose the attribute/value to split
on at each level of the tree?
• When to stop splitting? When should a
node be declared a leaf?
• If a leaf node is impure, how should the
class label be assigned?
• If the tree is too large, how can it be
pruned?
Pure and Impure Leaves and When
to Stop Splitting
All the data in the node comes from a single class → we declare the node to be a leaf and stop splitting. This leaf will output the class of the data it contains.

Several data points have exactly the same attribute values even though they are not all from the same class → we cannot split any further. We still declare the node to be a leaf, but it will output the class that is the majority of the classes in the node (in this example, ‘B’).
Decision Tree Algorithm (Continuous Attributes)
• LearnTree(X,Y)
– Input:
• Set X of R training vectors, each containing the values (x1,..,xM) of
M attributes (X1,..,XM)
• A vector Y of R elements, where yj = class of the jth datapoint
– If all the datapoints in X have the same class value y
• Return a leaf node that predicts y as output
– If all the datapoints in X have the same attribute value (x1,..,xM)
• Return a leaf node that predicts the majority of the class values in Y
as output
– Otherwise, try all the possible attributes Xj and thresholds t and choose the pair (j*, t*) for which IG(Y|Xj,t) is maximum
– XL, YL = set of datapoints for which xj* < t* and corresponding classes
– XH, YH = set of datapoints for which xj* >= t* and corresponding classes
– Left Child ← LearnTree(XL,YL)
– Right Child ← LearnTree(XH,YH)
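A runnable sketch of this pseudocode (the dictionary node format, the exhaustive threshold search over observed attribute values, and the tie-breaking are my own choices, not prescribed by the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i Pi log2 Pi over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def learn_tree(X, Y):
    """LearnTree sketch for continuous attributes.
    X: list of attribute tuples, Y: list of class labels."""
    # All datapoints have the same class value y -> leaf predicting y.
    if len(set(Y)) == 1:
        return {"leaf": Y[0]}
    # All datapoints have the same attribute values -> majority-class leaf.
    if all(tuple(x) == tuple(X[0]) for x in X):
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    R, M = len(X), len(X[0])
    H = entropy(Y)
    best = None  # (IG, j*, t*)
    for j in range(M):                              # try all attributes Xj
        for t in sorted(set(x[j] for x in X)):      # and all candidate thresholds t
            YL = [y for x, y in zip(X, Y) if x[j] < t]
            YH = [y for x, y in zip(X, Y) if x[j] >= t]
            if not YL or not YH:
                continue
            ig = H - (len(YL) / R) * entropy(YL) - (len(YH) / R) * entropy(YH)
            if best is None or ig > best[0]:
                best = (ig, j, t)
    _, j_star, t_star = best
    XL = [x for x in X if x[j_star] < t_star]
    YL = [y for x, y in zip(X, Y) if x[j_star] < t_star]
    XH = [x for x in X if x[j_star] >= t_star]
    YH = [y for x, y in zip(X, Y) if x[j_star] >= t_star]
    return {"attribute": j_star, "threshold": t_star,
            "left": learn_tree(XL, YL), "right": learn_tree(XH, YH)}
```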
Decision Tree Algorithm (Discrete Attributes)
• LearnTree(X,Y)
– Input:
• Set X of R training vectors, each containing the values
(x1,..,xM) of M attributes (X1,..,XM)
• A vector Y of R elements, where yj = class of the jth datapoint
– If all the datapoints in X have the same class value y
• Return a leaf node that predicts y as output
– If all the datapoints in X have the same attribute value
(x1,..,xM)
• Return a leaf node that predicts the majority of the class
values in Y as output
– Otherwise, try all the possible attributes Xj and choose the one, j*, for which IG(Y|Xj) is maximum
– For every possible value v of Xj*:
• Xv, Yv = set of datapoints for which xj* = v and corresponding classes
• Childv ← LearnTree(Xv,Yv)
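A corresponding sketch for the discrete case, with one child per value of the chosen attribute (again, the node representation and the guard restricting candidates to attributes that actually vary are my own choices):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def learn_tree_discrete(X, Y):
    """Discrete-attribute LearnTree sketch: the chosen attribute j* gets one
    child per value v of Xj*."""
    if len(set(Y)) == 1:                          # same class everywhere -> leaf
        return {"leaf": Y[0]}
    if all(tuple(x) == tuple(X[0]) for x in X):   # identical attributes -> majority leaf
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    R, M = len(X), len(X[0])
    H = entropy(Y)

    def cond_entropy(j):
        """H(Y|Xj): weighted average entropy of the groups Xj = v."""
        return sum(
            (sum(1 for x in X if x[j] == v) / R)
            * entropy([y for x, y in zip(X, Y) if x[j] == v])
            for v in set(x[j] for x in X)
        )

    # Only consider attributes that actually vary, so every child strictly shrinks.
    candidates = [j for j in range(M) if len(set(x[j] for x in X)) > 1]
    j_star = max(candidates, key=lambda j: H - cond_entropy(j))
    children = {}
    for v in set(x[j_star] for x in X):
        Xv = [x for x in X if x[j_star] == v]
        Yv = [y for x, y in zip(X, Y) if x[j_star] == v]
        children[v] = learn_tree_discrete(Xv, Yv)
    return {"attribute": j_star, "children": children}
```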
Decision Trees So Far
• Given R observations from training data, each
with M attributes X and a class attribute Y,
construct a sequence of tests (decision tree) to
predict the class attribute Y from the attributes X
• Basic strategy for defining the tests (“when to split”) → maximize the information gain on the training data set at each node of the tree
• Problems (next):
– Computational issues → how expensive is it to compute the IG?
– The tree will end up being much too big → pruning
– Evaluating the tree on training data is dangerous → overfitting