Data Warehousing and Mining (MU)
3.1 INTRODUCTION
• There are two forms of data analysis that can be used for extracting models describing important data classes or to predict future data trends.
• These two forms are as follows:
(i) Classification
(ii) Prediction
• Classification is a type of data analysis in which models defining relevant data classes are extracted.
• Classification models, called classifiers, predict categorical class labels, whereas prediction models predict continuous-valued functions.
• For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditure in dollars of potential customers on computer equipment, given their income and occupation.
3.2 BASIC CONCEPTS
In this section we will discuss what classification is, how classification works, the issues in classification, and the criteria for comparing classification methods.
3.2.1 What Is Classification?
• Following are examples of cases where the data analysis task is classification:
(i) A bank loan officer wants to analyze the loan data in order to know which customers (loan applicants) are risky and which are safe.
(ii) A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new laptop.
• In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
3.2.2 How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us understand the working of classification. The data classification process includes two steps:
1. Building the Classifier or Model
2. Using the Classifier for Classification
> 1. Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects or data points.
The classification algorithm derives classification rules from the training data, for example:
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky
Fig. 3.2.1 : Building a Classifier
> 2. Using the Classifier for Classification
• In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules.
• The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.
For example, the unseen tuple (John, middle_aged, low) is classified as loan_decision = risky.
Fig. 3.2.2 : Testing a Classifier
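To make the two-step process concrete, here is a minimal sketch (not from the text) using the scikit-learn library; the tiny loan-style dataset and its numeric encoding are hypothetical, chosen only to mirror the bank-loan example above.

```python
# Minimal sketch of the two-step classification process using scikit-learn.
# The tiny loan-style dataset below is hypothetical, for illustration only.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Encoded tuples: [age_group, income_level] -> loan_decision
# age_group: 0 = youth, 1 = middle_aged, 2 = senior; income_level: 0 = low, 1 = medium, 2 = high
X = [[0, 2], [0, 0], [1, 2], [1, 0], [2, 1], [2, 2], [0, 1], [1, 1]]
y = ["risky", "risky", "safe", "risky", "safe", "safe", "risky", "safe"]

# Step 1: Building the classifier from the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: Using the classifier -- estimate accuracy on test tuples, then classify new data
print("Estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("New applicant (middle_aged, low income):", clf.predict([[1, 0]]))
```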
3.2.3 Classification Issues
The major issue is preparing the data for classification. Preparing the data involves the following activities:
• Data Cleaning : Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis : The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and Reduction : The data can be transformed by any of the following methods:
(i) Normalization : The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range (a small sketch follows this list). Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
(ii) Generalization : The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use concept hierarchies.
(iii) Data can also be reduced by some other methods such as wavelet transformation, binning, histogram analysis and clustering.
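As a small illustration of the normalization activity mentioned above (my own sketch, not the textbook's), min-max scaling maps each value of an attribute into a chosen range such as [0, 1]:

```python
# Sketch of min-max normalization: scale attribute values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = (old_max - old_min) or 1.0   # avoid division by zero when all values are equal
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 45000, 73600, 98000]   # hypothetical attribute values
print(min_max_normalize(incomes))        # all values now fall in [0, 1]
```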
3.2.4 Comparison of Classification Methods
Following are the criteria for comparing methods of classification:
• Accuracy : The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed : This refers to the computational cost in generating and using the classifier or predictor.
• Robustness : It refers to the ability of the classifier or predictor to make correct predictions from given noisy data.
• Scalability : Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability : It refers to the extent to which the classifier or predictor can be understood.
3.3 DECISION TREE INDUCTION
UQ. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning? Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a structure that includes a root node, branches, and leaf nodes.
• Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
• The following decision tree is for the concept buys_laptop, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class (either buys_laptop = yes or buys_laptop = no).
Fig. 3.3.1 : Representation of a Decision Tree
• The benefits of having a decision tree are as follows:
1. It does not require any domain knowledge.
2. It is easy to comprehend.
3. The learning and classification steps of a decision tree are simple and fast.
3.3.1 Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner.
Algorithm : Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input :
• Data partition, D, which is a set of training tuples and their associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split-point or a splitting subset.
Output : A Decision Tree
Method :
1. create a node N;
2. if tuples in D are all of the same class, C, then
3. return N as a leaf node labeled with class C;
4. if attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D; // majority voting
6. apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
7. label node N with splitting_criterion;
8. if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
9. attribute_list <- attribute_list - splitting_attribute; // remove splitting_attribute
10. for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
11. let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. if Dj is empty then
13. attach a leaf labeled with the majority class in D to node N;
14. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
15. endfor
16. return N;
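The pseudocode can be rendered compactly in Python. The sketch below is my own illustration (not the textbook's code); it assumes each tuple is a dictionary of categorical attribute values plus a "class" key, uses information gain as the Attribute_selection_method, and grows multiway splits as in ID3.

```python
import math
from collections import Counter

def entropy(tuples):
    """Entropy of the class label distribution in a list of tuples."""
    counts = Counter(t["class"] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(tuples, attribute_list):
    """Attribute_selection_method: pick the attribute with the highest information gain."""
    base = entropy(tuples)
    def gain(attr):
        parts = Counter(t[attr] for t in tuples)
        remainder = sum((n / len(tuples)) * entropy([t for t in tuples if t[attr] == v])
                        for v, n in parts.items())
        return base - remainder
    return max(attribute_list, key=gain)

def generate_decision_tree(tuples, attribute_list):
    classes = [t["class"] for t in tuples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:                  # all tuples belong to the same class
        return classes[0]
    if not attribute_list:                      # no attributes left: majority voting
        return majority
    attr = best_split(tuples, attribute_list)   # splitting criterion
    node = {attr: {}}
    remaining = [a for a in attribute_list if a != attr]
    for value in set(t[attr] for t in tuples):  # grow one subtree per outcome
        subset = [t for t in tuples if t[attr] == value]
        node[attr][value] = generate_decision_tree(subset, remaining) if subset else majority
    return node
```

For example, calling generate_decision_tree(data, ["age", "income"]) on labeled loan tuples returns a nested dictionary whose inner keys are attribute values and whose leaves are class labels.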
3.3.2 Tree Pruning
• The decision tree built may overfit the training data. There could be too many branches, some of which may reflect anomalies in the training data due to noise or outliers.
• Tree pruning addresses this issue of overfitting the data by removing the least reliable branches (using statistical measures).
• This generally results in a more compact and reliable decision tree that is faster and more accurate in its classification of data.
• There are two approaches to prune a tree:
(i) Pre-pruning : The tree is pruned by halting its construction early.
(ii) Post-pruning : This approach removes a sub-tree from a fully grown tree.
Drawback of using a separate set of tuples to evaluate pruning
• The drawback of using a separate set of tuples to evaluate pruning is that it may not be representative of the training tuples used to create the original decision tree.
• If the separate set of tuples is skewed, then using it to evaluate the pruned tree would not be a good indicator of the pruned tree's classification accuracy.
• Furthermore, using a separate set of tuples to evaluate pruning means there are fewer tuples to use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not be so in data mining due to the availability of larger data sets.
UQ. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?
If pruning a subtree, we would remove the subtree completely with method (b). However, with method (a), if pruning a rule, we may remove any precondition of it. The latter is less restrictive.
3.3.3 Cost Complexity
The cost complexity is measured by the following two parameters:
(i) Number of leaves in the tree, and
(ii) Error rate of the tree.
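As a hedged illustration of post-pruning guided by these two parameters, scikit-learn's minimal cost-complexity pruning (the ccp_alpha parameter) trades the number of leaves against the error rate; the sketch below uses the bundled iris data purely for illustration and is not part of the text.

```python
# Sketch of cost-complexity post-pruning with scikit-learn (illustrative data).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alpha values computed from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    # Larger alpha -> fewer leaves (simpler tree) but usually a higher error rate
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  train_acc={pruned.score(X, y):.3f}")
```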
3.3.4 Classification using Information Gain (ID3)
• ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step.
• Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.
• Most generally, ID3 is used for classification problems with nominal features only.
Metrics in ID3
• As mentioned previously, the ID3 algorithm selects the best feature at each step while building a decision tree.
• ID3 uses Information Gain (or just Gain) to find the best feature.
• Information Gain calculates the reduction in entropy and measures how well a given feature separates or classifies the target classes. The feature with the highest Information Gain is selected as the best one.
• In simple words, Entropy is the measure of disorder, and the Entropy of a dataset is the measure of disorder in the target feature of the dataset. The expected amount of information (in bits) needed to assign a class to a randomly drawn object is called Entropy.
• In the case of binary classification (where the target column has only two types of classes), entropy is 0 if all values in the target column are homogeneous (similar) and 1 if the target column has an equal number of values for both classes.
• If we denote our dataset as D, its entropy is calculated as follows:
Entropy(D) = - Σ (i = 1 to m) p_i log₂(p_i)
where,
m is the total number of classes in the target column, and
p_i is the probability of class i, i.e. the ratio of the "number of rows with class i in the target column" to the "total number of rows" in the dataset.
• Information Gain for a feature column A is calculated as:
Gain(A) = Entropy(D) - Entropy(A)
where Entropy(A) denotes the weighted average entropy of the subsets obtained by splitting D on the values of A.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don't belong to the same class, split the dataset D into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat the above steps for the remaining features until we run out of features, or the decision tree has all leaf nodes.
Ex. 3.3.1 : Apply the ID3 algorithm on the following training dataset and extract the classification rules from the tree.

Day | Outlook | Temp. | Humidity | Wind | Play_Tennis
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
7 | Overcast | Cool | Normal | Strong | Yes
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
10 | Rain | Mild | Normal | Weak | Yes
11 | Sunny | Mild | Normal | Strong | Yes
12 | Overcast | Mild | High | Strong | Yes
13 | Overcast | Hot | Normal | Weak | Yes
14 | Rain | Mild | High | Strong | No

Soln. :
Let the class label attributes be as follows:
C1 = Play_Tennis = Yes = 9 Samples
C2 = Play_Tennis = No = 5 Samples
Therefore, P(C1) = 9/14 and P(C2) = 5/14
(i) Entropy before split for the given database D:
H(D) = (9/14) log₂(14/9) + (5/14) log₂(14/5) = 0.4097 + 0.5305 = 0.940
(ii) Choosing Outlook as the Splitting Attribute

Outlook | C1 (Play_Tennis = Yes) | C2 (Play_Tennis = No) | Entropy H
Sunny | 2 | 3 | 0.971
Overcast | 4 | 0 | 0
Rain | 3 | 2 | 0.971

H(Outlook) = (5/14) × H(Sunny) + (4/14) × H(Overcast) + (5/14) × H(Rain)
= (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694
Gain(Outlook) = H(D) - H(Outlook) = 0.940 - 0.694 = 0.246
(iii) Choosing Temperature as the Splitting Attribute

Temperature | C1 (Play_Tennis = Yes) | C2 (Play_Tennis = No) | Entropy H
Hot | 2 | 2 | 1
Mild | 4 | 2 | 0.92
Cool | 3 | 1 | 0.81

H(Temperature) = (4/14) × H(Hot) + (6/14) × H(Mild) + (4/14) × H(Cool)
= (4/14) × 1 + (6/14) × 0.92 + (4/14) × 0.81 = 0.911
Gain(Temperature) = H(D) - H(Temperature) = 0.940 - 0.911 = 0.029
(iv) Choosing Humidity as the Splitting Attribute

Humidity | C1 (Play_Tennis = Yes) | C2 (Play_Tennis = No) | Entropy H
High | 3 | 4 | 0.985
Normal | 6 | 1 | 0.592

H(Humidity) = (7/14) × H(High) + (7/14) × H(Normal) = (7/14) × 0.985 + (7/14) × 0.592 = 0.789
Gain(Humidity) = H(D) - H(Humidity) = 0.940 - 0.789 = 0.151
(v) Choosing Wind as the Splitting Attribute

Wind | C1 (Play_Tennis = Yes) | C2 (Play_Tennis = No) | Entropy H
Strong | 3 | 3 | 1
Weak | 6 | 2 | 0.811

H(Wind) = (6/14) × H(Strong) + (8/14) × H(Weak) = (6/14) × 1 + (8/14) × 0.811 = 0.892
Gain(Wind) = H(D) - H(Wind) = 0.940 - 0.892 = 0.048

Summary :
Gain(Outlook|D) = 0.246
Gain(Temperature|D) = 0.029
Gain(Humidity|D) = 0.151
Gain(Wind|D) = 0.048

Outlook attribute has the highest gain; therefore, it is used as the decision attribute in the root node. Since Outlook has three possible values, the root node has three branches (Sunny, Overcast, Rain).
Fig. P. 3.3.1(a) : Root node Outlook with branches Sunny, Overcast and Rain

Now, consider Outlook = Sunny and count the number of tuples from the original dataset D. Let us denote it as D1.

Day | Outlook | Temp. | Humidity | Wind | Play_Tennis
1 | Sunny | Hot | High | Weak | No
2 | Sunny | Hot | High | Strong | No
8 | Sunny | Mild | High | Weak | No
9 | Sunny | Cool | Normal | Weak | Yes
11 | Sunny | Mild | Normal | Strong | Yes

Let the class label attributes be as follows:
C1 = Play_Tennis = Yes|Sunny = 2 Samples
C2 = Play_Tennis = No|Sunny = 3 Samples
Therefore, P(C1) = 2/5 and P(C2) = 3/5
(i) Entropy before split for the given database D1:
H(D1) = (2/5) log₂(5/2) + (3/5) log₂(5/3) = 0.971
(ii) Choosing Temperature as the Splitting Attribute

Temperature | C1 (Play_Tennis = Yes|Sunny) | C2 (Play_Tennis = No|Sunny) | Entropy H
Hot | 0 | 2 | 0
Mild | 1 | 1 | 1
Cool | 1 | 0 | 0

H(Temperature) = (2/5) × H(Hot) + (2/5) × H(Mild) + (1/5) × H(Cool) = (2/5) × 0 + (2/5) × 1 + (1/5) × 0 = 0.4
Gain(Temperature) = H(D1) - H(Temperature) = 0.971 - 0.4 = 0.571
(iii) Choosing Humidity as the Splitting Attribute

Humidity | C1 (Play_Tennis = Yes|Sunny) | C2 (Play_Tennis = No|Sunny) | Entropy H
High | 0 | 3 | 0
Normal | 2 | 0 | 0

H(Humidity) = (3/5) × H(High) + (2/5) × H(Normal) = 0
Gain(Humidity) = H(D1) - H(Humidity) = 0.971 - 0 = 0.971
(iv) Choosing Wind as the Splitting Attribute

Wind | C1 (Play_Tennis = Yes|Sunny) | C2 (Play_Tennis = No|Sunny) | Entropy H
Strong | 1 | 1 | 1
Weak | 1 | 2 | 0.918

H(Wind) = (2/5) × H(Strong) + (3/5) × H(Weak) = (2/5) × 1 + (3/5) × 0.918 = 0.951
Gain(Wind) = H(D1) - H(Wind) = 0.971 - 0.951 = 0.02
Summary :
Gain(Temperature|D1) = 0.571
Gain(Humidity|D1) = 0.971
Gain(Wind|D1) = 0.02
Humidity attribute has the highest gain; therefore, it is placed below Outlook = "Sunny".
Since Humidity has two possible values, the Humidity node has two branches (High, Normal).
From dataset D1, we find that when Humidity = High, Play_Tennis = No, and when Humidity = Normal, Play_Tennis = Yes.
Fig. P. 3.3.1(b)
• Now, consider Outlook = Overcast and count the number of tuples from the original dataset D. Let us denote it as D2.

Day | Outlook | Temp. | Humidity | Wind | Play_Tennis
3 | Overcast | Hot | High | Weak | Yes
7 | Overcast | Cool | Normal | Strong | Yes
12 | Overcast | Mild | High | Strong | Yes
13 | Overcast | Hot | Normal | Weak | Yes

• From dataset D2, we find that for all values of Outlook = "Overcast", Play_Tennis = Yes.
Fig. P. 3.3.1(c)

Now, consider Outlook = Rain and count the number of tuples from the original dataset D. Let us denote it as D3.

Day | Outlook | Temp. | Humidity | Wind | Play_Tennis
4 | Rain | Mild | High | Weak | Yes
5 | Rain | Cool | Normal | Weak | Yes
6 | Rain | Cool | Normal | Strong | No
10 | Rain | Mild | Normal | Weak | Yes
14 | Rain | Mild | High | Strong | No

Let the class label attributes be as follows:
C1 = Play_Tennis = Yes|Rain = 3 Samples
C2 = Play_Tennis = No|Rain = 2 Samples
Therefore, P(C1) = 3/5 and P(C2) = 2/5
(i) Entropy before split for the given database D3:
H(D3) = (3/5) log₂(5/3) + (2/5) log₂(5/2) = 0.971
(ii) Choosing Temperature as the Splitting Attribute

Temperature | C1 (Play_Tennis = Yes|Rain) | C2 (Play_Tennis = No|Rain) | Entropy H
Hot | 0 | 0 | 0
Mild | 2 | 1 | 0.918
Cool | 1 | 1 | 1

H(Temperature) = (0/5) × H(Hot) + (3/5) × H(Mild) + (2/5) × H(Cool) = (3/5) × 0.918 + (2/5) × 1 = 0.951
Gain(Temperature) = H(D3) - H(Temperature) = 0.971 - 0.951 = 0.02
(iii) Choosing Wind as the Splitting Attribute

Wind | C1 (Play_Tennis = Yes|Rain) | C2 (Play_Tennis = No|Rain) | Entropy H
Strong | 0 | 2 | 0
Weak | 3 | 0 | 0

H(Wind) = (2/5) × H(Strong) + (3/5) × H(Weak) = (2/5) × 0 + (3/5) × 0 = 0
Gain(Wind) = H(D3) - H(Wind) = 0.971 - 0 = 0.971

Summary :
Gain(Temperature|D3) = 0.02
Gain(Wind|D3) = 0.971
Wind attribute has the highest gain; therefore, it is placed below Outlook = "Rain".
Since Wind has two possible values, the Wind node has two branches (Strong, Weak).
From dataset D3, we find that when Wind = Strong, Play_Tennis = No, and when Wind = Weak, Play_Tennis = Yes.
Fig. P. 3.3.1(d) : Final decision tree
The decision tree can also be expressed in rule format as:
IF Outlook = Sunny AND Humidity = High THEN Play_Tennis = No
IF Outlook = Sunny AND Humidity = Normal THEN Play_Tennis = Yes
IF Outlook = Overcast THEN Play_Tennis = Yes
IF Outlook = Rain AND Wind = Strong THEN Play_Tennis = No
IF Outlook = Rain AND Wind = Weak THEN Play_Tennis = Yes
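As a quick cross-check of the worked example (my own sketch, not part of the text), the information gains for dataset D can be recomputed programmatically; small differences from the hand values come only from rounding intermediate entropies.

```python
# Sketch: re-compute the information gains from Ex. 3.3.1 to confirm Outlook at the root.
import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Wind, Play_Tennis) for days 1..14
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(attr_index):
    total = 0.0
    for value in set(r[attr_index] for r in rows):
        part = [r[-1] for r in rows if r[attr_index] == value]
        total += (len(part) / len(rows)) * entropy(part)
    return total

labels = [r[-1] for r in rows]
for i, attr in enumerate(attrs):
    print(f"Gain({attr}) = {entropy(labels) - split_entropy(i):.3f}")
# Prints approximately: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048,
# matching the hand calculation up to rounding.
```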
Ex. 3.3.2 : A simple example from the stock market involving only discrete ranges has Profit as the categorical class attribute with values {Up, Down}, and the training data is:

Age | Competition | Type | Profit
Old | Yes | Software | Down
Old | No | Software | Down
Old | No | Hardware | Down
Mid | Yes | Software | Down
Mid | Yes | Hardware | Down
Mid | No | Hardware | Up
Mid | No | Software | Up
New | Yes | Software | Up
New | No | Hardware | Up
New | No | Software | Up

Apply the decision tree algorithm and show the generated rules.
Soln. :
Let the class label attributes be as follows:
C1 = Profit = Down = 5 Samples
C2 = Profit = Up = 5 Samples
Therefore, P(C1) = 5/10 and P(C2) = 5/10
(i) Entropy before split for the given database D:
H(D) = (5/10) log₂(10/5) + (5/10) log₂(10/5) = 0.5 + 0.5 = 1
(ii) Choosing Age as the Splitting Attribute
H(Age) = (3/10) × H(Old) + (4/10) × H(Mid) + (3/10) × H(New) = (3/10) × 0 + (4/10) × 1 + (3/10) × 0 = 0.4
Gain(Age) = H(D) - H(Age) = 1 - 0.4 = 0.6
(iii) Choosing Competition as the Splitting Attribute

Competition | C1 (Profit = Down) | C2 (Profit = Up) | Entropy H
Yes | 3 | 1 | 0.8113
No | 2 | 4 | 0.9183

H(Competition) = (4/10) × H(Yes) + (6/10) × H(No) = (4/10) × 0.8113 + (6/10) × 0.9183 = 0.8755
Gain(Competition) = H(D) - H(Competition) = 1 - 0.8755 = 0.1245
(iv) Choosing Type as the Splitting Attribute

Type | C1 (Profit = Down) | C2 (Profit = Up) | Entropy H
Software | 3 | 3 | 1
Hardware | 2 | 2 | 1

H(Type) = (6/10) × H(Software) + (4/10) × H(Hardware) = (6/10) × 1 + (4/10) × 1 = 1
Gain(Type) = H(D) - H(Type) = 1 - 1 = 0

Summary :
Gain(Age|D) = 0.6
Gain(Competition|D) = 0.1245
Gain(Type|D) = 0
Age attribute has the highest gain; therefore, it is used as the decision attribute in the root node.
Since Age has three possible values, the root node has three branches (Old, Mid, New).
From the dataset we find that,
IF Age = Old THEN Profit = Down
IF Age = Mid THEN Profit = Down OR Profit = Up
IF Age = New THEN Profit = Up
Fig. P. 3.3.2(a) : Root node Age with branches Old, Mid and New
Now, consider Age = Mid and count the number of tuples from the original dataset D. Let us denote it as D1.

Age | Competition | Type | Profit
Mid | Yes | Software | Down
Mid | Yes | Hardware | Down
Mid | No | Hardware | Up
Mid | No | Software | Up

Let the class label attributes be as follows:
C1 = Profit = Down|Mid = 2 Samples
C2 = Profit = Up|Mid = 2 Samples
Therefore, P(C1) = 2/4 and P(C2) = 2/4
(i) Entropy before split for the given database D1:
H(D1) = (2/4) log₂(4/2) + (2/4) log₂(4/2) = 0.5 + 0.5 = 1
(ii) Choosing Competition as the Splitting Attribute
H(Competition) = (2/4) × H(Yes) + (2/4) × H(No) = (2/4) × 0 + (2/4) × 0 = 0
Gain(Competition) = H(D1) - H(Competition) = 1 - 0 = 1
(iii) Choosing Type as the Splitting Attribute
H(Type) = (2/4) × H(Software) + (2/4) × H(Hardware) = (2/4) × 1 + (2/4) × 1 = 1
Gain(Type) = H(D1) - H(Type) = 1 - 1 = 0
Summary :
Gain(Competition|D1) = 1
Gain(Type|D1) = 0
Competition attribute has the highest gain; therefore, it is placed below Age = "Mid".
Since Competition has two possible values, the Competition node has two branches (Yes, No).
From dataset D1, we find that when Competition = Yes, Profit = Down, and when Competition = No, Profit = Up.
Fig. P. 3.3.2(b) : Final decision tree
The decision tree can also be expressed in rule format as:
IF Age = Old THEN Profit = Down
IF Age = Mid AND Competition = Yes THEN Profit = Down
IF Age = Mid AND Competition = No THEN Profit = Up
IF Age = New THEN Profit = Up
Ex. 3.3.3 : Using the following training data set, create a classification model using a decision tree and draw the final tree.

Age | Own House
Young | Yes
Medium | Yes
Young | Rented
Medium | Yes
Medium | Yes
Young | Yes
Old | Yes
Medium | Rented
Medium | Rented
Old | Rented
Old | Rented
Young | Yes
(Mu-t
MU-New Syilabus wes academic year 22-23)
‘Own House = Rented = 5 Samples
‘Therefore, P(C1) = 7/12 and P(C2) = 5/2
() Entropy before split for the given database D:
HD) = z toes (5)
2H) = Fploe, #2 + F tops"? = 0.454 +0.526
= 0980
(ii) Choosing Income as the Splitting Attribute

Income | C1 (Own House = Yes) | C2 (Own House = Rented) | Entropy H
Very High | 2 | 0 | 0
High | 4 | 0 | 0
Medium | 1 | 2 | 0.918
Low | 0 | 3 | 0

H(Income) = (2/12) × H(Very High) + (4/12) × H(High) + (3/12) × H(Medium) + (3/12) × H(Low)
= (2/12) × 0 + (4/12) × 0 + (3/12) × 0.918 + (3/12) × 0 = 0.229
Gain(Income) = H(D) - H(Income) = 0.980 - 0.229 = 0.751
(iii) Choosing Age as the Splitting Attribute
H(Age) = (4/12) × H(Young) + (5/12) × H(Medium) + (3/12) × H(Old)
= (4/12) × 0.811 + (5/12) × 0.971 + (3/12) × 0.918 = 0.904
Gain(Age) = H(D) - H(Age) = 0.980 - 0.904 = 0.076

Summary :
Gain(Income|D) = 0.751
Gain(Age|D) = 0.076

Income attribute has the highest gain; therefore, it is used as the decision attribute in the root node.
Since Income has four possible values, the root node has four branches (Very High, High, Medium, Low).
Fig. P. 3.3.3(a) : Root node Income with branches Very High, High, Medium and Low
Now, consider Income = Very High and count the number of tuples from the original dataset D. Let us denote it as D1.
Since both the tuples have class label Own House = Yes, we directly give "Yes" as the class label below Income = Very High.
Fig. P. 3.3.3(b)
Now, consider Income = High and count the number of tuples from the original dataset D. Let us denote it as D2.
Since all the four tuples have class label Own House = Yes, we directly give "Yes" as the class label below Income = High.
Fig. P. 3.3.3(c)
Now, consider Income = Low and count the number of tuples from the original dataset D. Let us denote it as D3.
Since all the three tuples have class label Own House = Rented, we directly give "Rented" as the class label below Income = Low.
Fig. P. 3.3.3(d)
Now, consider Income = Medium and count the number of tuples from the original dataset D. Let us denote it as D4.
We find that,
IF Income = Medium AND Age = Young THEN Own House = Yes
IF Income = Medium AND Age = Medium THEN Own House = Rented
IF Income = Medium AND Age = Old THEN Own House = Rented
Fig. P. 3.3.3(e) : Final decision tree
The decision tree can also be expressed in rule format as:
IF Income = Very High THEN Own House = Yes
IF Income = High THEN Own House = Yes
IF Income = Low THEN Own House = Rented
IF Income = Medium AND Age = Young THEN Own House = Yes
IF Income = Medium AND Age = Medium THEN Own House = Rented
IF Income = Medium AND Age = Old THEN Own House = Rented

Ex. 3.3.4 : The following table consists of training data from an employee database. The data have been generalized. For example, "31...35" for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

department | status | age | salary | count
Sales | Senior | 31...35 | 46K...50K | 30
Sales | Junior | 26...30 | 26K...30K | 40
Sales | Junior | 31...35 | 31K...35K | 40
Systems | Junior | 21...25 | 46K...50K | 20
Systems | Senior | 31...35 | 66K...70K | 5
Systems | Junior | 26...30 | 46K...50K | 3
Systems | Senior | 41...45 | 66K...70K | 3
Marketing | Senior | 36...40 | 46K...50K | 10
Marketing | Junior | 31...35 | 41K...45K | 4
Secretary | Senior | 46...50 | 36K...40K | 4
Secretary | Junior | 26...30 | 26K...30K | 6

Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
Soln. :
(a) The basic decision tree algorithm should be modified as follows to take into consideration the count of each generalized data tuple:
• The count of each tuple must be integrated into the calculation of the attribute selection measure (such as information gain), as shown in the sketch below.
• Take the count into consideration to determine the most common class among the tuples.
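A minimal sketch of the first modification (my own illustration, under the assumption that each generalized row simply carries a weight equal to its count): the class totals used inside the entropy calculation become summed counts rather than row frequencies, and the same totals drive majority voting.

```python
# Sketch: entropy over generalized rows, weighted by their 'count' column.
import math
from collections import defaultdict

def weighted_entropy(rows, class_idx, count_idx):
    """Entropy where each generalized row contributes 'count' tuples, not one tuple."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[class_idx]] += row[count_idx]
    n = sum(totals.values())
    return -sum((c / n) * math.log2(c / n) for c in totals.values() if c > 0)

# Rows in the spirit of the generalized table: (department, status, count)
rows = [("sales", "junior", 80), ("sales", "senior", 30),
        ("systems", "junior", 23), ("systems", "senior", 8)]
print(weighted_entropy(rows, class_idx=1, count_idx=2))  # entropy of 'status' weighted by count
```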
(b) Use the ID3 algorithm to construct a decision tree from the given data.
Let the class label attributes be as follows:
C1 = status = junior = 113 Samples
C2 = status = senior = 52 Samples
Therefore, P(C1) = 113/165 and P(C2) = 52/165
(i) Entropy before split for the given database D:
H(D) = (113/165) log₂(165/113) + (52/165) log₂(165/52) = 0.3740 + 0.5250 = 0.899
(ii) Choosing department as the Splitting Attribute

department | C1 (status = junior) | C2 (status = senior) | Entropy H
Sales | 80 | 30 | 0.8454
Systems | 23 | 8 | 0.8238
Marketing | 4 | 10 | 0.8631
Secretary | 6 | 4 | 0.9709

H(department) = (110/165) × H(Sales) + (31/165) × H(Systems) + (14/165) × H(Marketing) + (10/165) × H(Secretary)
= (110/165) × 0.8454 + (31/165) × 0.8238 + (14/165) × 0.8631 + (10/165) × 0.9709 = 0.8504
Gain(department) = H(D) - H(department) = 0.899 - 0.8504 = 0.0486
(iii) Choosing age as the Splitting Attribute

age | C1 (status = junior) | C2 (status = senior) | Entropy H
21...25 | 20 | 0 | 0
26...30 | 49 | 0 | 0
31...35 | 44 | 35 | 0.9906
36...40 | 0 | 10 | 0
41...45 | 0 | 3 | 0
46...50 | 0 | 4 | 0

H(age) = (20/165) × H(21...25) + (49/165) × H(26...30) + (79/165) × H(31...35) + (10/165) × H(36...40) + (3/165) × H(41...45) + (4/165) × H(46...50)
= (20/165) × 0 + (49/165) × 0 + (79/165) × 0.9906 + (10/165) × 0 + (3/165) × 0 + (4/165) × 0 = 0.4743
Gain(age) = H(D) - H(age) = 0.899 - 0.4743 = 0.4247
(iv) Choosing salary as the Splitting Attribute
H(salary) = (46/165) × H(26K...30K) + (40/165) × H(31K...35K) + (4/165) × H(36K...40K) + (4/165) × H(41K...45K) + (63/165) × H(46K...50K) + (8/165) × H(66K...70K)
= (46/165) × 0 + (40/165) × 0 + (4/165) × 0 + (4/165) × 0 + (63/165) × 0.9468 + (8/165) × 0 = 0.3615
Gain(salary) = H(D) - H(salary) = 0.899 - 0.3615 = 0.5375

Summary :
Gain(department|D) = 0.0486
Gain(age|D) = 0.4247
Gain(salary|D) = 0.5375

salary attribute has the highest gain; therefore, it is used as the decision attribute in the root node.
Since salary has six possible values, the root node has six branches (26K...30K, 31K...35K, 36K...40K, 41K...45K, 46K...50K, 66K...70K).
From the dataset D, we find that:
IF salary = 26K...30K THEN status = junior
IF salary = 31K...35K THEN status = junior
IF salary = 36K...40K THEN status = senior
IF salary = 41K...45K THEN status = junior
IF salary = 46K...50K THEN status = junior OR status = senior
IF salary = 66K...70K THEN status = senior
Only one branch, i.e. 46K...50K, does not give a single class label.
Fig. P. 3.3.4(a) : Root node salary with six branches
Now, consider salary = 46K...50K and count the number of tuples from the original dataset D. Let us denote it as D1.

department | status | age | salary | count
Sales | Senior | 31...35 | 46K...50K | 30
Systems | Junior | 21...25 | 46K...50K | 20
Systems | Junior | 26...30 | 46K...50K | 3
Marketing | Senior | 36...40 | 46K...50K | 10

Let the class label attributes be as follows:
C1 = status = junior|salary = 46K...50K = 23 Samples
C2 = status = senior|salary = 46K...50K = 40 Samples
Therefore, P(C1) = 23/63 and P(C2) = 40/63
(i) Entropy before split for the given database D1:
H(D1) = (23/63) log₂(63/23) + (40/63) log₂(63/40) = 0.9468
(ii) Choosing department as the Splitting Attribute

department | C1 (status = junior) | C2 (status = senior) | Entropy H
Sales | 0 | 30 | 0
Systems | 23 | 0 | 0
Marketing | 0 | 10 | 0

H(department) = (30/63) × H(Sales) + (23/63) × H(Systems) + (10/63) × H(Marketing) = 0
Gain(department) = H(D1) - H(department) = 0.9468 - 0 = 0.9468
(iii) Choosing age as the Splitting Attribute

age | C1 (status = junior) | C2 (status = senior) | Entropy H
21...25 | 20 | 0 | 0
26...30 | 3 | 0 | 0
31...35 | 0 | 30 | 0
36...40 | 0 | 10 | 0

H(age) = (20/63) × H(21...25) + (3/63) × H(26...30) + (30/63) × H(31...35) + (10/63) × H(36...40) = 0
Gain(age) = H(D1) - H(age) = 0.9468 - 0 = 0.9468
Both the attributes have the same gain; therefore, we select one of the attributes arbitrarily and place it below salary = "46K...50K".
3.4 BAYESIAN CLASSIFICATION

Ex. 3.4.1 : Using the Buys_Computer training data (14 tuples, of which 9 have Buys_Computer = Yes and 5 have Buys_Computer = No), classify the unknown sample X = (Age = "<=30", Income = "Medium", Student = "Yes", Credit_rating = "Fair") using Naive Bayesian classification.
Soln. :
Let the class label attributes be as follows:
C1 = Buys_Computer = Yes = 9 Samples
C2 = Buys_Computer = No = 5 Samples
Therefore, P(C1) = 9/14 and P(C2) = 5/14
Let event X1 be Age = "<=30",
event X2 be Income = "Medium",
event X3 be Student = "Yes", and
event X4 be Credit_rating = "Fair".
To compute : P(X|C1)
From Naive Bayesian Classification,
P(X|C1) = Π (k = 1 to 4) P(Xk|C1)
P(X|C1) = P(X1|C1) P(X2|C1) P(X3|C1) P(X4|C1)
P(X1|C1) = P(Age = "<=30" | Buys_Computer = Yes) = 2/9
P(X2|C1) = P(Income = "Medium" | Buys_Computer = Yes) = 4/9
P(X3|C1) = P(Student = "Yes" | Buys_Computer = Yes) = 6/9
P(X4|C1) = P(Credit_rating = "Fair" | Buys_Computer = Yes) = 6/9
Therefore, P(X|C1) = (2/9) × (4/9) × (6/9) × (6/9) = 0.044
P(X|C1) P(C1) = 0.044 × 9/14 = 0.028    ...(A)
To compute : P(X|C2)
From Naive Bayesian Classification,
P(X|C2) = Π (k = 1 to 4) P(Xk|C2)
P(X|C2) = P(X1|C2) P(X2|C2) P(X3|C2) P(X4|C2)
P(X1|C2) = P(Age = "<=30" | Buys_Computer = No) = 3/5
P(X2|C2) = P(Income = "Medium" | Buys_Computer = No) = 2/5
P(X3|C2) = P(Student = "Yes" | Buys_Computer = No) = 1/5
P(X4|C2) = P(Credit_rating = "Fair" | Buys_Computer = No) = 2/5
Therefore, P(X|C2) = (3/5) × (2/5) × (1/5) × (2/5) = 0.019
P(X|C2) P(C2) = 0.019 × 5/14 = 0.007    ...(B)
Naive Bayesian classification will assign a sample X to class Ci if and only if,
P(Ci|X) > P(Cj|X)
i.e. P(X|Ci) P(Ci) / P(X) > P(X|Cj) P(Cj) / P(X)
P(X) is constant for both classes C1 and C2.
Therefore, P(X|Ci) P(Ci) > P(X|Cj) P(Cj)    ...(C)
From (A), (B) and (C),
P(X|C1) P(C1) > P(X|C2) P(C2)
We conclude that the unknown sample X = (Age = "<=30", Income = "Medium", Student = "Yes", Credit_rating = "Fair") belongs to class Buys_Computer = Yes.
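The same comparison can be scripted in a few lines. The sketch below is my own illustration (not from the text); it reuses the priors and conditional probabilities worked out above.

```python
# Sketch: Naive Bayes comparison for Ex. 3.4.1 using the probabilities derived above.
import math

priors = {"Yes": 9 / 14, "No": 5 / 14}
# P(attribute value | class) for X = (Age<=30, Income=Medium, Student=Yes, Credit=Fair)
likelihoods = {
    "Yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "No":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
print(scores)                                           # {'Yes': ~0.028, 'No': ~0.007}
print("Predicted class:", max(scores, key=scores.get))  # Buys_Computer = Yes
```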
Ex. 3.4.2 : Apply the statistical (Naive Bayesian) based algorithm to obtain the actual probabilities of each event in order to classify X = (Dept = "Systems", Status = "Junior", Age = "26-30"). Use the following table:

Dept | Status | Age | Salary | Count
Sales | Senior | 31-35 | 46k-50k | 30
Sales | Junior | 26-30 | 26k-30k | 40
Sales | Junior | 31-35 | 31k-35k | 40
Systems | Junior | 21-25 | 46k-50k | 20
Systems | Senior | 31-35 | 66k-70k | 5
Systems | Junior | 26-30 | 46k-50k | 5
Systems | Senior | 41-45 | 66k-70k | 5
Soln. :
The class label attribute is Salary.
C1 = Salary between 46k-50k = 55 Samples
C2 = Salary between 26k-30k = 40 Samples
C3 = Salary between 31k-35k = 40 Samples
C4 = Salary between 66k-70k = 10 Samples
Therefore, P(C1) = 55/145, P(C2) = 40/145, P(C3) = 40/145 and P(C4) = 10/145
Let event X1 be Dept = "Systems",
event X2 be Status = "Junior", and
event X3 be Age = "26-30".
To compute : P(X|C1)
From Naive Bayesian Classification,
P(X|C1) = Π (k = 1 to 3) P(Xk|C1)
P(X|C1) = P(X1|C1) P(X2|C1) P(X3|C1)
P(X1|C1) = P(Dept = Systems | Salary 46k-50k) = 25/55
P(X2|C1) = P(Status = Junior | Salary 46k-50k) = 25/55
P(X3|C1) = P(Age = 26-30 | Salary 46k-50k) = 5/55
Therefore, P(X|C1) = (25/55) × (25/55) × (5/55) = 0.019
P(X|C1) P(C1) = 0.019 × 55/145 = 0.007    ...(A)
To compute : P(X|C2)
From Naive Bayesian Classification,
P(X|C2) = Π (k = 1 to 3) P(Xk|C2)
P(X|C2) = P(X1|C2) P(X2|C2) P(X3|C2)
P(X1|C2) = P(Dept = Systems | Salary 26k-30k) = 0/40
P(X2|C2) = P(Status = Junior | Salary 26k-30k) = 40/40
P(X3|C2) = P(Age = 26-30 | Salary 26k-30k) = 40/40
Therefore, P(X|C2) = (0/40) × (40/40) × (40/40) = 0
P(X|C2) P(C2) = 0 × 40/145 = 0    ...(B)
To compute : P(X|C3)
From Naive Bayesian Classification,
P(X|C3) = Π (k = 1 to 3) P(Xk|C3)
P(X|C3) = P(X1|C3) P(X2|C3) P(X3|C3)
P(X1|C3) = P(Dept = Systems | Salary 31k-35k) = 0/40
P(X2|C3) = P(Status = Junior | Salary 31k-35k) = 40/40
P(X3|C3) = P(Age = 26-30 | Salary 31k-35k) = 0/40
Therefore, P(X|C3) = (0/40) × (40/40) × (0/40) = 0
P(X|C3) P(C3) = 0 × 40/145 = 0    ...(C)
To compute : P(X|C4)
From Naive Bayesian Classification,
P(X|C4) = Π (k = 1 to 3) P(Xk|C4)
P(X|C4) = P(X1|C4) P(X2|C4) P(X3|C4)
P(X1|C4) = P(Dept = Systems | Salary 66k-70k) = 10/10
P(X2|C4) = P(Status = Junior | Salary 66k-70k) = 0/10
P(X3|C4) = P(Age = 26-30 | Salary 66k-70k) = 0/10
Therefore, P(X|C4) = (10/10) × (0/10) × (0/10) = 0
P(X|C4) P(C4) = 0 × 10/145 = 0    ...(D)
Naive Bayesian classification will assign a sample X to class Ci if and only if,
P(Ci|X) > P(Cj|X)
i.e. P(X|Ci) P(Ci) / P(X) > P(X|Cj) P(Cj) / P(X)
P(X) is constant for all the classes.
Therefore, P(X|Ci) P(Ci) > P(X|Cj) P(Cj)    ...(E)
From (A), (B), (C), (D) and (E),
P(X|C1) P(C1) > P(X|C2) P(C2) = P(X|C3) P(C3) = P(X|C4) P(C4)
We conclude that the unknown sample X = (Dept = "Systems", Status = "Junior", Age = "26-30") belongs to class C1 = Salary between 46k-50k.
Ex. 3.4.3 : Given the training data for height classification below, classify the tuple t = (Gender = M, Height in the range (1.9, 2.0]) using Naive Bayes classification.
Name | Gender | Height | Output
Kiran | F | 1.6 m | Short
Jatin | M | 2 m | Tall
Madhuri | F | 1.9 m | Medium
Manisha | F | 1.88 m | Medium
Shilpa | F | 1.7 m | Short
Bobby | M | 1.85 m | Medium
Kavita | F | 1.6 m | Short
Dinesh | M | 1.1 m | Short
Rahul | M | 2.2 m | Tall
Shree | M | 2.1 m | Tall
Divya | F | 1.8 m | Medium
Tushar | M | 1.95 m | Medium
Kim | F | 1.9 m | Medium
Aarti | F | 1.8 m | Medium
Rajashree | F | 1.75 m | Medium

Soln. : From the above table, it is clear that there are 4 tuples classified as Short, 8 tuples classified as Medium and 3 tuples classified as Tall.
We divide the height attribute into six ranges as given below:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0] and (2.0, ∞)
From the given training data, we estimate:
P(Short) = 4/15
P(Medium) = 8/15
P(Tall) = 3/15
The unseen tuple is t = (Gender = M, Height in (1.9, 2.0]).
P(t|Short) × P(Short) = P(M|Short) × P((1.9, 2.0]|Short) × P(Short) = (1/4) × (0/4) × (4/15) = 0
P(t|Medium) × P(Medium) = P(M|Medium) × P((1.9, 2.0]|Medium) × P(Medium) = (2/8) × (1/8) × (8/15) = 0.0167
P(t|Tall) × P(Tall) = P(M|Tall) × P((1.9, 2.0]|Tall) × P(Tall) = (3/3) × (1/3) × (3/15) = 0.067
Based on the above probabilities, we classify the new tuple as Tall because it has the highest probability.
Ex. 3.4.4 : Given the training data for credit transactions below, classify a new transaction with (Income = Medium and Credit = Good) using Naive Bayes classification.

No. | Income | Credit | Decision
1 | Very high | Excellent | AUTHORIZE
2 | High | Good | AUTHORIZE
3 | Medium | Excellent | AUTHORIZE
4 | High | Good | AUTHORIZE
5 | Very high | Good | AUTHORIZE
6 | Medium | Excellent | AUTHORIZE
7 | High | Bad | REQUEST ID
8 | Medium | Bad | REQUEST ID
9 | High | Bad | REJECT
10 | Low | Bad | CALL POLICE
[abrech.neoPublcaions.A SACHIN SHAH VentureWW
se are 6 uplesclasifed as AUTHORIZE, 2 tops cased sy
ified us CALL POLICE, EQUEST
(REQUEST ID) = 2/10
P{CALL POLICE) = 1/10
rots sca with gion nab bow
Tate [va oy
a mer] om | one | Meas | ETS
° ote a a
vam [oe spe | a
[a n Se a
nf a gop
ioe 7 come emoalas 2 falar
fa n pee
a Po cman
Cel 3 $8 casera art
Pal a co [eeoeess 2 efecealey
Ta a Ta
Fic ogowe tape = ncome = Medium, Crei= Good
1a AUTHORIZE) =P dwine» Mel AUTHORIZ x rt = CoetAUTHORZD
TIAUTHORIZE)
mEQID) xPMREQID) = POncome = MediumiREQ ID) x P(Credi = GoodIREQID) x PREQIP)
1
=o
POREIECT) x POREJECT) = Ptlocome = Meum! REJECT) xP (Crit = GoodREJECT) x PRET
Pau
2
xox% =0
= oxox =0
incon = MetiniCALL POLICE xP (Cedi GooatcALL POLICE
P(CALLFOLCD
P(UCALL POLICE) x P(CALL POLICE) =
3.5 RULE BASED CLASSIFICATION
• IF-THEN rules can be used for classification. Let us consider a rule R1:
R1 : IF age = youth AND student = yes THEN buys_computer = yes
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• Rules can also be extracted from a decision tree:
1. One rule is created for each path from the root to a leaf node.
2. To form a rule antecedent, each splitting criterion along the path is logically ANDed.
3. The leaf node holds the class prediction, forming the rule consequent.
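Applying such IF-THEN rules is mechanical: a rule fires when every condition in its antecedent holds for the tuple. The sketch below is my own illustration (not from the text); the second rule is hypothetical, added only to show how several rules are tried in order.

```python
# Sketch: applying IF-THEN classification rules to a tuple.
rules = [
    ({"age": "youth", "student": "yes"}, "buys_computer = yes"),   # R1 from the text
    ({"age": "middle_aged"}, "buys_computer = yes"),               # hypothetical extra rule
]

def classify(tuple_, rules, default="unknown"):
    for antecedent, consequent in rules:
        if all(tuple_.get(attr) == value for attr, value in antecedent.items()):
            return consequent            # rule antecedent satisfied -> fire the consequent
    return default                       # no rule covers the tuple

print(classify({"age": "youth", "student": "yes", "income": "medium"}, rules))
```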
3.6 ACCURACY AND ERROR MEASURES
• In data mining, classification involves the problem of predicting which category or class a new observation belongs to.
• The derived model (classifier) is based on the analysis of a set of training data where each data tuple is given a class label.
• The trained model (classifier) is then used to predict the class label for new, unseen data.
• To understand classification metrics, one of the most important concepts is the confusion matrix.
• The different classifier evaluation measures are discussed below.
1. Confusion Matrix : It is a useful tool for analyzing how well your classifier can recognize tuples of different classes. It is also called a contingency table.
Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
The 2×2 confusion matrix is denoted as:

 | Predicted Class 1 | Predicted Class 0
Actual Class 1 | TP | FN
Actual Class 0 | FP | TN

• TP (True Positives) represents the values which are predicted to be true and are actually true.
• TN (True Negatives) represents the values which are predicted to be false and are actually false.
• FP (False Positives) represents the values which are predicted to be true, but are actually false. Also called Type I error.
• FN (False Negatives) represents the values which are predicted to be false, but are actually true. Also called Type II error.
• Sensitivity : Also called the true positive (recognition) rate; it is the proportion of positive tuples that are correctly identified.
Sensitivity = TP / (TP + FN)
• Specificity : Also called the true negative rate; it is the proportion of negative tuples that are correctly identified.
Specificity = TN / (TN + FP)
• Accuracy : The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. It is also referred to as the overall recognition rate of the classifier.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision : It is the measure of exactness. It determines what percentage of tuples labelled as positive are actually positive.
Precision = TP / (TP + FP)
• Recall : It is the measure of completeness. It determines what percentage of positive tuples are labelled as positive.
Recall = TP / (TP + FN)
• F1 Score : It is the harmonic mean of precision and recall. It gives equal weight to precision and recall. It is also called the F-measure or F1 score.
F1 = (2 × Precision × Recall) / (Precision + Recall)
• Fβ Score : It is the weighted measure of precision and recall. It assigns β times as much weight to recall as to precision. Commonly used Fβ measures are F2 (which weights recall twice as much as precision) and F0.5 (which weights precision twice as much as recall).
Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
• Error Rate : It is also called the misclassification rate of a classifier and is simply (1 − Accuracy).
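All of these measures follow directly from the four confusion-matrix counts. The sketch below (my own, with made-up counts) computes them for a single 2×2 matrix.

```python
# Sketch: classifier evaluation measures from a 2x2 confusion matrix (hypothetical counts).
TP, FP, FN, TN = 90, 10, 15, 885

sensitivity = TP / (TP + FN)                     # recall / true positive rate
specificity = TN / (TN + FP)                     # true negative rate
accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = sensitivity
f1          = 2 * precision * recall / (precision + recall)
error_rate  = 1 - accuracy

beta = 2                                         # F2 weights recall more than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} F2={f_beta:.3f} error_rate={error_rate:.3f}")
```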
(rece bcs. SACHIN SHAH Vertue>|
Data Warehousing and Mining (MU _—(Cassicatin).Page ne
» EVALUATING THE ACCURACY OF A ‘The data is trained on the training set ang,
‘CLASSIFIER
Besides the evaluation measure discussed above, other
techniques to evaluate the accuracy of a classifier are
discussed below.
3.7.1 Holdout
• In this method, the given (usually large) dataset is randomly divided into three subsets:
1. Training set : a subset of the dataset used to build predictive models.
2. Validation set : a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
3. Test set (or unseen examples) : a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
• Typically, two-thirds of the data are allocated to the training set and the remaining one-third is allocated to the test set.
Fig. 3.7.1 : Holdout (the available data is partitioned into a training set, a validation holdout sample and a testing holdout sample)
3.7.2 Random Subsampling
• It is a variation of the holdout method; the holdout method is repeated k times.
• It involves randomly splitting the data into a training set and a test set.
• The model is trained on the training set and the mean square error (MSE) is obtained from the predictions on the test set.
• A single split is not recommended because the MSE would depend on the split: a new split can give a new MSE, and then you don't know which one to trust.
• The overall accuracy is therefore calculated by taking the average of the accuracies obtained from each iteration.
Fig. 3.7.2 : Random Subsampling (the total set of examples is split differently in each experiment)
3.7.3 Cross Validation
• When only a limited amount of data is available, to achieve an unbiased estimate of the model performance we use k-fold cross-validation.
• In k-fold cross-validation, we divide the data into k subsets of equal size.
• We build models k times, each time leaving out one of the subsets from training and using it as the test set.
• If k equals the sample size, this is called "leave-one-out" cross-validation.
3.7.4 Bootstrapping
• Bootstrapping is a technique used to make estimates from data by taking an average of the estimates from smaller data samples.
• The bootstrap method involves iteratively resampling the dataset with replacement.
• Instead of estimating our statistic only once on the complete data, we can estimate it many times on repeated re-samplings (with replacement) of the original sample.
• Repeating this re-sampling multiple times allows us to obtain a vector of estimates.
• We can then compute variance, expected value, the empirical distribution, and other relevant statistics from these estimates.
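A minimal bootstrap sketch (my own, with made-up data): resample with replacement many times, compute the statistic on each resample, and summarize the resulting estimates.

```python
# Sketch: bootstrap estimate of the mean and its variability (hypothetical data).
import random
import statistics

random.seed(0)
data = [12, 15, 14, 10, 18, 21, 13, 16, 19, 11]          # original sample

estimates = []
for _ in range(1000):                                     # iteratively resample with replacement
    resample = [random.choice(data) for _ in data]        # same size as the original sample
    estimates.append(statistics.mean(resample))           # statistic computed on the resample

print("bootstrap mean:", statistics.mean(estimates))
print("bootstrap variance of the mean:", statistics.variance(estimates))
```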