Nguyen Quoc Khanh – ITITIU18186
Introduction to Data Mining
Lab 3 – Simple Classifiers
3.1. Simplicity first!
In the third class we learn how to run some data mining algorithms on various datasets using Weka (see the lecture of Class 3 by Ian H. Witten [1]¹).
In this section we look at how OneR ("one attribute does all the work") behaves. Open weather.nominal.arff, run OneR, and look at the classifier model: what does it show?
¹ http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
- Remarks: Out of the 4 attributes, OneR selects outlook and builds one rule branch per outlook value. Evaluated on the training set, this rule classifies 10/14 instances correctly. Under 10-fold cross-validation, however, only 6 of the 14 instances are classified correctly, giving an accuracy of just 42.86%.
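The same run can be reproduced outside the Explorer with the Weka Java API. The sketch below is a minimal example, assuming weka.jar is on the classpath and that the ARFF file sits in a local data/ directory (the path and the class name are illustrative); it prints the OneR rule, the training-set accuracy, and the 10-fold cross-validation accuracy discussed above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRWeather {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the path is an assumption about the local setup.
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class attribute = "play"

        // Build OneR on all 14 instances and print the single-attribute rule.
        OneR oneR = new OneR();
        oneR.buildClassifier(data);
        System.out.println(oneR);

        // Accuracy on the training data itself (10/14 in the remarks above).
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(oneR, data);
        System.out.printf("Training accuracy: %.2f%%%n", trainEval.pctCorrect());

        // 10-fold cross-validation (6/14 = 42.86% in the remarks above).
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new OneR(), data, 10, new Random(1));
        System.out.printf("Cross-validation accuracy: %.2f%%%n", cvEval.pctCorrect());
    }
}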
Use OneR to build a classifier for several datasets. Compared with ZeroR, how does OneR perform?
Dataset               OneR accuracy (%)   ZeroR accuracy (%)
weather.nominal.arff  42.86               64.29
supermarket.arff      67.21               63.71
glass.arff            57.94               35.51
vote.arff             95.63               61.38
iris.arff             92.00               33.33
diabetes.arff         71.48               65.10
labor.arff            71.93               64.91
soybean.arff          39.97               13.47
breast-cancer.arff    65.73               70.28
credit-g.arff         66.10               70.00
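The table above can also be filled in programmatically. This is a rough sketch under the same assumptions (weka.jar on the classpath, ARFF files in a local data/ directory); it loops over the datasets and prints the 10-fold cross-validation accuracy of OneR and ZeroR side by side. Exact figures may differ slightly from the Explorer depending on the random seed.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRVersusZeroR {
    public static void main(String[] args) throws Exception {
        String[] files = { "weather.nominal.arff", "supermarket.arff", "glass.arff",
                           "vote.arff", "iris.arff", "diabetes.arff", "labor.arff",
                           "soybean.arff", "breast-cancer.arff", "credit-g.arff" };
        for (String f : files) {
            Instances data = DataSource.read("data/" + f);  // assumed location
            data.setClassIndex(data.numAttributes() - 1);   // class = last attribute
            System.out.printf("%-22s OneR: %6.2f%%   ZeroR: %6.2f%%%n",
                    f, crossValidate(new OneR(), data), crossValidate(new ZeroR(), data));
        }
    }

    // 10-fold cross-validation accuracy, matching the Explorer's default setup.
    static double crossValidate(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}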
3.2. Overfitting
What is "overfitting"? Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship, for example because the model is too complex, the data contain noise or errors, or an unsuitable fitting criterion is applied; the result is poor prediction on new data. To avoid it, use cross-validation, pruning, and similar techniques. [ref: http://en.wikipedia.org/wiki/Overfitting]
Follow the instructions in [1] and run OneR on the weather.numeric and diabetes datasets.
Write down the results in the following table (10-fold cross-validation is used):
Dataset                              OneR accuracy (%)   ZeroR accuracy (%)
weather.numeric                      42.86               64.29
weather.numeric (without outlook)    50.00               64.29
diabetes                             71.48               65.10
diabetes (minBucketSize = 1)         57.16               None

(Classifier models: see the Weka output for each run.)
minBucketSize? The minimum bucket size OneR uses when discretizing numeric attributes (default 6); it was set to 1 in the last row above.
Remark? With minBucketSize = 1, OneR builds a very fine-grained rule that fits the training data closely but overfits: cross-validation accuracy drops from 71.48% to 57.16%, below even ZeroR's 65.10%.
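The overfitting effect of minBucketSize can be demonstrated with a short sketch (same assumptions as before: weka.jar on the classpath, diabetes.arff in a local data/ directory). It compares the default bucket size of 6 with a bucket size of 1, reporting both training-set and cross-validation accuracy.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRBucketSize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        for (int bucket : new int[] { 6, 1 }) {  // default vs. overfitting setting
            // Training-set accuracy: grows as the rule gets more specific.
            OneR oneR = new OneR();
            oneR.setMinBucketSize(bucket);
            oneR.buildClassifier(data);
            Evaluation train = new Evaluation(data);
            train.evaluateModel(oneR, data);

            // Cross-validation accuracy: drops when the rule overfits.
            OneR cvModel = new OneR();
            cvModel.setMinBucketSize(bucket);
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(cvModel, data, 10, new Random(1));

            System.out.printf("minBucketSize=%d  train=%.2f%%  CV=%.2f%%%n",
                    bucket, train.pctCorrect(), cv.pctCorrect());
        }
    }
}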
3.3. Using probabilities
Lecture of Naïve Bayes: [1]
Naïve Bayes assumes that all attributes contribute equally and independently to the decision, and that no attributes are identical (redundant).
Follow the instructions in [1] to examine NaiveBayes on weather.nominal.
Classifier model Performance
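A minimal sketch for this exercise via the Weka API, again assuming weka.jar is on the classpath and the ARFF file is in a local data/ directory: it prints the NaiveBayes model (the per-class attribute counts it combines under the independence assumption) and the 10-fold cross-validation performance summary.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Classifier model: per-class count tables for each attribute.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);

        // Performance under the same 10-fold cross-validation as the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}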
3.4. Decision Trees
Lecture of decision trees: [1]
How to calculate entropy and information gain?
Entropy measures the impurity of a collection.
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
Information Gain measures the Expected Reduction in Entropy.
Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)
Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.
Build a decision tree for the weather data step by step:
Compute Entropy and Info. Gain:

Entropy of the whole data set (9 yes, 5 no out of 14 instances):
Entropy(S) = -(9/14 · log2(9/14) + 5/14 · log2(5/14)) = 0.94

Temperature has 3 distinct values (hot, mild, cool):
Cool: 4 records, 3 are yes: -(3/4 · log2(3/4) + 1/4 · log2(1/4)) = 0.81
Hot: 4 records, 2 are yes: -(2/4 · log2(2/4) + 2/4 · log2(2/4)) = 1.00
Mild: 6 records, 4 are yes: -(4/6 · log2(4/6) + 2/6 · log2(2/6)) = 0.92
Expected new entropy: 4/14 · 0.81 + 4/14 · 1.00 + 6/14 · 0.92 = 0.91
Information gain: 0.94 − 0.91 = 0.03

Windy has 2 distinct values (false, true):
False: 8 records, 6 are yes: -(6/8 · log2(6/8) + 2/8 · log2(2/8)) = 0.81
True: 6 records, 3 are yes: -(3/6 · log2(3/6) + 3/6 · log2(3/6)) = 1.00
Expected new entropy: 8/14 · 0.81 + 6/14 · 1.00 = 0.89
Information gain: 0.94 − 0.89 = 0.05

Humidity has 2 distinct values (normal, high):
Normal: 7 records, 6 are yes: -(6/7 · log2(6/7) + 1/7 · log2(1/7)) = 0.59
High: 7 records, 3 are yes: -(3/7 · log2(3/7) + 4/7 · log2(4/7)) = 0.99
Expected new entropy: 7/14 · 0.59 + 7/14 · 0.99 = 0.79
Information gain: 0.94 − 0.79 = 0.15

Outlook has 3 distinct values (sunny: 5 records, 2 yes; overcast: 4 records, 4 yes; rainy: 5 records, 3 yes):
Expected new entropy: 5/14 · 0.97 + 4/14 · 0.00 + 5/14 · 0.97 = 0.69
Information gain: 0.94 − 0.69 = 0.25

Selected attribute: outlook, which has the largest information gain, becomes the root of the tree.
Final decision tree (see figure): outlook is tested at the root, humidity is tested under the sunny branch, windy is tested under the rainy branch, and overcast leads directly to yes.
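To double-check the hand calculation, the following sketch computes the entropies and information gains directly from the yes/no counts of the weather data (the counts are hard-coded from the table above; the class and method names are illustrative).

public class InfoGainByHand {
    // Entropy of a two-class subset given its yes/no counts.
    static double entropy(int yes, int no) {
        double total = yes + no, e = 0.0;
        for (double p : new double[] { yes / total, no / total }) {
            if (p > 0) e -= p * Math.log(p) / Math.log(2);  // log base 2
        }
        return e;
    }

    // Gain = entropy before the split minus the weighted entropy after it.
    static double gain(double before, int[][] splits, int total) {
        double after = 0.0;
        for (int[] s : splits) {
            after += ((double) (s[0] + s[1]) / total) * entropy(s[0], s[1]);
        }
        return before - after;
    }

    public static void main(String[] args) {
        double base = entropy(9, 5);  // whole weather data set: 9 yes, 5 no
        System.out.printf("Entropy(S)        = %.3f%n", base);
        System.out.printf("Gain(temperature) = %.3f%n",
                gain(base, new int[][] { {3, 1}, {2, 2}, {4, 2} }, 14));
        System.out.printf("Gain(windy)       = %.3f%n",
                gain(base, new int[][] { {6, 2}, {3, 3} }, 14));
        System.out.printf("Gain(humidity)    = %.3f%n",
                gain(base, new int[][] { {6, 1}, {3, 4} }, 14));
        System.out.printf("Gain(outlook)     = %.3f%n",
                gain(base, new int[][] { {2, 3}, {4, 0}, {3, 2} }, 14));
    }
}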
Use Weka to examine J48 on the weather data.
3.5. Pruning decision trees
Follow the lecture of pruning decision tree in [1] …
Why pruning? To prevent overfitting to noise in the data.
In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?
- minNumObj: the minimum number of instances per leaf.
- confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).
Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:
Dataset J48 (default, pruned) J48 (unpruned)
diabetes.arff 73.82% 72.66%
breast‐cancer.arff 75.52% 69.58%
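The comparison can be reproduced with the Weka API as sketched below (weka.jar on the classpath and the ARFF files in a local data/ directory are assumed). A default J48 is pruned with confidenceFactor = 0.25 and minNumObj = 2; setUnpruned(true) switches pruning off.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Pruning {
    public static void main(String[] args) throws Exception {
        for (String f : new String[] { "diabetes.arff", "breast-cancer.arff" }) {
            Instances data = DataSource.read("data/" + f);  // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            // Default J48: pruned, confidenceFactor = 0.25, minNumObj = 2.
            J48 pruned = new J48();

            // Unpruned variant for comparison.
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);

            System.out.printf("%-20s pruned: %.2f%%   unpruned: %.2f%%%n",
                    f, cv(pruned, data), cv(unpruned, data));
        }
    }

    // 10-fold cross-validation accuracy.
    static double cv(J48 tree, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}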
3.6. Nearest neighbor
Follow the lecture in [1]
“Instance‐based” learning = “nearest‐neighbor” learning
What is k-nearest-neighbors (k-NN)? A new instance is assigned the majority class among its k nearest neighbors in the training set.
Follow the instructions in [1] to run lazy > IBk on the glass dataset with k = 1, 5, and 20, and then fill in its accuracy in the following table:
Dataset   IBk, k = 1 (%)   IBk, k = 5 (%)   IBk, k = 20 (%)
Glass     70.56            67.76            65.42
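A short sketch of the same experiment through the API (weka.jar on the classpath and glass.arff in a local data/ directory are assumed); exact accuracies may differ slightly from the Explorer run depending on the cross-validation seed.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate k-nearest-neighbor classification for several values of k.
        for (int k : new int[] { 1, 5, 20 }) {
            IBk knn = new IBk();
            knn.setKNN(k);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k = %2d  accuracy = %.2f%%%n", k, eval.pctCorrect());
        }
    }
}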