
Nguyen Quoc Khanh – ITITIU18186

Introduction to Data Mining


Lab 3 – Simple Classifiers

3.1. Simplicity first!


In the third class, we learn how to examine some data mining algorithms on datasets using Weka (see the class 3 lecture by Ian H. Witten [1]).

In this section, we look at how OneR works: one attribute does all the work. Open weather.nominal.arff, run OneR, and look at the classifier model. What does it look like?

[1] Ian H. Witten, Data Mining with Weka (MOOC), http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/


- Remarks: OneR tries each of the 4 attributes and keeps the single rule with the fewest errors, which here is based on outlook; the class is then predicted from the value of outlook alone. On the training data this rule classifies 10/14 instances correctly. Under cross-validation, however, only 6 instances are correctly classified, a success rate of just 42.86%.
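This run can also be reproduced outside the Explorer with the Weka Java API. The sketch below is only an illustration: it assumes a standard Weka 3.x distribution and that weather.nominal.arff from Weka's data folder is in the working directory.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRWeather {
        public static void main(String[] args) throws Exception {
            // Load the dataset; the last attribute (play) is the class.
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Build OneR on the full training set and print its one-attribute rule.
            OneR oner = new OneR();
            oner.buildClassifier(data);
            System.out.println(oner);

            // Evaluate on the training data (the 10/14-correct figure above).
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(oner, data);
            System.out.println("Training accuracy: " + trainEval.pctCorrect());

            // 10-fold cross-validation with seed 1 (the Explorer default),
            // which gives the 42.86% figure above.
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new OneR(), data, 10, new Random(1));
            System.out.println("Cross-validation accuracy: " + cvEval.pctCorrect());
        }
    }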

Use OneR to build a one-level decision tree (a single rule) for some datasets. Compared with ZeroR, how does OneR perform?

Dataset                 OneR accuracy (%)   ZeroR accuracy (%)

weather.nominal.arff    42.86               64.29
supermarket.arff        67.21               63.71
glass.arff              57.94               35.51
vote.arff               95.63               61.38
iris.arff               92.00               33.33
diabetes.arff           71.48               65.10
labor.arff              71.93               64.91
soybean.arff            39.97               13.47
breast-cancer.arff      65.73               70.28
credit-g.arff           66.10               70.00

3.2. Overfitting
What is “overfitting”? - Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship, typically because the model is too complex, the data contain noise or errors, or an unsuitable fitting criterion is applied; the result is poor prediction on new data. To avoid it, use cross-validation, pruning, etc. [ref: http://en.wikipedia.org/wiki/Overfitting]

Follow the instructions in [1] and run OneR on the weather.numeric and diabetes datasets…

Write down the results in the following table: (cross-validation used)

Dataset                              OneR accuracy (%)   ZeroR accuracy (%)

weather.numeric                      42.96               64.29
weather.numeric w/o outlook attr.    50.00               64.29
diabetes                             71.48               65.10
diabetes w/ minBucketSize 1          57.16               not run

MinBucketSize? - 1

Remark? - With minBucketSize = 1, OneR is allowed to form very small buckets when discretizing the numeric attributes, so it builds a complex, over-specific rule that fits noise in the training data (overfitting); cross-validation accuracy on diabetes drops from 71.48% (default minBucketSize = 6) to 57.16%.
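The minBucketSize experiment can be scripted in the same way. This is a minimal sketch, assuming a standard Weka 3.x install and diabetes.arff in the working directory.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRBucketSize {
        // Cross-validated accuracy of OneR for a given minBucketSize.
        static double accuracy(Instances data, int bucketSize) throws Exception {
            OneR oner = new OneR();
            oner.setMinBucketSize(bucketSize);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(oner, data, 10, new Random(1));
            return eval.pctCorrect();
        }

        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("diabetes.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Default bucket size (6) vs. 1: the tiny buckets let OneR build a
            // long, over-specific rule on a numeric attribute, so the
            // cross-validated accuracy drops.
            System.out.println("minBucketSize 6: " + accuracy(data, 6));
            System.out.println("minBucketSize 1: " + accuracy(data, 1));
        }
    }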


3.3. Using probabilities


Lecture of Naïve Bayes: [1]

- Naïve Bayes assumes that all attributes contribute equally and independently to the decision, so the dataset should not contain identical (redundant) attributes.

Follow the instructions in [1] to examine NaiveBayes on weather.nominal.

Classifier model Performance
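One way to fill in this table is via the Weka Java API. The sketch below is an assumption-laden illustration (class names as in a standard Weka 3.x distribution, ARFF path assumed); it prints the Naïve Bayes model and its cross-validated performance on weather.nominal.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesWeather {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Build the model to inspect the class prior and per-attribute counts.
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);
            System.out.println(nb);

            // 10-fold cross-validation for the "Performance" column.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }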

3.4. Decision Trees


Lecture of decision trees: [1]

How to calculate entropy and information gain?

Entropy measures the impurity of a collection.


Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i


Information Gain measures the Expected Reduction in Entropy.

Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.

Build a decision tree for the weather data step by step:

Compute Entropy and Info. Gain for each candidate attribute. Entropy of the whole set (9 yes, 5 no): -(9/14 log2(9/14) + 5/14 log2(5/14)) = 0.94.

Temperature has 3 distinct values:
- Cool: 4 records, 3 are yes:
  -(3/4 log2(3/4) + 1/4 log2(1/4)) = 0.81
- Hot: 4 records, 2 are yes:
  -(2/4 log2(2/4) + 2/4 log2(2/4)) = 1.0
- Mild: 6 records, 4 are yes:
  -(4/6 log2(4/6) + 2/6 log2(2/6)) = 0.92
Expected new entropy:
4/14 x 0.81 + 4/14 x 1.0 + 6/14 x 0.92 = 0.91
Information Gain:
0.94 - 0.91 = 0.03

Windy has 2 distinct values:
- False: 8 records, 6 are yes:
  -(6/8 log2(6/8) + 2/8 log2(2/8)) = 0.81
- True: 6 records, 3 are yes:
  -(3/6 log2(3/6) + 3/6 log2(3/6)) = 1.0
Expected new entropy:
8/14 x 0.81 + 6/14 x 1.0 = 0.89
Information Gain:
0.94 - 0.89 = 0.05

Humidity has 2 distinct values:
- Normal: 7 records, 6 are yes:
  -(6/7 log2(6/7) + 1/7 log2(1/7)) = 0.59
- High: 7 records, 3 are yes:
  -(3/7 log2(3/7) + 4/7 log2(4/7)) = 0.99
Expected new entropy:
7/14 x 0.59 + 7/14 x 0.99 = 0.79
Information Gain:
0.94 - 0.79 = 0.15

Selected attribute: Outlook (computed in the lecture) has the highest information gain, about 0.25, so it becomes the root of the tree.

Final decision tree
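The hand calculations above can be checked with a few lines of plain Java (a small sketch; the class counts are read off the weather data directly).

    public class InfoGain {
        // Entropy(S) = -sum_i p_i * log2(p_i), computed from class counts.
        static double entropy(int... classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double h = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        public static void main(String[] args) {
            // Whole weather dataset: 9 yes, 5 no.
            double before = entropy(9, 5);   // ~0.94

            // Humidity splits the 14 instances into normal (6 yes, 1 no)
            // and high (3 yes, 4 no).
            double after = 7.0 / 14 * entropy(6, 1) + 7.0 / 14 * entropy(3, 4);

            System.out.printf("Entropy(S) = %.2f%n", before);
            System.out.printf("Gain(S, humidity) = %.2f%n", before - after);  // ~0.15
        }
    }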

Use Weka to examine J48 on the weather data.
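A minimal sketch of this step with the Weka Java API (class names per a standard Weka 3.x distribution; the ARFF path is an assumption):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Weather {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Build the C4.5-style tree and print it; outlook should appear at
            // the root, matching the attribute with the highest information gain.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }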

3.5. Pruning decision trees


Follow the lecture of pruning decision tree in [1] …

Why pruning? - Prevent overfitting to noise in the data.

In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?

- minNumObj: the minimum number of instances per leaf.


- confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).

Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:

Dataset J48 (default, pruned) J48 (unpruned)


diabetes.arff 73.82% 72.66%
breast-cancer.arff 75.52% 69.58%
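These figures come from the Weka Explorer. A sketch of the same comparison with the Weka Java API (assuming the two ARFF files are in the working directory and Weka's default 10-fold cross-validation with seed 1):

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Pruning {
        static double cvAccuracy(String arff, boolean unpruned) throws Exception {
            Instances data = new DataSource(arff).getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Defaults: pruned tree, confidenceFactor 0.25, minNumObj 2.
            J48 tree = new J48();
            tree.setUnpruned(unpruned);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            return eval.pctCorrect();
        }

        public static void main(String[] args) throws Exception {
            for (String arff : new String[]{"diabetes.arff", "breast-cancer.arff"}) {
                System.out.printf("%s  pruned: %.2f  unpruned: %.2f%n",
                        arff, cvAccuracy(arff, false), cvAccuracy(arff, true));
            }
        }
    }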

3.6. Nearest neighbor


Follow the lecture in [1]

“Instance‐based” learning = “nearest‐neighbor” learning

What is k-nearest-neighbors (k-NN)? – To classify a new instance, find its k nearest training instances and predict the majority class among those neighbors.


Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill its
accuracy in the following table:

Dataset (accuracy, %)   IBk, k = 1   IBk, k = 5   IBk, k = 20

glass.arff              70.567       67.757       65.4206
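A sketch of the same runs with the Weka Java API (assuming glass.arff is in the working directory; IBk's other options are left at their defaults):

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IBkGlass {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("glass.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Larger k smooths the decision boundary; on glass the accuracy
            // drops as k grows, as in the table above.
            for (int k : new int[]{1, 5, 20}) {
                IBk knn = new IBk(k);   // k-nearest-neighbour classifier
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(knn, data, 10, new Random(1));
                System.out.printf("k = %2d: %.2f%%%n", k, eval.pctCorrect());
            }
        }
    }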
