Nguyen Quoc Khanh – ITITIU18186
Introduction to Data Mining
Lab 3 – Simple Classifiers
3.1. Simplicity first!
In the third class we learn how to run some data mining algorithms on various datasets using Weka (see the lecture of Class 3 by Ian H. Witten [1]¹).
In this section we look at how OneR ("one attribute does all the work") behaves. Open weather.nominal.arff, run OneR, and look at the classifier model: what does it show?
¹ http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
- Remarks: Out of the 4 attributes, OneR selects outlook and builds one rule branch per outlook value. Evaluated on the training set, this rule classifies 10/14 instances correctly. Under 10-fold cross-validation, however, only 6 of the 14 instances are classified correctly, giving an accuracy of just 42.86%.
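The same run can be reproduced outside the Explorer with the Weka Java API. The sketch below is a minimal example, assuming weka.jar is on the classpath and that the ARFF file sits in a local data/ directory (the path and the class name are illustrative); it prints the OneR rule, the training-set accuracy, and the 10-fold cross-validation accuracy discussed above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRWeather {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the path is an assumption about the local setup.
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class attribute = "play"

        // Build OneR on all 14 instances and print the single-attribute rule.
        OneR oneR = new OneR();
        oneR.buildClassifier(data);
        System.out.println(oneR);

        // Accuracy on the training data itself (10/14 in the remarks above).
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(oneR, data);
        System.out.printf("Training accuracy: %.2f%%%n", trainEval.pctCorrect());

        // 10-fold cross-validation (6/14 = 42.86% in the remarks above).
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new OneR(), data, 10, new Random(1));
        System.out.printf("Cross-validation accuracy: %.2f%%%n", cvEval.pctCorrect());
    }
}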
Use OneR to build a classifier for several datasets. Compared with ZeroR, how does OneR perform?
Dataset               OneR accuracy (%)   ZeroR accuracy (%)
weather.nominal.arff  42.86               64.29
supermarket.arff      67.21               63.71
glass.arff            57.94               35.51
vote.arff             95.63               61.38
iris.arff             92.00               33.33
diabetes.arff         71.48               65.10
labor.arff            71.93               64.91
soybean.arff          39.97               13.47
breast-cancer.arff    65.73               70.28
credit-g.arff         66.10               70.00
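The table above can also be filled in programmatically. This is a rough sketch under the same assumptions (weka.jar on the classpath, ARFF files in a local data/ directory); it loops over the datasets and prints the 10-fold cross-validation accuracy of OneR and ZeroR side by side. Exact figures may differ slightly from the Explorer depending on the random seed.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRVersusZeroR {
    public static void main(String[] args) throws Exception {
        String[] files = { "weather.nominal.arff", "supermarket.arff", "glass.arff",
                           "vote.arff", "iris.arff", "diabetes.arff", "labor.arff",
                           "soybean.arff", "breast-cancer.arff", "credit-g.arff" };
        for (String f : files) {
            Instances data = DataSource.read("data/" + f);  // assumed location
            data.setClassIndex(data.numAttributes() - 1);   // class = last attribute
            System.out.printf("%-22s OneR: %6.2f%%   ZeroR: %6.2f%%%n",
                    f, crossValidate(new OneR(), data), crossValidate(new ZeroR(), data));
        }
    }

    // 10-fold cross-validation accuracy, matching the Explorer's default setup.
    static double crossValidate(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}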
3.2. Overfitting
What is "overfitting"? Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship, for example because the model is too complex, the data contain noise or errors, or an unsuitable fitting criterion is applied; the result is poor prediction on new data. To avoid it, use cross-validation, pruning, and similar techniques. [ref: http://en.wikipedia.org/wiki/Overfitting]
Follow the instructions in [1] and run OneR on the weather.numeric and diabetes datasets.
Write down the results in the following table (10-fold cross-validation is used):
Dataset                              OneR accuracy (%)   ZeroR accuracy (%)
weather.numeric                      42.86               64.29
weather.numeric (without outlook)    50.00               64.29
diabetes                             71.48               65.10
diabetes (minBucketSize = 1)         57.16               None

(Classifier models: see the Weka output for each run.)
minBucketSize? The minimum bucket size OneR uses when discretizing numeric attributes (default 6); it was set to 1 in the last row above.
Remark? With minBucketSize = 1, OneR builds a very fine-grained rule that fits the training data closely but overfits: cross-validation accuracy drops from 71.48% to 57.16%, below even ZeroR's 65.10%.
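The overfitting effect of minBucketSize can be demonstrated with a short sketch (same assumptions as before: weka.jar on the classpath, diabetes.arff in a local data/ directory). It compares the default bucket size of 6 with a bucket size of 1, reporting both training-set and cross-validation accuracy.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRBucketSize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        for (int bucket : new int[] { 6, 1 }) {  // default vs. overfitting setting
            // Training-set accuracy: grows as the rule gets more specific.
            OneR oneR = new OneR();
            oneR.setMinBucketSize(bucket);
            oneR.buildClassifier(data);
            Evaluation train = new Evaluation(data);
            train.evaluateModel(oneR, data);

            // Cross-validation accuracy: drops when the rule overfits.
            OneR cvModel = new OneR();
            cvModel.setMinBucketSize(bucket);
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(cvModel, data, 10, new Random(1));

            System.out.printf("minBucketSize=%d  train=%.2f%%  CV=%.2f%%%n",
                    bucket, train.pctCorrect(), cv.pctCorrect());
        }
    }
}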
3.3. Using probabilities
Lecture of Naïve Bayes: [1]
Naïve Bayes assumes that all attributes contribute equally and independently to the decision, and that no attributes are identical (redundant).
Follow the instructions in [1] to examine NaiveBayes on weather.nominal.
Classifier model Performance
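A minimal sketch for this exercise via the Weka API, again assuming weka.jar is on the classpath and the ARFF file is in a local data/ directory: it prints the NaiveBayes model (the per-class attribute counts it combines under the independence assumption) and the 10-fold cross-validation performance summary.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Classifier model: per-class count tables for each attribute.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);

        // Performance under the same 10-fold cross-validation as the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}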
3.4. Decision Trees
Lecture of decision trees: [1]
How to calculate entropy and information gain?
Entropy measures the impurity of a collection.
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
Information Gain measures the Expected Reduction in Entropy.
Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)
Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.
Build a decision tree for the weather data step by step:
Compute Entropy and Info. Gain:

Entropy of the whole data set (9 yes, 5 no out of 14 instances):
Entropy(S) = -(9/14 · log2(9/14) + 5/14 · log2(5/14)) = 0.94

Temperature has 3 distinct values (hot, mild, cool):
Cool: 4 records, 3 are yes: -(3/4 · log2(3/4) + 1/4 · log2(1/4)) = 0.81
Hot: 4 records, 2 are yes: -(2/4 · log2(2/4) + 2/4 · log2(2/4)) = 1.00
Mild: 6 records, 4 are yes: -(4/6 · log2(4/6) + 2/6 · log2(2/6)) = 0.92
Expected new entropy: 4/14 · 0.81 + 4/14 · 1.00 + 6/14 · 0.92 = 0.91
Information gain: 0.94 − 0.91 = 0.03

Windy has 2 distinct values (false, true):
False: 8 records, 6 are yes: -(6/8 · log2(6/8) + 2/8 · log2(2/8)) = 0.81
True: 6 records, 3 are yes: -(3/6 · log2(3/6) + 3/6 · log2(3/6)) = 1.00
Expected new entropy: 8/14 · 0.81 + 6/14 · 1.00 = 0.89
Information gain: 0.94 − 0.89 = 0.05

Humidity has 2 distinct values (normal, high):
Normal: 7 records, 6 are yes: -(6/7 · log2(6/7) + 1/7 · log2(1/7)) = 0.59
High: 7 records, 3 are yes: -(3/7 · log2(3/7) + 4/7 · log2(4/7)) = 0.99
Expected new entropy: 7/14 · 0.59 + 7/14 · 0.99 = 0.79
Information gain: 0.94 − 0.79 = 0.15

Outlook has 3 distinct values (sunny: 5 records, 2 yes; overcast: 4 records, 4 yes; rainy: 5 records, 3 yes):
Expected new entropy: 5/14 · 0.97 + 4/14 · 0.00 + 5/14 · 0.97 = 0.69
Information gain: 0.94 − 0.69 = 0.25

Selected attribute: outlook, which has the largest information gain, becomes the root of the tree.
Final decision tree (see figure): outlook is tested at the root, humidity is tested under the sunny branch, windy is tested under the rainy branch, and overcast leads directly to yes.
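To double-check the hand calculation, the following sketch computes the entropies and information gains directly from the yes/no counts of the weather data (the counts are hard-coded from the table above; the class and method names are illustrative).

public class InfoGainByHand {
    // Entropy of a two-class subset given its yes/no counts.
    static double entropy(int yes, int no) {
        double total = yes + no, e = 0.0;
        for (double p : new double[] { yes / total, no / total }) {
            if (p > 0) e -= p * Math.log(p) / Math.log(2);  // log base 2
        }
        return e;
    }

    // Gain = entropy before the split minus the weighted entropy after it.
    static double gain(double before, int[][] splits, int total) {
        double after = 0.0;
        for (int[] s : splits) {
            after += ((double) (s[0] + s[1]) / total) * entropy(s[0], s[1]);
        }
        return before - after;
    }

    public static void main(String[] args) {
        double base = entropy(9, 5);  // whole weather data set: 9 yes, 5 no
        System.out.printf("Entropy(S)        = %.3f%n", base);
        System.out.printf("Gain(temperature) = %.3f%n",
                gain(base, new int[][] { {3, 1}, {2, 2}, {4, 2} }, 14));
        System.out.printf("Gain(windy)       = %.3f%n",
                gain(base, new int[][] { {6, 2}, {3, 3} }, 14));
        System.out.printf("Gain(humidity)    = %.3f%n",
                gain(base, new int[][] { {6, 1}, {3, 4} }, 14));
        System.out.printf("Gain(outlook)     = %.3f%n",
                gain(base, new int[][] { {2, 3}, {4, 0}, {3, 2} }, 14));
    }
}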
Use Weka to examine J48 on the weather data.
3.5. Pruning decision trees
Follow the lecture of pruning decision tree in [1] …
Why pruning? To prevent overfitting to noise in the data.
In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?
- minNumObj: the minimum number of instances per leaf.
- confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).
Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:
Dataset J48 (default, pruned) J48 (unpruned)
diabetes.arff 73.82% 72.66%
breast‐cancer.arff 75.52% 69.58%
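The comparison can be reproduced with the Weka API as sketched below (weka.jar on the classpath and the ARFF files in a local data/ directory are assumed). A default J48 is pruned with confidenceFactor = 0.25 and minNumObj = 2; setUnpruned(true) switches pruning off.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Pruning {
    public static void main(String[] args) throws Exception {
        for (String f : new String[] { "diabetes.arff", "breast-cancer.arff" }) {
            Instances data = DataSource.read("data/" + f);  // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            // Default J48: pruned, confidenceFactor = 0.25, minNumObj = 2.
            J48 pruned = new J48();

            // Unpruned variant for comparison.
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);

            System.out.printf("%-20s pruned: %.2f%%   unpruned: %.2f%%%n",
                    f, cv(pruned, data), cv(unpruned, data));
        }
    }

    // 10-fold cross-validation accuracy.
    static double cv(J48 tree, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}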
3.6. Nearest neighbor
Follow the lecture in [1]
“Instance‐based” learning = “nearest‐neighbor” learning
What is k-nearest-neighbors (k-NN)? A new instance is assigned the majority class among its k nearest neighbors in the training set.
Follow the instructions in [1] to run lazy > IBk on the glass dataset with k = 1, 5, and 20, and then fill in its accuracy in the following table:
Dataset   IBk, k = 1 (%)   IBk, k = 5 (%)   IBk, k = 20 (%)
Glass     70.56            67.76            65.42
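A short sketch of the same experiment through the API (weka.jar on the classpath and glass.arff in a local data/ directory are assumed); exact accuracies may differ slightly from the Explorer run depending on the cross-validation seed.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate k-nearest-neighbor classification for several values of k.
        for (int k : new int[] { 1, 5, 20 }) {
            IBk knn = new IBk();
            knn.setKNN(k);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k = %2d  accuracy = %.2f%%%n", k, eval.pctCorrect());
        }
    }
}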