Data Mining - Lab 2

Lab 2 focuses on classification in data mining, where students will implement various algorithms including Decision Tree, Random Forest, Naive Bayes, SVM, and Neural Networks using the Adult Census Income dataset. Students will evaluate model performance using metrics such as confusion matrix, accuracy, precision, recall, and ROC curve. The assignment requires a detailed report in Jupyter Notebook format, including self-evaluation and analysis of results.

Lab 2.

Classification Data Mining CSC14004

Lab 2

Classification
In data mining, classification is an important task used to categorize data into classes.
The objective is to assign a label to each input sample based on its attributes (also called features).
This lab focuses on implementing widely used classification algorithms:

1. Decision Tree

2. Ensemble Model: Random Forest

3. Naive Bayes

4. Support Vector Machine

5. Neural Network: Multilayer Perceptron

Students will explore these algorithms with a real-world dataset and compare the results using
various evaluation metrics:

1. Confusion Matrix

2. Accuracy

3. Precision

4. Recall (Sensitivity or True Positive Rate)

5. Specificity (True Negative Rate)

6. F1 Score

7. ROC Curve and AUC Score

By completing this lab, students will:

1. Become familiar with libraries that support classification tasks and use them to implement the algorithms above.

2. Experiment with hyper-parameter tuning to optimize model performance.

3. Visualize and analyze the performance of the models using evaluation metrics.

4. Gain insights into the strengths and weaknesses of different classification algorithms.

University of Science Faculty of Information Technology Page 1


1 Dataset
The Adult Census Income dataset, also known as the Census Income dataset or the Adult dataset,
is sourced from the U.S. Census Bureau Database and is commonly used in classification tasks.
The goal of this dataset is to predict if an individual’s annual income exceeds $50,000 based on
demographic characteristics.
The dataset contains 48,842 records and 14 attributes (features) as follows:

Attribute        Description
age              The age of the individual.
workclass        The type of employment (e.g., Private, Government, Self-employed, etc.).
fnlwgt           The final weight of the survey.
education        The level of education.
education-num    The number of years of education, in integer form.
marital-status   Marital status (single, married, divorced, etc.).
occupation       Occupation (Managerial, Clerical, Service, etc.).
relationship     Relationship to the household (spouse, child, other relative, etc.).
race             Race.
sex              Gender.
capital-gain     Income from capital (not from wages).
capital-loss     Loss from capital (from investments).
hours-per-week   The number of hours worked per week.
native-country   Country of origin.

The dataset also contains a target variable, income, with two values, <=50K and >50K,
representing the individual's income.
To get the dataset, please visit: Adult - UCI Machine Learning.
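As a rough illustration of loading the data, the sketch below reads a few rows with pandas. The three rows are made up for illustration and are not real records. The column order, the ", " field separator, and the use of "?" for missing values are assumptions about the downloaded file and should be checked against it.

```python
import io

import pandas as pd

# Column names follow the attribute table above, plus the target column.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Made-up rows for illustration only; replace SAMPLE with the path to
# the downloaded data file.
SAMPLE = io.StringIO(
    "39, Private, 120000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
    "52, ?, 98000, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, White, Male, 5000, 0, 45, United-States, >50K\n"
    "28, Private, 210000, Masters, 14, Never-married, Tech-support, "
    "Own-child, Black, Female, 0, 0, 38, ?, <=50K\n"
)

df = pd.read_csv(
    SAMPLE,
    header=None,
    names=COLUMNS,
    skipinitialspace=True,  # assumption: fields are separated by ", "
    na_values="?",          # assumption: "?" marks a missing value
)

# Encode the target as 0/1, treating >50K as the positive class.
y = (df["income"] == ">50K").astype(int)
X = df.drop(columns="income")

# Show only the columns that contain missing values.
missing = df.isna().sum()
print(missing[missing > 0])
```

Treating >50K as the positive class is a choice made here for illustration; whichever class is chosen should be stated in the report, because precision, recall, and specificity depend on it.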

You are probably already familiar with this dataset from Project 1 - Data Preprocessing.
You can reuse the results of that assignment to continue developing classification algorithms, or
preprocess the data again to suit this problem.
Students need to split the dataset into two subsets: a training set (70%) and a test set (30%)
using the train_test_split function from the scikit-learn library. To ensure reproducibility,
the random_state parameter should be set. You need to shuffle the data before splitting and split
it in a stratified fashion. Other parameters (if any) should be left at their default settings.
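The split described above might look like the following sketch. The data here is a synthetic stand-in generated with make_classification, and the value 42 for random_state is an arbitrary choice; any fixed integer makes the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Adult data (imbalanced classes).
X, y = make_classification(
    n_samples=1000, n_features=14, weights=[0.75, 0.25], random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70% training set, 30% test set
    shuffle=True,      # the default, stated explicitly here
    stratify=y,        # keep the class ratio the same in both subsets
    random_state=42,   # arbitrary fixed seed for reproducibility
)

print(X_train.shape, X_test.shape)
# The positive-class fraction should be nearly identical in both subsets.
print(y_train.mean(), y_test.mean())
```

Stratifying on y matters for this dataset because the two income classes are not equally frequent; without it, the class ratio in the test set can drift from the ratio in the training set.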


2 Requirements
2.1 Classification algorithms implementation
Students will implement the following classification algorithms. They may use any Python library
(e.g., scikit-learn, tensorflow, keras).

1. Decision Tree

• Use a decision tree classifier and tune max_depth and criterion
(e.g., gini, entropy).

2. Ensemble Model: Random Forest

• Build an ensemble of decision trees.

• Experiment with n_estimators, max_features, and bootstrap.

3. Naive Bayes

• Implement Gaussian Naive Bayes (or any other variants like Bernoulli or Multinomial
if applicable to the dataset).
• Experiment with var_smoothing for Gaussian Naive Bayes or alpha for Bernoulli or
Multinomial Naive Bayes.

4. Support Vector Machine (SVM)

• Train an SVM model with different kernels (linear, poly, rbf or sigmoid).
• Experiment with parameter tuning for C or gamma.

5. Neural Network: Multilayer Perceptron (MLP)

• Build a feedforward neural network.

• Experiment with hidden layer sizes, activation functions, and learning rates.
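One possible way to set up the five classifiers with scikit-learn is sketched below. The data is synthetic, every hyperparameter value is a placeholder rather than a recommendation, and wrapping the SVM and MLP in a StandardScaler pipeline is an assumption of this sketch (both models are sensitive to feature scale).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Adult data.
X, y = make_classification(
    n_samples=600, n_features=14, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Placeholder hyperparameters; tune them as the lab requires.
models = {
    "Decision Tree": DecisionTreeClassifier(
        max_depth=5, criterion="gini", random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", bootstrap=True, random_state=42
    ),
    "Naive Bayes": GaussianNB(var_smoothing=1e-9),
    "SVM": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
    ),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(
            hidden_layer_sizes=(32, 16),
            activation="relu",
            learning_rate_init=0.001,
            max_iter=500,
            random_state=42,
        ),
    ),
}

# Fit every model on the training set and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Keeping the models in one dictionary makes it straightforward to loop over them again later for evaluation and comparison.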

Perform hyperparameter optimization for each algorithm to identify the best parameters:

• You can look up recommended parameters online, tune them manually, or, for a more rigorous
approach, use grid search or random search with cross-validation.

• Document the parameters tested and the best combination found.
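A grid search with cross-validation could be sketched as follows for the decision tree. The grid values, the 5-fold setting, and the choice of F1 as the scoring metric are assumptions made for this example, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Adult data.
X, y = make_classification(
    n_samples=600, n_features=14, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Illustrative grid: 4 depths x 2 criteria = 8 combinations.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation on the training set only
    scoring="f1",  # assumption: F1 is used because the classes are imbalanced
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")

# List every tested combination, which helps when documenting the search.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, f"{mean:.3f} +/- {std:.3f}")
```

The search uses only the training set, so the test set remains untouched for the final evaluation. RandomizedSearchCV has a similar interface and may be preferable when the grid is large.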


2.2 Evaluation
Evaluate each model using the following metrics, categorized by type:

• Performance Overview:

– Confusion Matrix: Visualize using a heatmap for a detailed summary of predictions.

– Accuracy: Overall percentage of correct predictions.

• Metrics for Positive and Negative Classes:

– Precision: Proportion of correctly predicted positives out of all predicted positives.

– Recall (Sensitivity or True Positive Rate): Proportion of actual positives correctly
identified.
– Specificity (True Negative Rate): Proportion of actual negatives correctly
identified.
– F1 Score: Harmonic mean of Precision and Recall, balancing their trade-off.

• Model Discrimination Ability:

– ROC Curve and AUC Score: Plot ROC curves for all models on the same graph
and compute the Area Under the Curve (AUC) to compare performance.
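To make the definitions above concrete, here is a small worked example in plain Python that derives the threshold-based metrics from the four confusion-matrix counts. The labels are a toy example, and the function names are hypothetical helpers written for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, specificity, and F1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these labels, TP = 3, FP = 1, TN = 5, and FN = 1, so accuracy is 8/10, precision and recall are both 3/4, and specificity is 5/6. In scikit-learn, specificity can be obtained as the recall of the negative class, for example recall_score(y_true, y_pred, pos_label=0).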

Use these metrics to analyze the trade-offs between different algorithms.
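Plotting the ROC curves of several models on one graph might be done along these lines. The sketch uses synthetic data and only two models to stay short; plot_roc is a hypothetical helper, and matplotlib is imported inside it so that the AUC computation runs even where matplotlib is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Adult data.
X, y = make_classification(
    n_samples=600, n_features=14, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Store (fpr, tpr, auc) per model, using the positive-class probability.
roc_data = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    roc_data[name] = (fpr, tpr, roc_auc_score(y_test, scores))


def plot_roc(roc_data):
    """Draw all ROC curves on one set of axes and return the figure."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, (fpr, tpr, auc) in roc_data.items():
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC curves")
    ax.legend(loc="lower right")
    return fig


for name, (_, _, auc) in roc_data.items():
    print(f"{name}: AUC = {auc:.3f}")
```

Calling plot_roc(roc_data) in a notebook cell displays the figure. For an SVC trained without probability=True, the output of decision_function can be passed to roc_curve in place of predict_proba.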

2.3 Comparison and Analysis

After implementing and evaluating all classification algorithms, analyze their performance across
all metrics and provide detailed insights:

• Compare the models on the evaluation metrics (accuracy, precision, recall, etc.) to identify the
best-performing model. Use the confusion matrix to analyze misclassifications and discuss the
potential reasons behind them (e.g., class imbalance).

• Compare runtime and scalability, noting trade-offs between speed and prediction quality.

• Recommend the best model(s) based on evaluation metrics, computational cost, and dataset
characteristics.
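Runtime can be compared with a simple timer around fit and predict, as in the sketch below. time_model is a hypothetical helper, the data is synthetic, and single-run wall-clock timings are noisy, so repeating each measurement and averaging is advisable.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Adult data.
X, y = make_classification(
    n_samples=2000, n_features=14, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)


def time_model(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds) for a single run."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Record training time, prediction time, and test accuracy side by side.
timings = {}
for name, model in models.items():
    fit_s, pred_s = time_model(model, X_train, y_train, X_test)
    accuracy = model.score(X_test, y_test)
    timings[name] = (fit_s, pred_s, accuracy)
    print(f"{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s, accuracy {accuracy:.3f}")
```

Reporting time next to accuracy in one table makes the trade-off between speed and prediction quality easier to discuss.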


3 Report (Jupyter Notebook)

The source code and results must be reported in a Jupyter Notebook that meets the following requirements:

• Student information (Student ID, full name, etc.).

• Self-evaluation of the assignment requirements.

• Detailed explanation of each step. Illustrative images, diagrams and equations are required.

• Each processing step must be fully commented, and results should be printed for observation.

• The report needs to be well-formatted.

• Before submitting, re-run the notebook (Kernel → Restart & Run All).

• References (if any).

4 Assessment
No.  Details                                     Score
1    Classification algorithms implementation.   50%
2    Evaluation.                                 25%
3    Comparison and Analysis.                    25%

5 Notices
Please pay attention to the following notices:

• This is an INDIVIDUAL assignment.

• Duration: about 2 weeks.

• Any plagiarism, cheating, or dishonesty will result in a grade of 0 for the course.

The end.
