1. Discuss the major issues of data mining using an example dataset.
To discuss the major issues of data mining, it's helpful to use a simple, relatable dataset. One of the most beginner-friendly yet real-world-relevant datasets is the:
Titanic Survival Dataset
Goal:
Predict whether a passenger survived the Titanic disaster based on features like age, gender,
class, and ticket fare.
Sample Data:
| PassengerId | Name | Sex | Age | Pclass | Fare | Survived |
| ----------- | ---- | --- | --- | ------ | ---- | -------- |
| 1 | Braund, Mr. Owen | male | 22 | 3 | 7.25 | 0 |
| 2 | Cumings, Mrs. | female | 38 | 1 | 71.28 | 1 |
| 3 | Heikkinen, Miss | female | 26 | 3 | 7.92 | 1 |
| 4 | Allen, Mr. | male | 35 | 1 | 53.10 | 0 |
Survived is the target variable (1 = Survived, 0 = Died).
Major Data Mining Issues (Explained with This Dataset)
1. Data Quality (Missing or Inaccurate Data)
• Some passengers have missing Age or Cabin values.
• Inaccurate data (e.g., Fare = 0 for some 1st class passengers)
Impact: Poor model accuracy; missing values need to be handled via imputation or deletion.
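As a rough sketch of handling missing values (assuming the data has been loaded into a pandas DataFrame; the mini-frame below is illustrative, not the real Titanic file):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mimicking the Titanic columns
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Impute the numeric Age column with the median of the observed values
df["Age"] = df["Age"].fillna(df["Age"].median())

# Cabin is mostly missing, so dropping the column is a common choice
df = df.drop(columns=["Cabin"])
```

Imputing with the median rather than the mean makes the fill value robust to a few extreme ages.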
2. Noisy Data
• Names like “Braund, Mr. Owen Harris” are long and inconsistent.
• Titles (Mr., Mrs., Dr., etc.) could be extracted for useful info, but they’re buried in
text.
Impact: Models may misinterpret raw text; you need to clean and extract useful features.
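One way to pull those buried titles out is a small regular expression over the Name column (a sketch; the sample names below are shortened versions of entries from the table above):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Capture the title: the text between the comma and the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
```

The extracted title can then be used as a clean categorical feature instead of the raw Name text.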
3. Irrelevant or Redundant Features
• PassengerId doesn’t help predict survival — it’s just a unique identifier.
• Ticket and Name are hard to use unless cleaned and engineered.
Impact: Including irrelevant features can reduce model performance.
4. Imbalanced Classes
• Around 38% survived and 62% did not — not heavily imbalanced, but can still affect
performance depending on the model used.
Impact: Class imbalance may lead models to favor the majority class (predict "did not
survive" more often).
5. Data Integration and Compatibility
If you combine Titanic data with other sources (e.g., weather conditions, ship deck layouts),
the formats and time references may not match.
Impact: Difficulties in merging datasets; possible introduction of inconsistencies.
6. Privacy and Ethics Concerns
Though the Titanic dataset is public, real-life data mining (e.g., on hospital or bank records)
can risk:
• Revealing personal identities
• Discriminatory decisions (e.g., predicting survival by gender or class)
Impact: Violating ethical guidelines or legal rules like GDPR.
7. Scalability
This dataset is small, but in real-time systems (like fraud detection), data mining models
need to process millions of records quickly.
Impact: Poorly designed algorithms can become too slow or crash at scale.
Summary Table:
| Issue | Example from Titanic Dataset | Solution |
| ----- | ---------------------------- | -------- |
| Missing Data | Missing Age or Cabin | Imputation, deletion |
| Noisy Data | Raw Name field | Extract titles like Mr., Mrs., etc. |
| Irrelevant Features | PassengerId | Drop unused columns |
| Imbalanced Classes | More non-survivors than survivors | Use class weighting or resampling |
| Data Integration | Merging Titanic data with external sources | Standardize formats and keys |
| Privacy & Ethics | Predicting based on gender or class | Be aware of bias, anonymize data |
| Scalability | N/A in Titanic, but common in live systems | Use distributed systems |
2. Elaborate on each step of the data preprocessing process using the House Price Prediction Dataset.
Use Case: House Price Prediction
We have data about houses. Each row is a house, and our goal is to predict how much it will
cost based on its size, number of bedrooms, location, and whether it has parking.
| House_ID | Area (sqft) | Bedrooms | Location | Parking | Price ($) |
| -------- | ----------- | -------- | -------- | ------- | --------- |
| H001 | 1000 | 2 | Suburb | Yes | 150000 |
| H002 | 1500 | 3 | City Center | No | 250000 |
| H003 | 1200 | 2 | Suburb | Yes | 180000 |
| H004 | NaN | 3 | Town | Yes | 220000 |
| H005 | 1100 | 2 | City Center | NaN | 200000 |
We need to prepare this data so a machine learning model can understand it and make
predictions.
Detailed Data Preprocessing Steps
1. Data Collection
This is the first step where you gather the data from different sources:
• Real estate websites
• Property agencies
• Excel sheets
• Databases
In our case: We already have a small dataset with 5 houses.
2. Data Cleaning
This step involves checking for:
• Missing values
• Incorrect or inconsistent values
• Outliers (values too high or low to be realistic)
Example Issues in Our Dataset:
• House H004 has missing Area → We'll fill it with the average area of the other
houses.
• House H005 has missing Parking → We'll fill it with the most common value ("Yes" in
our small dataset).
Fixes:
• Fill missing area:
o Average area of others: (1000 + 1500 + 1200 + 1100) / 4 = 1200
o So, replace NaN in H004 with 1200
• Fill missing Parking:
o “Yes” appears 3 times, “No” once → Replace missing with “Yes”
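In pandas, these two fixes can be sketched as follows (column names follow the table above; only the affected columns are shown):

```python
import pandas as pd
import numpy as np

# The five-house table from the example, affected columns only
houses = pd.DataFrame({
    "House_ID": ["H001", "H002", "H003", "H004", "H005"],
    "Area": [1000, 1500, 1200, np.nan, 1100],
    "Parking": ["Yes", "No", "Yes", "Yes", np.nan],
})

# Fill missing Area with the mean of the other houses (4800 / 4 = 1200)
houses["Area"] = houses["Area"].fillna(houses["Area"].mean())

# Fill missing Parking with the most common value ("Yes", 3 of 4)
houses["Parking"] = houses["Parking"].fillna(houses["Parking"].mode()[0])
```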
Now the cleaned data looks like:
| House_ID | Area | Bedrooms | Location | Parking | Price |
| -------- | ---- | -------- | -------- | ------- | ----- |
| H001 | 1000 | 2 | Suburb | Yes | 150000 |
| H002 | 1500 | 3 | City Center | No | 250000 |
| H003 | 1200 | 2 | Suburb | Yes | 180000 |
| H004 | 1200 | 3 | Town | Yes | 220000 |
| H005 | 1100 | 2 | City Center | Yes | 200000 |
3. Data Transformation
Now we convert text values (categorical data) into numerical values because models like
linear regression or decision trees work with numbers.
Convert Location (multi-class category) using One-Hot Encoding:
We create a separate column for each location:
| Location_CityCenter | Location_Suburb | Location_Town |
| ------------------- | --------------- | ------------- |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
Convert Parking:
• Yes → 1
• No → 0
So the full transformed data:
| Area | Bedrooms | Location_CityCenter | Location_Suburb | Location_Town | Parking | Price |
| ---- | -------- | ------------------- | --------------- | ------------- | ------- | ----- |
| 1000 | 2 | 0 | 1 | 0 | 1 | 150000 |
| 1500 | 3 | 1 | 0 | 0 | 0 | 250000 |
| 1200 | 2 | 0 | 1 | 0 | 1 | 180000 |
| 1200 | 3 | 0 | 0 | 1 | 1 | 220000 |
| 1100 | 2 | 1 | 0 | 0 | 1 | 200000 |
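With pandas, the same transformation can be sketched via get_dummies for the multi-class Location column and a simple map for the binary Parking column:

```python
import pandas as pd

# Categorical columns from the cleaned table
houses = pd.DataFrame({
    "Location": ["Suburb", "City Center", "Suburb", "Town", "City Center"],
    "Parking": ["Yes", "No", "Yes", "Yes", "Yes"],
})

# One-hot encode Location into one 0/1 column per distinct value
encoded = pd.get_dummies(houses, columns=["Location"], dtype=int)

# Map the binary Parking column to 1/0
encoded["Parking"] = encoded["Parking"].map({"Yes": 1, "No": 0})
```

get_dummies names the new columns Location_<value>, matching the table above.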
4. Feature Engineering (optional but useful)
Here we create new features that might help the model.
Example:
• Price_per_sqft = Price / Area
| Area | Price | Price_per_sqft |
| ---- | ----- | -------------- |
| 1000 | 150000 | 150 |
| 1500 | 250000 | 166.67 |
| ... | ... | ... |
This gives more insight than just raw price or area.
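As a sketch, the derived column is a single vectorized division in pandas:

```python
import pandas as pd

houses = pd.DataFrame({
    "Area": [1000, 1500],
    "Price": [150000, 250000],
})

# New feature: price per square foot
houses["Price_per_sqft"] = houses["Price"] / houses["Area"]
```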
5. Data Scaling
Some models need all numeric features to be on a similar scale (especially models like SVM,
KNN, or neural networks).
We scale Area, Price, and Price_per_sqft to a range of 0 to 1 or to have a mean of 0 and
standard deviation of 1.
Common tools: MinMaxScaler or StandardScaler from scikit-learn
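A minimal sketch of both options on the Area column (values from the cleaned table):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Area values from the cleaned table, as a single-feature column
area = np.array([[1000.0], [1500.0], [1200.0], [1200.0], [1100.0]])

# Scale to the [0, 1] range
minmax = MinMaxScaler().fit_transform(area)

# Or standardize to mean 0, standard deviation 1
standard = StandardScaler().fit_transform(area)
```

MinMaxScaler preserves the shape of the distribution; StandardScaler centers it, which some models (e.g., SVM) prefer.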
6. Data Splitting
To evaluate our model correctly, we split the data into:
• Training data (e.g., 80%): Used to train the model
• Test data (e.g., 20%): Used to check if the model performs well on unseen data
In Python:
from sklearn.model_selection import train_test_split

# Assuming the preprocessed table is in a DataFrame called df (illustrative name)
X = df.drop(columns=["Price"])  # feature columns
y = df["Price"]                 # target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Final Overview Table:
| Step | What Happens | Why It Matters |
| ---- | ------------ | -------------- |
| Data Collection | Get house data | You need data to learn from |
| Data Cleaning | Fix missing or incorrect values | Clean data leads to accurate models |
| Data Transformation | Convert text to numbers | ML models need numbers to work |
| Feature Engineering | Add extra helpful information | Can improve prediction accuracy |
| Data Scaling | Normalize data ranges | Makes training stable and fair |
| Data Splitting | Separate data into training and testing | Prevents overfitting and checks real performance |
3. Find Information Gain using the Weather dataset with a Decision Tree Classifier.
| Outlook | Temperature | Humidity | Windy | Play |
| -------- | ----------- | -------- | ----- | ---- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rain | Mild | High | False | Yes |
| Rain | Cool | Normal | False | Yes |
| Rain | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rain | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rain | Mild | High | True | No |
(Only the first four rows were given in the question; the remaining rows are the standard 14-row "play tennis" weather dataset, which the calculations below assume.)
What is Information Gain?
Information Gain measures how well a feature splits the data into target classes.
High Info Gain → Better at classifying
Low Info Gain → Less useful
It is calculated as:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) × Entropy(S_v)
where Entropy(S) = -p(Yes) × log2 p(Yes) - p(No) × log2 p(No)
Let's Calculate Information Gain for Outlook
Step 1: Entropy of the Whole Dataset
• Total = 14 → 9 Yes, 5 No
• Entropy(S) = -(9/14) × log2(9/14) - (5/14) × log2(5/14) ≈ 0.940
Step 2: Split by Outlook
Outlook = Sunny:
| Outlook | Play |
| ------- | ---- |
| Sunny | No |
| Sunny | No |
| Sunny | No |
| Sunny | Yes |
| Sunny | Yes |
• Total = 5 → 2 Yes, 3 No
• Entropy ≈ 0.971
Outlook = Overcast:
All 4 are Yes → Entropy = 0
Outlook = Rain:
| Outlook | Play |
| ------- | ---- |
| Rain | Yes |
| Rain | Yes |
| Rain | No |
| Rain | Yes |
| Rain | No |
• Total = 5 → 3 Yes, 2 No
• Entropy ≈ 0.971
Step 3: Weighted Entropy After the Split
• (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 ≈ 0.693
Final Result:
Information Gain for Outlook = 0.940 - 0.693 ≈ 0.247
This tells us that Outlook gives a moderate improvement in predicting Play. The decision tree
will prefer to split on features with higher information gain first.
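The whole calculation can be checked with a short script (the per-Outlook labels follow the splits listed above):

```python
import math
from collections import Counter

# Play labels grouped by Outlook, per the splits shown above
splits = {
    "Sunny": ["No", "No", "No", "Yes", "Yes"],
    "Overcast": ["Yes", "Yes", "Yes", "Yes"],
    "Rain": ["Yes", "Yes", "No", "Yes", "No"],
}

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

all_labels = [label for group in splits.values() for label in group]

# Parent entropy minus the size-weighted entropy of each branch
gain = entropy(all_labels) - sum(
    len(group) / len(all_labels) * entropy(group) for group in splits.values()
)
```

Running this reproduces the hand calculation: the parent entropy is about 0.940 and the gain for Outlook is about 0.247.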