Important Notes

1.

To discuss the major issues of data mining, it's helpful to use a simple, relatable
dataset. One of the most beginner-friendly yet real-world relevant datasets is the:

Titanic Survival Dataset

Goal:

Predict whether a passenger survived the Titanic disaster based on features like age, gender,
class, and ticket fare.

Sample Data:

PassengerId Name             Sex    Age Pclass Fare  Survived
1           Braund, Mr. Owen male   22  3      7.25  0
2           Cumings, Mrs.    female 38  1      71.28 1
3           Heikkinen, Miss  female 26  3      7.92  1
4           Allen, Mr.       male   35  1      53.10 0

Survived is the target variable (1 = Survived, 0 = Died).

Major Data Mining Issues (Explained with This Dataset)

1. Data Quality (Missing or Inaccurate Data)

• Some passengers have missing Age or Cabin values.

• Inaccurate data (e.g., Fare = 0 for some 1st class passengers)

Impact: Poor model accuracy; missing values need to be handled via imputation or deletion.
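
A minimal pandas sketch of both options (the file name titanic.csv is just a placeholder):

import pandas as pd

df = pd.read_csv("titanic.csv")  # placeholder path for the Titanic data

# Imputation: fill missing Age values with the median age
df["Age"] = df["Age"].fillna(df["Age"].median())

# Deletion: Cabin is mostly missing, so one option is to drop it entirely
df = df.drop(columns=["Cabin"])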

2. Noisy Data

• Names like “Braund, Mr. Owen Harris” are long and inconsistent.

• Titles (Mr., Mrs., Dr., etc.) could be extracted for useful info, but they’re buried in
text.

Impact: Models may misinterpret raw text; you need to clean and extract useful features.
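
One way to pull the titles out of the raw text, sketched with the df above:

# Extract the title between the comma and the period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
df["Title"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)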

3. Irrelevant or Redundant Features

• PassengerId doesn’t help predict survival — it’s just a unique identifier.

• Ticket and Name are hard to use unless cleaned and engineered.

Impact: Including irrelevant features can reduce model performance.
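
Dropping them is a one-liner:

# Remove columns that carry no predictive signal in raw form
df = df.drop(columns=["PassengerId", "Ticket"])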

4. Imbalanced Classes

• Around 38% survived and 62% did not — not heavily imbalanced, but can still affect performance depending on the model used.

Impact: Class imbalance may lead models to favor the majority class (predict "did not survive" more often).

5. Data Integration and Compatibility

If you combine Titanic data with other sources (e.g., weather conditions, ship deck layouts),
the formats and time references may not match.

Impact: Difficulties in merging datasets; possible introduction of inconsistencies.

6. Privacy and Ethics Concerns

Though the Titanic dataset is public, real-life data mining (e.g., on hospital or bank records)
can risk:

• Revealing personal identities

• Discriminatory decisions (e.g., predicting survival by gender or class)

Impact: Violating ethical guidelines or legal rules like GDPR.

7. Scalability

This dataset is small, but in real-time systems (like fraud detection), data mining models
need to process millions of records quickly.

Impact: Poorly designed algorithms can become too slow or crash at scale.

Summary Table:

Issue               | Example from Titanic Dataset               | Solution
Missing Data        | Missing Age or Cabin                       | Imputation, deletion
Noisy Data          | Raw Name field                             | Extract titles like Mr., Mrs., etc.
Irrelevant Features | PassengerId                                | Drop unused columns
Imbalanced Classes  | More non-survivors than survivors          | Use class weighting or resampling
Data Integration    | Merging Titanic data with external sources | Standardize formats and keys
Privacy & Ethics    | Predicting based on gender or class        | Be aware of bias, anonymize data
Scalability         | N/A in Titanic, but common in live systems | Use distributed systems

2. Elaborate on each step of the data preprocessing process using the House Price Prediction Dataset.

Use Case: House Price Prediction

We have data about houses. Each row is a house, and our goal is to predict how much it will
cost based on its size, number of bedrooms, location, and whether it has parking.

House_ID Area (sqft) Bedrooms Location Parking Price ($)


H001 1000 2 Suburb Yes 150000
H002 1500 3 City Center No 250000
H003 1200 2 Suburb Yes 180000
H004 NaN 3 Town Yes 220000
H005 1100 2 City Center NaN 200000

We need to prepare this data so a machine learning model can understand it and make
predictions.

Detailed Data Preprocessing Steps

1. Data Collection

This is the first step where you gather the data from different sources:

• Real estate websites

• Property agencies

• Excel sheets

• Databases

In our case: We already have a small dataset with 5 houses.
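
The later snippets assume this table lives in a pandas DataFrame named df (a name chosen here for illustration); a minimal sketch that builds it:

import numpy as np
import pandas as pd

# The five example houses from the table above (Area in sqft)
df = pd.DataFrame({
    "House_ID": ["H001", "H002", "H003", "H004", "H005"],
    "Area":     [1000, 1500, 1200, np.nan, 1100],
    "Bedrooms": [2, 3, 2, 3, 2],
    "Location": ["Suburb", "City Center", "Suburb", "Town", "City Center"],
    "Parking":  ["Yes", "No", "Yes", "Yes", np.nan],
    "Price":    [150000, 250000, 180000, 220000, 200000],
})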

2. Data Cleaning

This step involves checking for:

• Missing values

• Incorrect or inconsistent values

• Outliers (values too high or low to be realistic)

Example Issues in Our Dataset:

• House H004 has missing Area → We'll fill it with the average area of the other
houses.

• House H005 has missing Parking → We'll fill it with the most common value ("Yes" in our small dataset).

Fixes:

• Fill missing area:

o Average area of others: (1000 + 1500 + 1200 + 1100) / 4 = 1200

o So, replace NaN in H004 with 1200

• Fill missing Parking:

o “Yes” appears 3 times, “No” once → Replace missing with “Yes”
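
In pandas, the two fixes are one line each (using the df built in step 1):

# Fill the missing Area with the mean of the known areas (1200)
df["Area"] = df["Area"].fillna(df["Area"].mean())

# Fill the missing Parking with the most frequent value ("Yes")
df["Parking"] = df["Parking"].fillna(df["Parking"].mode()[0])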

Now the cleaned data looks like:

House_ID Area Bedrooms Location Parking Price

H001 1000 2 Suburb Yes 150000

H002 1500 3 City Center No 250000

H003 1200 2 Suburb Yes 180000

H004 1200 3 Town Yes 220000

H005 1100 2 City Center Yes 200000

3. Data Transformation

Now we convert text values (categorical data) into numerical values because models like
linear regression or decision trees work with numbers.

Convert Location (multi-class category) using One-Hot Encoding:

We create a separate column for each location:

Location_CityCenter Location_Suburb Location_Town

0 1 0

1 0 0

0 1 0

0 0 1

1 0 0

Convert Parking:
• Yes → 1

• No → 0
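
Both conversions are short in pandas; a sketch:

# One-hot encode Location: one 0/1 column per category
df = pd.get_dummies(df, columns=["Location"], prefix="Location")

# Map Parking to a binary column
df["Parking"] = df["Parking"].map({"Yes": 1, "No": 0})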

So the full transformed data:

Area Bedrooms Location_CityCenter Location_Suburb Location_Town Parking Price

1000 2 0 1 0 1 150000

1500 3 1 0 0 0 250000

1200 2 0 1 0 1 180000

1200 3 0 0 1 1 220000

1100 2 1 0 0 1 200000

4. Feature Engineering (optional but useful)

Here we create new features that might help the model.

Example:

• Price_per_sqft = Price / Area

Area Price Price_per_sqft

1000 150000 150

1500 250000 166.67

... ... ...

This gives more insight than just raw price or area.
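
As code, it is a single derived column:

# Price per square foot: price relative to size
df["Price_per_sqft"] = df["Price"] / df["Area"]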

5. Data Scaling

Some models need all numeric features to be on a similar scale (especially models like SVM,
KNN, or neural networks).

We scale Area, Price, and Price_per_sqft to a range of 0 to 1, or to have a mean of 0 and a standard deviation of 1.

Common tools: MinMaxScaler or StandardScaler from scikit-learn
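
A minimal sketch with MinMaxScaler, which rescales each column to the 0-to-1 range:

from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["Area", "Price", "Price_per_sqft"]

# fit_transform learns each column's min and max, then rescales it to [0, 1]
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])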

6. Data Splitting

To evaluate our model correctly, we split the data into:

• Training data (e.g., 80%): Used to train the model

• Test data (e.g., 20%): Used to check if the model performs well on unseen data

In Python:
from sklearn.model_selection import train_test_split

# Features = every column except the target; the target is Price
# (House_ID is just an identifier, so it is dropped too)
X = df.drop(columns=["Price", "House_ID"])
y = df["Price"]

# 80% of the rows train the model; 20% are held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Final Overview Table:

Step                | What Happens                             | Why It Matters
Data Collection     | Get house data                           | You need data to learn from
Data Cleaning       | Fix missing or incorrect values          | Clean data leads to accurate models
Data Transformation | Convert text to numbers                  | ML models need numbers to work
Feature Engineering | Add extra helpful information            | Can improve prediction accuracy
Data Scaling        | Normalize data ranges                    | Makes training stable and fair
Data Splitting      | Separate data into training and testing | Prevents overfitting and checks real performance

3. Find Information Gain using the Weather dataset with a Decision Tree Classifier.

Outlook  Temperature Humidity Windy Play

Sunny    Hot         High     False No
Sunny    Hot         High     True  No
Overcast Hot         High     False Yes
Rain     Mild        High     False Yes
Rain     Cool        Normal   False Yes
Rain     Cool        Normal   True  No
Overcast Cool        Normal   True  Yes
Sunny    Mild        High     False No
Sunny    Cool        Normal   False Yes
Rain     Mild        Normal   False Yes
Sunny    Mild        Normal   True  Yes
Overcast Mild        High     True  Yes
Overcast Hot         Normal   False Yes
Rain     Mild        High     True  No

(14 rows in total: 9 Yes and 5 No)
What is Information Gain?

Information Gain measures how well a feature splits the data into target classes.

High Info Gain → Better at classifying

Low Info Gain → Less useful

It is calculated as:

Gain(S, A) = Entropy(S) − Σ_v ( |S_v| / |S| ) × Entropy(S_v)

where S_v is the subset of rows in which feature A takes value v, and
Entropy(S) = −p(Yes) × log2 p(Yes) − p(No) × log2 p(No).

Let's Calculate Information Gain for Outlook

Step 1: Entropy of the Whole Dataset

The dataset has 14 rows: 9 Yes and 5 No.

Entropy(S) = −(9/14) × log2(9/14) − (5/14) × log2(5/14) ≈ 0.940

Step 2: Split by Outlook

Outlook = Sunny:

| Outlook | Play |

| ------- | ---- |

| Sunny | No |

| Sunny | No |

| Sunny | No |

| Sunny | Yes |

| Sunny | Yes |

• Total = 5 → 2 Yes, 3 No

• Entropy ≈ 0.971

Outlook = Overcast:

All 4 are Yes → Entropy = 0

Outlook = Rain:

| Outlook | Play |

| ------- | ---- |

| Rain | Yes |

| Rain | Yes |

| Rain | No |

| Rain | Yes |

| Rain | No |
• Total = 5 → 3 Yes, 2 No

• Entropy ≈ 0.971

Final Result:

Weighted entropy after the split = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 ≈ 0.694

Information Gain for Outlook = 0.940 − 0.694 ≈ 0.247

This tells us that Outlook gives a moderate improvement in predicting Play. The decision tree
will prefer to split on features with higher information gain first.
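
The whole calculation fits in a few lines of Python; this sketch recomputes the numbers above from the Play labels alone:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# Play labels grouped by Outlook (from the 14-row dataset above)
subsets = {
    "Sunny":    ["No", "No", "No", "Yes", "Yes"],
    "Overcast": ["Yes", "Yes", "Yes", "Yes"],
    "Rain":     ["Yes", "Yes", "No", "Yes", "No"],
}
all_labels = [label for group in subsets.values() for label in group]

# Gain(S, Outlook) = Entropy(S) - weighted average entropy of the subsets
gain = entropy(all_labels) - sum(
    len(group) / len(all_labels) * entropy(group) for group in subsets.values()
)
print(round(gain, 3))  # 0.247

In scikit-learn, DecisionTreeClassifier(criterion="entropy") uses this same quantity to rank candidate splits automatically.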
