1. Discuss the major issues of data mining using an example dataset.
To discuss the major issues of data mining, it's helpful to use a simple, relatable dataset. One of the most beginner-friendly yet real-world-relevant datasets is the:
Titanic Survival Dataset
Goal:
Predict whether a passenger survived the Titanic disaster based on features like age, gender,
class, and ticket fare.
Sample Data:
| PassengerId | Name | Sex | Age | Pclass | Fare | Survived |
| ----------- | ---- | --- | --- | ------ | ---- | -------- |
| 1 | Braund, Mr. Owen | male | 22 | 3 | 7.25 | 0 |
| 2 | Cumings, Mrs. | female | 38 | 1 | 71.28 | 1 |
| 3 | Heikkinen, Miss | female | 26 | 3 | 7.92 | 1 |
| 4 | Allen, Mr. | male | 35 | 1 | 53.10 | 0 |
Survived is the target variable (1 = Survived, 0 = Died).
Major Data Mining Issues (Explained with This Dataset)
1. Data Quality (Missing or Inaccurate Data)
• Some passengers have missing Age or Cabin values.
• Inaccurate data (e.g., Fare = 0 for some 1st class passengers)
Impact: Poor model accuracy; missing values need to be handled via imputation or deletion.
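As a rough sketch of handling missing values (assuming the data has been loaded into a pandas DataFrame; the mini-frame below is illustrative, not the real Titanic file):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mimicking the Titanic columns
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Impute the numeric Age column with the median of the observed values
df["Age"] = df["Age"].fillna(df["Age"].median())

# Cabin is mostly missing, so dropping the column is a common choice
df = df.drop(columns=["Cabin"])
```

Imputing with the median rather than the mean makes the fill value robust to a few extreme ages.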
2. Noisy Data
• Names like “Braund, Mr. Owen Harris” are long and inconsistent.
• Titles (Mr., Mrs., Dr., etc.) could be extracted for useful info, but they’re buried in
text.
Impact: Models may misinterpret raw text; you need to clean and extract useful features.
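One way to pull those buried titles out is a small regular expression over the Name column (a sketch; the sample names below are shortened versions of entries from the table above):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Capture the title: the text between the comma and the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
```

The extracted title can then be used as a clean categorical feature instead of the raw Name text.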
3. Irrelevant or Redundant Features
• PassengerId doesn’t help predict survival — it’s just a unique identifier.
• Ticket and Name are hard to use unless cleaned and engineered.
Impact: Including irrelevant features can reduce model performance.
4. Imbalanced Classes
• Around 38% survived and 62% did not — not heavily imbalanced, but can still affect
performance depending on the model used.
Impact: Class imbalance may lead models to favor the majority class (predict "did not
survive" more often).
5. Data Integration and Compatibility
If you combine Titanic data with other sources (e.g., weather conditions, ship deck layouts),
the formats and time references may not match.
Impact: Difficulties in merging datasets; possible introduction of inconsistencies.
6. Privacy and Ethics Concerns
Though the Titanic dataset is public, real-life data mining (e.g., on hospital or bank records)
can risk:
• Revealing personal identities
• Discriminatory decisions (e.g., predicting survival by gender or class)
Impact: Violating ethical guidelines or legal rules like GDPR.
7. Scalability
This dataset is small, but in real-time systems (like fraud detection), data mining models
need to process millions of records quickly.
Impact: Poorly designed algorithms can become too slow or crash at scale.
Summary Table:
| Issue | Example from Titanic Dataset | Solution |
| ----- | ---------------------------- | -------- |
| Missing Data | Missing Age or Cabin | Imputation, deletion |
| Noisy Data | Raw Name field | Extract titles like Mr., Mrs., etc. |
| Irrelevant Features | PassengerId | Drop unused columns |
| Imbalanced Classes | More non-survivors than survivors | Use class weighting or resampling |
| Data Integration | Merging Titanic data with external sources | Standardize formats and keys |
| Privacy & Ethics | Predicting based on gender or class | Be aware of bias, anonymize data |
| Scalability | N/A in Titanic, but common in live systems | Use distributed systems |
2. Elaborate on each step of the data preprocessing process using the House Price Prediction Dataset.
Use Case: House Price Prediction
We have data about houses. Each row is a house, and our goal is to predict how much it will
cost based on its size, number of bedrooms, location, and whether it has parking.
| House_ID | Area (sqft) | Bedrooms | Location | Parking | Price ($) |
| -------- | ----------- | -------- | -------- | ------- | --------- |
| H001 | 1000 | 2 | Suburb | Yes | 150000 |
| H002 | 1500 | 3 | City Center | No | 250000 |
| H003 | 1200 | 2 | Suburb | Yes | 180000 |
| H004 | NaN | 3 | Town | Yes | 220000 |
| H005 | 1100 | 2 | City Center | NaN | 200000 |
We need to prepare this data so a machine learning model can understand it and make
predictions.
Detailed Data Preprocessing Steps
1. Data Collection
This is the first step where you gather the data from different sources:
• Real estate websites
• Property agencies
• Excel sheets
• Databases
In our case: We already have a small dataset with 5 houses.
2. Data Cleaning
This step involves checking for:
• Missing values
• Incorrect or inconsistent values
• Outliers (values too high or low to be realistic)
Example Issues in Our Dataset:
• House H004 has missing Area → We'll fill it with the average area of the other
houses.
• House H005 has missing Parking → We'll fill it with the most common value ("Yes" in
our small dataset).
Fixes:
• Fill missing area:
o Average area of others: (1000 + 1500 + 1200 + 1100) / 4 = 1200
o So, replace NaN in H004 with 1200
• Fill missing Parking:
o “Yes” appears 3 times, “No” once → Replace missing with “Yes”
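In pandas, these two fixes can be sketched as follows (column names follow the table above; only the affected columns are shown):

```python
import pandas as pd
import numpy as np

# The five-house table from the example, affected columns only
houses = pd.DataFrame({
    "House_ID": ["H001", "H002", "H003", "H004", "H005"],
    "Area": [1000, 1500, 1200, np.nan, 1100],
    "Parking": ["Yes", "No", "Yes", "Yes", np.nan],
})

# Fill missing Area with the mean of the other houses (4800 / 4 = 1200)
houses["Area"] = houses["Area"].fillna(houses["Area"].mean())

# Fill missing Parking with the most common value ("Yes", 3 of 4)
houses["Parking"] = houses["Parking"].fillna(houses["Parking"].mode()[0])
```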
Now the cleaned data looks like:
| House_ID | Area | Bedrooms | Location | Parking | Price |
| -------- | ---- | -------- | -------- | ------- | ----- |
| H001 | 1000 | 2 | Suburb | Yes | 150000 |
| H002 | 1500 | 3 | City Center | No | 250000 |
| H003 | 1200 | 2 | Suburb | Yes | 180000 |
| H004 | 1200 | 3 | Town | Yes | 220000 |
| H005 | 1100 | 2 | City Center | Yes | 200000 |
3. Data Transformation
Now we convert text values (categorical data) into numerical values because models like
linear regression or decision trees work with numbers.
Convert Location (multi-class category) using One-Hot Encoding:
We create a separate column for each location:
| Location_CityCenter | Location_Suburb | Location_Town |
| ------------------- | --------------- | ------------- |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
Convert Parking:
• Yes → 1
• No → 0
So the full transformed data:
| Area | Bedrooms | Location_CityCenter | Location_Suburb | Location_Town | Parking | Price |
| ---- | -------- | ------------------- | --------------- | ------------- | ------- | ----- |
| 1000 | 2 | 0 | 1 | 0 | 1 | 150000 |
| 1500 | 3 | 1 | 0 | 0 | 0 | 250000 |
| 1200 | 2 | 0 | 1 | 0 | 1 | 180000 |
| 1200 | 3 | 0 | 0 | 1 | 1 | 220000 |
| 1100 | 2 | 1 | 0 | 0 | 1 | 200000 |
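With pandas, the same transformation can be sketched via get_dummies for the multi-class Location column and a simple map for the binary Parking column:

```python
import pandas as pd

# Categorical columns from the cleaned table
houses = pd.DataFrame({
    "Location": ["Suburb", "City Center", "Suburb", "Town", "City Center"],
    "Parking": ["Yes", "No", "Yes", "Yes", "Yes"],
})

# One-hot encode Location into one 0/1 column per distinct value
encoded = pd.get_dummies(houses, columns=["Location"], dtype=int)

# Map the binary Parking column to 1/0
encoded["Parking"] = encoded["Parking"].map({"Yes": 1, "No": 0})
```

get_dummies names the new columns Location_<value>, matching the table above.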
4. Feature Engineering (optional but useful)
Here we create new features that might help the model.
Example:
• Price_per_sqft = Price / Area
| Area | Price | Price_per_sqft |
| ---- | ----- | -------------- |
| 1000 | 150000 | 150 |
| 1500 | 250000 | 166.67 |
| ... | ... | ... |
This gives more insight than just raw price or area.
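As a sketch, the derived column is a single vectorized division in pandas:

```python
import pandas as pd

houses = pd.DataFrame({
    "Area": [1000, 1500],
    "Price": [150000, 250000],
})

# New feature: price per square foot
houses["Price_per_sqft"] = houses["Price"] / houses["Area"]
```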
5. Data Scaling
Some models need all numeric features to be on a similar scale (especially models like SVM,
KNN, or neural networks).
We scale Area, Price, and Price_per_sqft to a range of 0 to 1 or to have a mean of 0 and
standard deviation of 1.
Common tools: MinMaxScaler or StandardScaler from scikit-learn
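A minimal sketch of both options on the Area column (values from the cleaned table):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Area values from the cleaned table, as a single-feature column
area = np.array([[1000.0], [1500.0], [1200.0], [1200.0], [1100.0]])

# Scale to the [0, 1] range
minmax = MinMaxScaler().fit_transform(area)

# Or standardize to mean 0, standard deviation 1
standard = StandardScaler().fit_transform(area)
```

MinMaxScaler preserves the shape of the distribution; StandardScaler centers it, which some models (e.g., SVM) prefer.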
6. Data Splitting
To evaluate our model correctly, we split the data into:
• Training data (e.g., 80%): Used to train the model
• Test data (e.g., 20%): Used to check if the model performs well on unseen data
In Python:
from sklearn.model_selection import train_test_split

# Assuming the preprocessed table is in a DataFrame called df (illustrative name)
X = df.drop(columns=["Price"])  # feature columns
y = df["Price"]                 # target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Final Overview Table:
| Step | What Happens | Why It Matters |
| ---- | ------------ | -------------- |
| Data Collection | Get house data | You need data to learn from |
| Data Cleaning | Fix missing or incorrect values | Clean data leads to accurate models |
| Data Transformation | Convert text to numbers | ML models need numbers to work |
| Feature Engineering | Add extra helpful information | Can improve prediction accuracy |
| Data Scaling | Normalize data ranges | Makes training stable and fair |
| Data Splitting | Separate data into training and testing | Prevents overfitting and checks real performance |
3. Find Information Gain using the Weather dataset with a Decision Tree Classifier.
| Outlook | Temperature | Humidity | Windy | Play |
| -------- | ----------- | -------- | ----- | ---- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| Rain | Mild | High | False | Yes |
| Rain | Cool | Normal | False | Yes |
| Rain | Cool | Normal | True | No |
| Overcast | Cool | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Sunny | Cool | Normal | False | Yes |
| Rain | Mild | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Overcast | Mild | High | True | Yes |
| Overcast | Hot | Normal | False | Yes |
| Rain | Mild | High | True | No |
(Only the first four rows were given in the question; the remaining rows are the standard 14-row "play tennis" weather dataset, which the calculations below assume.)
What is Information Gain?
Information Gain measures how well a feature splits the data into target classes.
High Info Gain → Better at classifying
Low Info Gain → Less useful
It is calculated as:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) × Entropy(S_v)
where Entropy(S) = -p(Yes) × log2 p(Yes) - p(No) × log2 p(No)
Let's Calculate Information Gain for Outlook
Step 1: Entropy of the Whole Dataset
• Total = 14 → 9 Yes, 5 No
• Entropy(S) = -(9/14) × log2(9/14) - (5/14) × log2(5/14) ≈ 0.940
Step 2: Split by Outlook
Outlook = Sunny:
| Outlook | Play |
| ------- | ---- |
| Sunny | No |
| Sunny | No |
| Sunny | No |
| Sunny | Yes |
| Sunny | Yes |
• Total = 5 → 2 Yes, 3 No
• Entropy ≈ 0.971
Outlook = Overcast:
All 4 are Yes → Entropy = 0
Outlook = Rain:
| Outlook | Play |
| ------- | ---- |
| Rain | Yes |
| Rain | Yes |
| Rain | No |
| Rain | Yes |
| Rain | No |
• Total = 5 → 3 Yes, 2 No
• Entropy ≈ 0.971
Step 3: Weighted Entropy After the Split
• (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 ≈ 0.693
Final Result:
Information Gain for Outlook = 0.940 - 0.693 ≈ 0.247
This tells us that Outlook gives a moderate improvement in predicting Play. The decision tree
will prefer to split on features with higher information gain first.
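The whole calculation can be checked with a short script (the per-Outlook labels follow the splits listed above):

```python
import math
from collections import Counter

# Play labels grouped by Outlook, per the splits shown above
splits = {
    "Sunny": ["No", "No", "No", "Yes", "Yes"],
    "Overcast": ["Yes", "Yes", "Yes", "Yes"],
    "Rain": ["Yes", "Yes", "No", "Yes", "No"],
}

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

all_labels = [label for group in splits.values() for label in group]

# Parent entropy minus the size-weighted entropy of each branch
gain = entropy(all_labels) - sum(
    len(group) / len(all_labels) * entropy(group) for group in splits.values()
)
```

Running this reproduces the hand calculation: the parent entropy is about 0.940 and the gain for Outlook is about 0.247.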