1 – Load Dataset
We start by loading the UCI Bank Marketing dataset. The head() method gives us a quick look at the first few rows, while isnull().sum() counts missing values for each column. This initial inspection tells us how complete the data is and whether imputation will be needed.
import pandas as pd

# The UCI bank dataset uses ';' as its field separator
df = pd.read_csv("bank.csv", sep=";")
df.head()
   age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no
1   33     services  married  secondary      no     4789     yes  yes
2   35   management   single   tertiary      no     1350     yes   no
3   30   management  married   tertiary      no     1476     yes  yes
4   59  blue-collar  married  secondary      no        0     yes   no

    contact  day month  duration  campaign  pdays  previous poutcome   y
0  cellular   19   oct        79         1     -1         0  unknown  no
1  cellular   11   may       220         1    339         4  failure  no
2  cellular   16   apr       185         1    330         1  failure  no
3   unknown    3   jun       199         4     -1         0  unknown  no
4   unknown    5   may       226         1     -1         0  unknown  no
df.isnull().sum()
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
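Beyond the null counts, df.info() reports row counts and column dtypes in one call; a quick complementary check on the same DataFrame:

# Structural overview: number of rows/columns and the dtype of each column
print(df.shape)
df.info()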
df.columns
Index(['age', 'default', 'balance', 'housing', 'loan', 'day', 'duration',
       'campaign', 'pdays', 'previous', 'y', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student', 'job_technician',
       'job_unemployed', 'job_unknown', 'marital_married', 'marital_single',
       'education_secondary', 'education_tertiary', 'education_unknown',
       'contact_telephone', 'contact_unknown', 'month_aug', 'month_dec',
       'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar',
       'month_may', 'month_nov', 'month_oct', 'month_sep', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown', 'y_num'],
      dtype='object')
2 – Handle Missing Values
Missing values can bias our analysis and break machine learning models.
Numeric columns: we fill with the median, which is robust to outliers (a short sketch at the end of this step illustrates why).
Categorical columns: we fill with the mode (most frequent value).
After this step, all missing values should be eliminated.
import numpy as np

# Fill numeric with median
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

# Fill categorical with mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
df.isnull().sum()
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
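To see why the median is the safer fill here, a minimal sketch with made-up numbers (not taken from the dataset) comparing mean and median on a series containing one extreme value:

import pandas as pd
import numpy as np

# Hypothetical balances: one extreme value plus a missing entry
s = pd.Series([100, 120, 110, 105, np.nan, 50000])
print(s.mean())    # 10087.0 -- dragged upward by the outlier
print(s.median())  # 110.0 -- unaffected by the outlier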
3 – Boxplots Before Outlier Treatment
Boxplots are used to visually detect outliers.
Points lying far outside the "whiskers" are likely extreme values.
Outliers can distort the mean and variance, which impacts models such as linear regression; a quick numeric illustration follows the plotting code.
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

fig, axes = plt.subplots(nrows=len(numeric_cols)//3 + 1, ncols=3, figsize=(15, 8))
axes = axes.flatten()
for i, col in enumerate(numeric_cols):
    sns.boxplot(x=df[col], ax=axes[i])
    axes[i].set_title(col)
# Hide any leftover empty subplots in the grid
for ax in axes[len(numeric_cols):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
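As a quick numeric illustration of the distortion (toy values, not from the dataset), note how a single extreme point shifts both the mean and the variance:

import numpy as np

x = np.array([10, 11, 12, 13, 14])
x_out = np.append(x, 1000)  # the same data plus one extreme value

print(x.mean(), x.var())          # 12.0 2.0
print(x_out.mean(), x_out.var())  # ~176.7 ~135577 -- both explode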
4 – Treat Outliers
We apply the IQR method to detect outliers:
Calculate Q1 (25th percentile) and Q3 (75th percentile), and let IQR = Q3 - Q1.
Any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.
Outliers are replaced with the median to minimize skewness. A worked example on toy values follows the code.
for col in numeric_cols:
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    median_val = df[col].median()
    df[col] = np.where((df[col] < lower) | (df[col] > upper), median_val, df[col])
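A worked example of the fences on toy values (quartiles computed with NumPy's default linear interpolation):

import numpy as np

vals = np.array([1, 2, 3, 4, 100])
q1, q3 = np.percentile(vals, [25, 75])          # 2.0 and 4.0
iqr = q3 - q1                                   # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.0 and 7.0
print((vals < lower) | (vals > upper))          # only 100 is flagged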
5 – Encode Categorical Variables
Machine learning models require numerical input:
Binary categorical features: label encoding (0/1).
Multi-class features: one-hot encoding, which creates dummy variables.
drop_first=True prevents multicollinearity by removing one redundant category; a small demo follows the code.
from sklearn.preprocessing import LabelEncoder

categorical_cols = df.select_dtypes(include='object').columns.tolist()
label_enc = LabelEncoder()
for col in categorical_cols:
    if df[col].nunique() == 2:
        df[col] = label_enc.fit_transform(df[col])
    else:
        df = pd.get_dummies(df, columns=[col], drop_first=True)
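A minimal sketch of what drop_first=True changes, using a hypothetical three-category column:

import pandas as pd

toy = pd.DataFrame({'marital': ['married', 'single', 'divorced']})

# Without drop_first: three dummies that always sum to 1 (redundant)
print(pd.get_dummies(toy, columns=['marital']))

# With drop_first: 'divorced' becomes the implicit baseline category
print(pd.get_dummies(toy, columns=['marital'], drop_first=True))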
6 – Plot Distributions
Histograms with KDE curves show:
Data spread (range, central tendency)
Skewness (positive or negative)
Whether transformations may be needed to normalize the data; a skewness check follows the plotting code.
fig, axes = plt.subplots(nrows=len(numeric_cols)//3 + 1, ncols=3, figsize=(15, 8))
axes = axes.flatten()
for i, col in enumerate(numeric_cols):
    sns.histplot(df[col], kde=True, ax=axes[i])
    axes[i].set_title(f"{col} Distribution")
# Hide any leftover empty subplots in the grid
for ax in axes[len(numeric_cols):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
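To quantify skewness rather than judging it by eye, a sketch using pandas' skew() on the balance column; since balances can be negative, the values are clipped at zero before applying log1p, one common transform for right-skewed data:

import numpy as np

print(df['balance'].skew())  # > 0 indicates a right-skewed tail

# log1p requires values > -1, so clip negatives to 0 for this check
print(np.log1p(df['balance'].clip(lower=0)).skew())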
# 1. Target Variable Distribution
import seaborn as sns
import matplotlib.pyplot as plt

# Assign x to hue and set legend=False to avoid seaborn's
# palette-without-hue deprecation warning
sns.countplot(x='y', hue='y', data=df, palette='viridis', legend=False)
plt.title('Target Variable Distribution (y)')
plt.xlabel('Subscribed to Term Deposit')
plt.ylabel('Count')
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Age distribution of subscribers vs non-subscribers
plt.figure(figsize=(8, 5))
sns.kdeplot(data=df, x='age', hue='y', fill=True)
plt.title('Age Distribution by Subscription Status')
plt.show()

plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='age', hue='y', kde=True, palette='coolwarm', element='step')
plt.title('Age Distribution by Subscription Status')
plt.show()
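To put numbers on the age plots, a hedged sketch of the subscription rate per age band (the bin edges are arbitrary choices; y is assumed to be 0/1-encoded after Step 5, so its mean is the subscription rate):

# Subscription rate per age band; y assumed 0/1-encoded (Step 5)
age_bands = pd.cut(df['age'], bins=[18, 30, 40, 50, 60, 100])
print(df.groupby(age_bands, observed=True)['y'].mean())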
7 – Correlation Analysis
Correlation measures how two variables move together:
Values near +1 → strong positive correlation.
Values near -1 → strong negative correlation.
It helps detect redundant features, which may be dropped; a sketch for flagging such pairs follows the heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

# Select a random subset of 8 numeric columns
num_cols = df.select_dtypes(include='number').columns
sample_cols = num_cols.to_series().sample(8, random_state=42)

# Compute the correlation matrix for these columns only
corr_matrix = df[sample_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix (Random 8 Features)")
plt.show()
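Building on the heatmap, a minimal sketch for flagging highly correlated feature pairs across all numeric columns; the 0.9 cutoff is an arbitrary choice:

# Absolute correlations across all numeric columns
corr = df[num_cols].corr().abs()

# Keep only the upper triangle to skip self- and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.9).any()]
print("Candidates to drop:", redundant)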
8 – Feature Selection
Variance Threshold: removes features with no variation (constant values).
SelectKBest: selects the top k features using the ANOVA F-test for classification.
This step simplifies the model and reduces overfitting; a sketch for inspecting the F-scores follows the output.
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

target_col = 'y' if 'y' in df.columns else df.columns[-1]
X = df.drop(columns=[target_col])
y = df[target_col]

# Drop zero-variance (constant) features
vt = VarianceThreshold(threshold=0.0)
X_vt = vt.fit_transform(X)
selected_vt = X.columns[vt.get_support()]

# Keep the top features by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=min(5, X_vt.shape[1]))
X_kbest = selector.fit_transform(X_vt, y)
selected_kbest = selected_vt[selector.get_support()]

print("After Variance Threshold:", selected_vt.tolist())
print("Top features by SelectKBest:", selected_kbest.tolist())
After Variance Threshold: ['age', 'default', 'balance', 'housing',
'loan', 'day', 'duration', 'campaign', 'job_blue-collar',
'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
'job_self-employed', 'job_services', 'job_student', 'job_technician',
'job_unemployed', 'job_unknown', 'marital_married', 'marital_single',
'education_secondary', 'education_tertiary', 'education_unknown',
'contact_telephone', 'contact_unknown', 'month_aug', 'month_dec',
'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar',
'month_may', 'month_nov', 'month_oct', 'month_sep', 'poutcome_other',
'poutcome_success', 'poutcome_unknown']
Top features by SelectKBest: ['duration', 'contact_unknown',
'month_oct', 'poutcome_success', 'poutcome_unknown']
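To see why those five features won, a short sketch ranking the ANOVA F-scores that SelectKBest computed (reusing selector and selected_vt from the cell above):

# Rank the surviving features by their ANOVA F-score
scores = pd.Series(selector.scores_, index=selected_vt)
print(scores.sort_values(ascending=False).head(10))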