12/4/23, 7:28 PM Data Cleaning Project 1st Draft.ipynb - Colaboratory
import pandas as pd
Read the CSV File
df = pd.read_csv('/content/Housing_Data_Set (1).csv')
df
output price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefar
0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 y
1 12250000.0 8960 4.0 4 4.0 yes no no no yes 3
2 12250000.0 9960 3.0 2 2.0 yes no yes no no 2 y
3 12215000.0 7500 4.0 2 2.0 yes no yes no yes 3 y
4 11410000.0 7420 4.0 1 2.0 yes yes yes no yes 2
... ... ... ... ... ... ... ... ... ... ... ...
df.head()
price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea
0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 yes
1 12250000.0 8960 4.0 4 4.0 yes no no no yes 3 no
Getting information about the Dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 538 non-null float64
1 area 545 non-null int64
2 bedrooms 544 non-null float64
3 bathrooms 545 non-null int64
4 stories 543 non-null float64
5 mainroad 545 non-null object
6 guestroom 543 non-null object
7 basement 545 non-null object
8 hotwaterheating 537 non-null object
9 airconditioning 544 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
13 Date 542 non-null object
dtypes: float64(3), int64(3), object(8)
memory usage: 59.7+ KB
Describing the Dataset
df.describe()
https://colab.research.google.com/drive/1HU684UqeRwNdBFPV2X1bDAnDeztx5GVD#scrollTo=W2c02AzKM3kZ&printMode=true 1/7
price area bedrooms bathrooms stories parking
count 5.380000e+02 545.000000 544.000000 545.000000 543.000000 545.000000
mean 4.779255e+06 5150.541284 2.966912 1.286239 2.069982 0.693578
std 1.876768e+06 2170.141023 0.737579 0.502470 4.996187 0.861586
min 1.750000e+06 1650.000000 1.000000 1.000000 1.000000 0.000000
25% 3.438750e+06 3600.000000 2.000000 1.000000 1.000000 0.000000
50% 4.340000e+06 4600.000000 3.000000 1.000000 2.000000 0.000000
75% 5.796000e+06 6360.000000 3.000000 2.000000 2.000000 1.000000
max 1.330000e+07 16200.000000 6.000000 4.000000 110.000000 3.000000
Finding the Null Values in the Dataset
df.isnull().sum()
price 7
area 0
bedrooms 1
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
Date 3
dtype: int64
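The null counts above can be read alongside the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 400.0],
    "guestroom": ["yes", "no", None, "no"],
    "area": [10, 20, 30, 40],
})

null_counts = toy.isnull().sum()         # missing values per column
null_share = toy.isnull().mean() * 100   # percentage of rows missing per column

# Show only the columns that actually contain missing values.
summary = pd.DataFrame({"missing": null_counts, "percent": null_share})
print(summary[summary["missing"] > 0])
```

Seeing the percentage next to the count may help when deciding whether a column is worth filling or dropping.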
Dropping the unnecessary columns in the Dataset
df = df.drop(columns = 'Date')
df.head()
price area bedrooms bathrooms stories mainroad guestroom basement hotw
0 13300000.0 7420 4.0 2 3.0 yes no no
1 12250000.0 8960 4.0 4 4.0 yes no no
2 12250000.0 9960 3.0 2 2.0 yes no yes
3 12215000.0 7500 4.0 2 2.0 yes no yes
Fill the null values in the numeric columns with the column mean
df['price'].fillna(df["price"].mean(), inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 1
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
df['price'].fillna(df["price"].mean(), inplace=True)
df
price area bedrooms bathrooms stories mainroad guestroom basement ho
0 13300000.0 7420 4.0 2 3.0 yes no no
1 12250000.0 8960 4.0 4 4.0 yes no no
2 12250000.0 9960 3.0 2 2.0 yes no yes
3 12215000.0 7500 4.0 2 2.0 yes no yes
4 11410000.0 7420 4.0 1 2.0 yes yes yes
... ... ... ... ... ... ... ... ...
540 1820000.0 3000 2.0 1 1.0 yes no yes
541 1767150.0 2400 3.0 1 1.0 no no no
542 1750000.0 3620 2.0 1 1.0 yes no no
543 1750000.0 2910 3.0 1 1.0 no no no
544 1750000.0 3850 3.0 1 2.0 yes no no
df['bedrooms'].fillna(df["bedrooms"].mean(), inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
df['stories'].fillna(df["stories"].mean(), inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
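The three mean fills above could also be written as one assignment, which avoids calling `fillna(..., inplace=True)` on a selected column (newer pandas versions warn about that pattern). This is a sketch on a hypothetical toy frame, not the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the notebook, the values do not.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0],
    "bedrooms": [2.0, 4.0, np.nan],
    "stories": [1.0, np.nan, 3.0],
})

numeric_cols = ["price", "bedrooms", "stories"]

# DataFrame.mean() returns one mean per column; fillna matches them by column name.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy)
```

Since the `describe()` output reports a `stories` maximum of 110, the mean of that column may be pulled upward by extreme values; swapping `.mean()` for `.median()` in the sketch is one possible alternative.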
Fill the null values in the categorical columns with the column mode
df['guestroom'].fillna(df['guestroom'].mode().iloc[0], inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
df['hotwaterheating'].fillna(df['hotwaterheating'].mode().iloc[0], inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
df['airconditioning'].fillna(df['airconditioning'].mode().iloc[0], inplace=True)
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
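The mode fills for `guestroom`, `hotwaterheating` and `airconditioning` follow the same pattern, so they could be expressed as a loop. A minimal sketch, assuming a hypothetical toy frame with the same column names:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "guestroom": ["no", "no", None, "yes"],
    "hotwaterheating": ["no", None, "no", "no"],
    "airconditioning": ["yes", "yes", "no", None],
})

for col in ["guestroom", "hotwaterheating", "airconditioning"]:
    # mode() returns a Series (there can be ties); take the first entry.
    most_common = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```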
Sorting the Dataset according to the "furnishingstatus" column
df = df.sort_values('furnishingstatus')
df.head()
price area bedrooms bathrooms stories mainroad guestroom basement ho
0 13300000.0 7420 4.0 2 3.0 yes no no
365 3703000.0 5450 2.0 1 1.0 yes no no
124 5950000.0 6525 3.0 2 4.0 yes no no
362 3710000.0 4050 2.0 1 1.0 yes no no
Resetting the index of the Dataset
df = df.reset_index()
df
index price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking
0 0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2
1 365 3703000.0 5450 2.0 1 1.0 yes no no no no 0
2 124 5950000.0 6525 3.0 2 4.0 yes no no no no 1
3 362 3710000.0 4050 2.0 1 1.0 yes no no no no 0
4 128 5873000.0 5500 3.0 1 3.0 yes yes no no yes 1
... ... ... ... ... ... ... ... ... ... ... ... ...
540 405 3465000.0 3060 3.0 1 1.0 yes no no no no 0
541 406 3465000.0 5320 2.0 1 1.0 yes no no no no 1
542 408 3430000.0 4000 2.0 1 1.0 yes no no no no 0
543 410 3430000.0 3850 3.0 1 1.0 yes no no no no 0
544 544 1750000.0 3850 3.0 1 2.0 yes no no no no 0
df = df.drop(columns = 'index')
df.head()
price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea
0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 yes
1 3703000.0 5450 2.0 1 1.0 yes no no no no 0 no
2 5950000.0 6525 3.0 2 4.0 yes no no no no 1 no
3 3710000.0 4050 2.0 1 1.0 yes no no no no 0 no
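The sort, reset and drop steps above can be combined: `reset_index(drop=True)` discards the old index instead of adding it as an `index` column that then has to be dropped. A short sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "price": [300, 100, 200],
    "furnishingstatus": ["unfurnished", "furnished", "semi-furnished"],
})

# Sort by the category, then renumber rows 0..n-1 without keeping the old index.
toy = toy.sort_values("furnishingstatus").reset_index(drop=True)

print(toy)
```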
Checking if the Dataset is clean or not
df.isnull().sum()
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
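Instead of reading the per-column counts by eye, the same check could be reduced to a single boolean. A minimal sketch, using a hypothetical toy frame in place of the cleaned data:

```python
import pandas as pd

# Hypothetical toy frame with no missing values.
toy = pd.DataFrame({
    "price": [1750000.0, 3430000.0],
    "mainroad": ["yes", "no"],
})

# True if any cell in the frame is null.
has_missing = bool(toy.isnull().values.any())

# Fail loudly if the cleaning steps left gaps behind.
assert not has_missing, "DataFrame still contains null values"
print("No missing values found")
```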
EDA (Exploratory Data Analysis) of the Dataset
import matplotlib.pyplot as plt
plt.scatter(df["price"], df["area"])
<matplotlib.collections.PathCollection at 0x7a4a6102f010>
import seaborn as sns
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7a4a6109f880>
Computing the correlation between price and area and plotting it as a heatmap
correlation_matrix = df[['price', 'area']].corr()
sns.set(style="darkgrid")
sns.heatmap(correlation_matrix, annot=True, cmap='magma', fmt=".2f", linewidths=.5)
<Axes: >
sns.lmplot(x='price', y='area', data=df)
<seaborn.axisgrid.FacetGrid at 0x7a4a5c071e40>
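The correlation above covers only `price` and `area`. If the other numeric columns are of interest too, one option is to select them by dtype before calling `corr()`. This is a sketch on a hypothetical toy frame; the resulting matrix could be passed to `sns.heatmap` in the same way as `correlation_matrix`:

```python
import pandas as pd

# Hypothetical toy frame; values are made up so that price is exactly 10 * area.
toy = pd.DataFrame({
    "price": [100.0, 200.0, 300.0, 400.0],
    "area": [10, 20, 30, 40],
    "mainroad": ["yes", "no", "yes", "yes"],
})

# Keep only numeric columns so the yes/no text columns are excluded from corr().
corr = toy.select_dtypes(include="number").corr()

print(corr)
```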