9/6/22, 11:12 AM Dealing with Missing Data - Jupyter Notebook
Dealing with Missing Data
Real-world data often has many missing values. If you want your model to be
unbiased and accurate, you cannot ignore the missing values in your data.
Handling missing values is one of the most common problems in data cleansing
or pre-processing. The purpose of this article is to explore techniques for
handling missing data efficiently.
What is Missing Data?
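Missing data means the absence of observations in one or more columns. It often appears as placeholder values such
as “NA”, “NaN”, “NULL”, “None”, “Not Applicable”, or even “0”. Pandas represents a missing value as NaN (or None),
so placeholders it does not recognise, such as “Not Applicable” or a stand-in 0, have to be converted before they
are counted as missing.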
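The point about placeholders can be sketched as follows. This is a minimal illustration, not part of the original notebook: the CSV text and its column names are made up, and `na_values` is used to tell `read_csv` to treat “Not Applicable” as missing in addition to the markers it already recognises.

```python
import io
import pandas as pd

# Hypothetical CSV text (not the Titanic file); "Not Applicable" is a
# placeholder that read_csv does not treat as missing unless told to.
csv_text = """name,age,city
Ann,34,NA
Bob,Not Applicable,Pune
Cy,,NULL
"""

# na_values adds extra strings to the markers that are parsed as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=["Not Applicable"])

# Count the missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

A numeric stand-in such as 0 is a valid number, so it should only be converted to NaN when you know it was used as a placeholder in that column.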
Why Does a Dataset Have Missing Values?
The cause can be data corruption, failure to record data, lack of information, incomplete results, a person
intentionally not providing the data, or a system or equipment failure. There can be any number of reasons for
missing values in your dataset.
Why Handle Missing Values?
One of the biggest impacts of missing data is that it can bias the results of machine learning models or reduce
their accuracy. So, it is very important to handle missing values.
How to Check for Missing Data?
The first step in handling missing values is to look at the data carefully and find all the missing values. To
check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both return boolean
values: isnull() is True where a value is missing (“NaN”), and notnull() is True where a value is present.
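As a small sketch of the two functions, here is what they return on a three-value Series. The values are made up for illustration and are not read from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([34.5, np.nan, 62.0], name="Age")

# isnull() marks missing entries with True.
is_missing = ages.isnull()

# notnull() is the element-wise opposite.
is_present = ages.notnull()

print(is_missing.tolist())
print(is_present.tolist())
```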
In [2]:
import pandas as pd
import numpy as np
localhost:8888/notebooks/Dealing with Missing Data.ipynb 1/9
In [5]:
data=pd.read_csv("Datasets/titanic.csv")
data.head()
Out[5]:
   PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     ...
0  892          0         3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   ...
1  893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   ...
2  894          0         2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   ...
3  895          0         3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   ...
4  896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  ...
In [7]:
# Checking NULL Value in dataset
data.isnull().sum()
Out[7]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
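Raw counts are easier to judge as a share of the rows. The following sketch shows one way to get the percentage of missing values per column. It uses a small made-up frame rather than the Titanic file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
    "Fare": [7.8, 7.0, 9.7, 8.7],
})

# isnull() gives booleans; their mean is the fraction of missing values.
missing_pct = toy.isnull().mean() * 100

# Show the columns with the most missing data first.
print(missing_pct.sort_values(ascending=False))
```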
1. Deleting Rows
This method is commonly used to handle null values. Here, we either delete a row if it has a null value for a
particular feature, or delete a column if more than 70-75% of its values are missing. This method is advised only
when there are enough samples in the data set. One has to make sure that deleting the data does not introduce
bias. Removing data also means losing information, which may hurt the results when predicting the output.
In [8]:
data.dropna(inplace=True)
In [9]:
data.isnull().sum()
Out[9]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
Pros:
* When enough complete samples remain, training only on complete rows can give a robust and accurate model
* Deleting a row or a column that carries little specific information costs little, since it does not have a high weightage
Cons:
* Loss of information and data
* Works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset
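The rule described in this section (drop a column when most of it is missing, then drop the remaining incomplete rows) can be sketched as below. The 0.70 threshold and the toy frame are assumptions chosen for illustration. Dropping sparse columns first keeps more rows than calling dropna() on everything.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; Cabin is missing in 3 of 4 rows.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
    "Fare": [7.8, 7.0, np.nan, 8.7],
})

threshold = 0.70  # assumed cut-off, taken from the 70-75% guideline

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Drop the columns whose missing fraction exceeds the threshold.
cols_to_drop = missing_fraction[missing_fraction > threshold].index
reduced = toy.drop(columns=cols_to_drop)

# Then drop the rows that still contain a missing value.
cleaned = reduced.dropna()

print(list(cols_to_drop))
print(cleaned.shape, "versus", toy.dropna().shape)
```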
2. Replacing With Mean/Median/Mode
This strategy can be applied to a feature which has numeric data, like the age of a person or the ticket fare. We
can calculate the mean, median or mode of the feature and replace the missing values with it. This is an
approximation, and it can distort the distribution of the feature (filling with the mean, for example, shrinks its
variance). But it avoids the loss of data, which often yields better results than removing rows and columns.
Replacing with one of these three statistics is a statistical approach to handling missing values. If the
statistic is computed on the full data set before it is split, information from the test rows leaks into training,
so it is safer to compute it on the training data only. Another way is to approximate a missing value from its
neighbouring values (interpolation). This works better if the data is roughly linear.
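To make the leakage point concrete, here is a minimal sketch in which the mean is computed on the training rows only and then reused for the test rows. The split and the values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 30.0, np.nan, 90.0]})

# Assumed split: first four rows for training, last two for testing.
train = ages.iloc[:4].copy()
test = ages.iloc[4:].copy()

# The statistic comes from the training rows only.
train_mean = train["Age"].mean()

# The same training mean fills the gaps in both parts.
train["Age"] = train["Age"].fillna(train_mean)
test["Age"] = test["Age"].fillna(train_mean)

print(train_mean)
print(test["Age"].tolist())
```

Had the mean been taken over all six rows, the test value 90.0 would have influenced the value used to fill the training data.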
In [11]:
data['Age'].isnull().sum()
Out[11]:
0
In [12]:
data2=pd.read_csv("Datasets/titanic.csv")
In [13]:
data2['Age'].isnull().sum()
Out[13]:
86
In [14]:
age_mean=data2['Age'].mean()
In [19]:
age_mean
Out[19]:
30.272590361445783
In [16]:
data2['Age'].head(20)
Out[16]:
0 34.5
1 47.0
2 62.0
3 27.0
4 22.0
5 14.0
6 30.0
7 26.0
8 18.0
9 21.0
10 NaN
11 46.0
12 23.0
13 63.0
14 47.0
15 24.0
16 35.0
17 21.0
18 27.0
19 45.0
Name: Age, dtype: float64
In [20]:
# Fill missing ages with the mean; this returns a new Series and leaves data2 unchanged
data2["Age"].fillna(age_mean).head(20)
Out[20]:
0 34.50000
1 47.00000
2 62.00000
3 27.00000
4 22.00000
5 14.00000
6 30.00000
7 26.00000
8 18.00000
9 21.00000
10 30.27259
11 46.00000
12 23.00000
13 63.00000
14 47.00000
15 24.00000
16 35.00000
17 21.00000
18 27.00000
19 45.00000
Name: Age, dtype: float64
To replace the missing values with the median or the mode instead, we can calculate them as follows:
In [21]:
age_median = data2["Age"].median()
# mode() returns a Series (there can be ties), so take the first value
age_mode = data2["Age"].mode()[0]
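The following sketch applies the median and the mode to a small made-up Series, and also shows linear interpolation as one way of approximating a missing value from its neighbours. The values are illustrative, and `interpolate()` with its default linear method is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, 22.0, np.nan, 30.0, 50.0], name="Age")

# Median and mode ignore missing values by default.
age_median = ages.median()
age_mode = ages.mode()[0]  # mode() returns a Series; take the first value

filled_median = ages.fillna(age_median)
filled_mode = ages.fillna(age_mode)

# Linear interpolation fills the gap from the neighbouring values.
interpolated = ages.interpolate()

print(age_median, age_mode, interpolated[2])
```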
3. Assigning a Unique Category
A categorical feature has a definite number of possibilities, such as gender, for example. Since it has a
definite number of classes, we can assign another class for the missing values. Here, the feature Cabin has
missing values (327 in the count above; Embarked has none in this file) which can be replaced with a new category,
say, U for ‘unknown’. The cell below uses the label ‘New Cabin’ for the same purpose. This strategy adds
information to the dataset, which changes its variance. Since the feature is categorical, we need to apply
one-hot encoding to convert it to a numeric form that the algorithm can understand. Let us look at how the
replacement can be done in Python:
In [29]:
data2.head(40)
Out[29]:
    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            ...
0   892          0         3       Kelly, Mr. James                                   male    34.5  0      0      330911            ...
1   893          1         3       Wilkes, Mrs. James (Ellen Needs)                   female  47.0  1      0      363272            ...
2   894          0         2       Myles, Mr. Thomas Francis                          male    62.0  0      0      240276            ...
3   895          0         3       Wirz, Mr. Albert                                   male    27.0  0      0      315154            ...
4   896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)       female  22.0  1      1      3101298           ...
5   897          0         3       Svensson, Mr. Johan Cervin                         male    14.0  0      0      7538              ...
6   898          1         3       Connolly, Miss. Kate                               female  30.0  0      0      330972            ...
7   899          0         2       Caldwell, Mr. Albert Francis                       male    26.0  1      1      248738            ...
8   900          1         3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)          female  18.0  0      0      2657              ...
9   901          0         3       Davies, Mr. John Samuel                            male    21.0  2      0      A/4 48871         ...
10  902          0         3       Ilieff, Mr. Ylio                                   male    NaN   0      0      349220            ...
11  903          0         1       Jones, Mr. Charles Cresson                         male    46.0  0      0      694               ...
12  904          1         1       Snyder, Mrs. John Pillsbury (Nelle Stevenson)      female  23.0  1      0      21228             ...
13  905          0         2       Howard, Mr. Benjamin                               male    63.0  1      0      24065             ...
14  906          1         1       Chaffee, Mrs. Herbert Fuller (Carrie Constance...  female  47.0  1      0      W.E.P. 5734       ...
15  907          1         2       del Carlo, Mrs. Sebastiano (Argenia Genovesi)      female  24.0  1      0      SC/PARIS 2167     ...
16  908          0         2       Keane, Mr. Daniel                                  male    35.0  0      0      233734            ...
17  909          0         3       Assaf, Mr. Gerios                                  male    21.0  0      0      2692              ...
18  910          1         3       Ilmakangas, Miss. Ida Livija                       female  27.0  1      0      STON/O2. 3101270  ...
19  911          1         3       Assaf Khalil, Mrs. Mariana (Miriam")"              female  45.0  0      0      2696              ...
20  912          0         1       Rothschild, Mr. Martin                             male    55.0  1      0      PC 17603          ...
21  913          0         3       Olsen, Master. Artur Karl                          male    9.0   0      1      C 17368           ...
22  914          1         1       Flegenheim, Mrs. Alfred (Antoinette)               female  NaN   0      0      PC 17598          ...
23  915          0         1       Williams, Mr. Richard Norris II                    male    21.0  0      1      PC 17597          ...
24  916          1         1       Ryerson, Mrs. Arthur Larned (Emily Maria Borie)    female  48.0  1      3      PC 17608          ...
25  917          0         3       Robins, Mr. Alexander A                            male    50.0  1      0      A/5. 3337         ...
26  918          1         1       Ostby, Miss. Helene Ragnhild                       female  22.0  0      1      113509            ...
27  919          0         3       Daher, Mr. Shedid                                  male    22.5  0      0      2698              ...
28  920          0         1       Brady, Mr. John Bertram                            male    41.0  0      0      113054            ...
29  921          0         3       Samaan, Mr. Elias                                  male    NaN   2      0      2662              ...
30  922          0         2       Louch, Mr. Charles Alexander                       male    50.0  1      0      SC/AH 3085        ...
31  923          0         2       Jefferys, Mr. Clifford Thomas                      male    24.0  2      0      C.A. 31029        ...
32  924          1         3       Dean, Mrs. Bertram (Eva Georgetta Light)           female  33.0  1      2      C.A. 2315         ...
33  925          1         3       Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"  female  NaN   1      2      W./C. 6607        ...
34  926          0         1       Mock, Mr. Philipp Edmund                           male    30.0  1      0      13236             ...
35  927          0         3       Katavelas, Mr. Vassilios (Catavelas Vassilios")"   male    18.5  0      0      2682              ...
36  928          1         3       Roth, Miss. Sarah A                                female  NaN   0      0      342712            ...
37  929          1         3       Cacic, Miss. Manda                                 female  21.0  0      0      315087            ...
38  930          0         3       Sap, Mr. Julius                                    male    25.0  0      0      345768            ...
39  931          0         3       Hee, Mr. Ling                                      male    NaN   0      0      1601              ...
In [31]:
data2['Cabin'].fillna('New Cabin').head(15)
Out[31]:
0 New Cabin
1 New Cabin
2 New Cabin
3 New Cabin
4 New Cabin
5 New Cabin
6 New Cabin
7 New Cabin
8 New Cabin
9 New Cabin
10 New Cabin
11 New Cabin
12 B45
13 New Cabin
14 E31
Name: Cabin, dtype: object
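The one-hot encoding step mentioned in this section is not shown in the notebook. A minimal sketch of it, assuming the label U for unknown cabins and using `pd.get_dummies` on a small made-up frame, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin column with two missing entries.
toy = pd.DataFrame({"Cabin": [np.nan, "B45", np.nan, "E31"]})

# Give the missing values their own category.
toy["Cabin"] = toy["Cabin"].fillna("U")

# One indicator column per category, including the new "U" class.
encoded = pd.get_dummies(toy, columns=["Cabin"])

print(encoded)
```

With many distinct cabin labels this produces many columns, so in practice the labels may need to be grouped first (for example by deck letter). That grouping is a suggestion, not something the original notebook does.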