
Dealing With Missing Data - Jupyter Notebook

The document discusses the importance of handling missing data in datasets, emphasizing that ignoring missing values can lead to biased results in machine learning models. It outlines various techniques for dealing with missing data, including deleting rows, replacing missing values with mean/median/mode, and assigning unique categories for categorical features. The document also provides practical examples using Python's Pandas library to demonstrate these methods.

Uploaded by

kunal.sah.cse
9/6/22, 11:12 AM Dealing with Missing Data - Jupyter Notebook

Dealing with Missing Data


Real-world data often contains many missing values. If you want your model to be
unbiased and accurate, you cannot ignore them. Handling missing values is one of the
most common problems in data cleansing and pre-processing. This article walks through
techniques for handling missing data efficiently.

What is Missing Data?


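Missing data means that observations are absent from one or more columns. Missing entries may be recorded as
“NA”, “NaN”, “NULL”, “None” or “Not Applicable”, or as a placeholder such as “0”. Pandas does not treat
placeholders like “0” or “Not Applicable” as missing on its own, so they have to be converted to NaN before the
checks shown in this article will count them.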
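
As a minimal sketch of that conversion, the snippet below maps two placeholder spellings to NaN. The toy DataFrame, its column names and the choice of which placeholders count as missing are assumptions made for illustration; they are not taken from the Titanic file used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three different ways of writing "missing".
raw = pd.DataFrame({
    "age": [34.5, 0.0, 62.0],
    "cabin": ["B45", "Not Applicable", None],
})

clean = raw.copy()
# Treat 0 as missing only where it cannot be a real value (an age, here).
clean["age"] = clean["age"].replace(0.0, np.nan)
# Map the text placeholder to NaN so pandas recognises it as missing.
clean["cabin"] = clean["cabin"].replace("Not Applicable", np.nan)

missing_per_column = clean.isnull().sum()
print(missing_per_column)
```

Whether 0 is a placeholder or a genuine measurement depends on the column, so this step usually needs a decision per feature.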

Why Do Datasets Have Missing Values?


Missing values can be caused by data corruption, failure to record data, lack of information, incomplete
results, a person intentionally withholding data, or a system or equipment failure. There can be any number
of reasons for missing values in your dataset.

Why Handle Missing Values?


One of the biggest impacts of missing data is that it can bias the results of machine learning models or reduce
their accuracy. So, it is very important to handle missing values.

How to Check for Missing Data?


The first step in handling missing values is to look at the data carefully and find all of them. To check for
missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both return boolean values:
isnull() returns True where a value is missing (“NaN”), and notnull() returns True where a value is present.
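
A small sketch of the two functions on a hand-made Series (the values are hypothetical and chosen only to show the boolean masks):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap.
ages = pd.Series([34.5, np.nan, 62.0], name="Age")

is_missing = ages.isnull()    # True where the value is NaN
is_present = ages.notnull()   # element-wise opposite of isnull()

print(is_missing.tolist())
print(is_present.tolist())
print(int(is_missing.sum()))  # number of missing entries
```

Summing the boolean mask counts the missing entries, which is what the notebook does per column with `data.isnull().sum()`.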

In [2]:

import pandas as pd
import numpy as np

localhost:8888/notebooks/Dealing with Missing Data.ipynb 1/9



In [5]:

data=pd.read_csv("Datasets/titanic.csv")
data.head()

Out[5]:

    PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Ca
0   892          0         3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   N
1   893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   N
2   894          0         2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   N
3   895          0         3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   N
4   896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  N

In [7]:

# Checking NULL Value in dataset

data.isnull().sum()

Out[7]:

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
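
Raw counts are easier to judge as a share of all rows. The sketch below computes the percentage of missing values per column; it uses a hypothetical four-row DataFrame rather than the Titanic file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataset.
df = pd.DataFrame({
    "Age":   [34.5, np.nan, 62.0, 27.0],
    "Fare":  [7.83, 7.00, 9.69, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

# isnull() gives a boolean frame; the column mean of booleans is the missing share.
missing_percent = (df.isnull().mean() * 100).round(1)
print(missing_percent.sort_values(ascending=False))
```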

1. Deleting Rows
This method is commonly used to handle null values. Here, we either delete a particular row if it has a null
value for a particular feature, or delete a particular column if more than 70-75% of its values are missing.
This method is advised only when there are enough samples in the data set. One has to make sure that deleting
the data does not introduce bias. Removing data also loses information, so the model may not give the expected
results when predicting the output.
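
The two kinds of deletion described above can be sketched as follows. The toy DataFrame and the exact 0.70 cut-off are assumptions for illustration; the text only gives a rough 70-75% range.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Cabin is mostly empty, Age and Fare have one gap each.
df = pd.DataFrame({
    "Age":   [34.5, np.nan, 62.0, 27.0],
    "Fare":  [7.83, 7.00, np.nan, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

max_missing_share = 0.70  # assumed threshold

# Drop columns whose share of missing values exceeds the threshold.
missing_share = df.isnull().mean()
reduced = df.loc[:, missing_share <= max_missing_share]

# Drop rows that are missing a value in one particular feature.
rows_kept = reduced.dropna(subset=["Age"])

print(list(reduced.columns))
print(len(df), "->", len(rows_kept), "rows")
```

Calling `dropna()` with no arguments, as the notebook cell does, removes every row that has a gap in any column, which can discard far more data than restricting the check with `subset`.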

In [8]:

data.dropna(inplace=True)

In [9]:

data.isnull().sum()

Out[9]:

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64

Pros:

* Removing records with missing values leaves only fully observed data, which can give a robust model when few records are affected

* Deleting a row or column that carries no specific information is reasonable, since it does not have a high weightage

Cons:

* Loss of information and data

* Works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset

2. Replacing With Mean/Median/Mode


This strategy can be applied to a feature that holds numeric data, such as the age of a person or the ticket
fare. We calculate the mean, median or mode of the feature and use it to replace the missing values. This is an
approximation: every gap receives the same value, which distorts the variance of the feature. In exchange, no
rows or columns are lost, which often yields better results than removing them. Replacing with one of these
three statistics is a statistical approach to handling missing values. If the statistic is computed on the full
dataset before it is split into training and test sets, information from the test set leaks into training (data
leakage), so it should be computed on the training data only. Another way is to approximate a missing value from
its neighbouring values, for example by interpolation. This works better if the data is linear.
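
The sketch below illustrates two points from the paragraph above under stated assumptions: the mean is taken from a hypothetical training split only and then reused for the test split, and a gap in an ordered series is filled from its neighbours with linear interpolation. The tiny train/test frames are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split of an Age column.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Compute the statistic on the training data only, then reuse it everywhere.
train_mean = train["Age"].mean()          # NaN is skipped by default
train_filled = train["Age"].fillna(train_mean)
test_filled = test["Age"].fillna(train_mean)

# Neighbour-based alternative: linear interpolation on ordered data.
ordered = pd.Series([10.0, np.nan, 30.0])
interpolated = ordered.interpolate()      # linear by default

print(train_mean)
print(test_filled.tolist())
print(interpolated.tolist())
```

Interpolation only makes sense when the row order carries meaning (for example a time series); for unordered passenger records a single statistic is the more natural choice.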


In [11]:

data['Age'].isnull().sum()

Out[11]:

0

In [12]:

data2=pd.read_csv("Datasets/titanic.csv")

In [13]:

data2['Age'].isnull().sum()

Out[13]:

86

In [14]:

age_mean=data2['Age'].mean()

In [19]:

age_mean

Out[19]:

30.272590361445783


In [16]:

data2['Age'].head(20)

Out[16]:

0 34.5
1 47.0
2 62.0
3 27.0
4 22.0
5 14.0
6 30.0
7 26.0
8 18.0
9 21.0
10 NaN
11 46.0
12 23.0
13 63.0
14 47.0
15 24.0
16 35.0
17 21.0
18 27.0
19 45.0
Name: Age, dtype: float64

In [20]:

# np.nan replaces np.NaN (removed in NumPy 2.0); the result is a new Series, data2 is unchanged
data2["Age"].replace(np.nan, age_mean).head(20)

Out[20]:

0 34.50000
1 47.00000
2 62.00000
3 27.00000
4 22.00000
5 14.00000
6 30.00000
7 26.00000
8 18.00000
9 21.00000
10 30.27259
11 46.00000
12 23.00000
13 63.00000
14 47.00000
15 24.00000
16 35.00000
17 21.00000
18 27.00000
19 45.00000
Name: Age, dtype: float64

To impute with the median or the mode instead, compute those statistics in the same way:


In [21]:

age_median = data2["Age"].median()
age_mode = data2["Age"].mode()[0]  # mode() returns a Series; take its first entry
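
A minimal sketch of using these two statistics for the actual fill, on a hypothetical Series instead of the Titanic column. Note that `mode()` returns a Series (there can be ties), so one entry has to be picked, and that `fillna` returns a new object which must be assigned if the result is to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap.
ages = pd.Series([22.0, 22.0, 30.0, np.nan, 62.0], name="Age")

age_median = ages.median()    # middle value of the non-missing entries
age_mode = ages.mode()[0]     # most frequent value; first one if there are ties

# fillna returns a new Series, so the result is assigned to a name.
filled_median = ages.fillna(age_median)
filled_mode = ages.fillna(age_mode)

print(age_median, age_mode)
print(filled_median.tolist())
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed features such as fares.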

3. Assigning a Unique Category


A categorical feature has a definite number of possible values, such as gender. Since the number of classes is
fixed, we can assign one more class to the missing values. Here, the Cabin feature has missing values (327 in the
count above), which can be replaced with a new category, say, U for ‘unknown’. The same approach applies to a
feature such as Embarked whenever it contains gaps. This strategy adds an extra level to the feature, which
changes its distribution. Since the feature is categorical, it also needs to be one-hot encoded into numeric
form for the algorithm to understand it. Let us look at how it can be done in Python:


In [29]:

data2.head(40)

Out[29]:

    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket
0   892          0         3       Kelly, Mr. James                                   male    34.5  0      0      330911            7.
1   893          1         3       Wilkes, Mrs. James (Ellen Needs)                   female  47.0  1      0      363272            7.
2   894          0         2       Myles, Mr. Thomas Francis                          male    62.0  0      0      240276            9.
3   895          0         3       Wirz, Mr. Albert                                   male    27.0  0      0      315154            8.
4   896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)       female  22.0  1      1      3101298           12.
5   897          0         3       Svensson, Mr. Johan Cervin                         male    14.0  0      0      7538              9.
6   898          1         3       Connolly, Miss. Kate                               female  30.0  0      0      330972            7.
7   899          0         2       Caldwell, Mr. Albert Francis                       male    26.0  1      1      248738            29.
8   900          1         3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)          female  18.0  0      0      2657              7.
9   901          0         3       Davies, Mr. John Samuel                            male    21.0  2      0      A/4 48871         24.
10  902          0         3       Ilieff, Mr. Ylio                                   male    NaN   0      0      349220            7.
11  903          0         1       Jones, Mr. Charles Cresson                         male    46.0  0      0      694               26.
12  904          1         1       Snyder, Mrs. John Pillsbury (Nelle Stevenson)      female  23.0  1      0      21228             82.
13  905          0         2       Howard, Mr. Benjamin                               male    63.0  1      0      24065             26.
14  906          1         1       Chaffee, Mrs. Herbert Fuller (Carrie Constance...  female  47.0  1      0      W.E.P. 5734       61.
15  907          1         2       del Carlo, Mrs. Sebastiano (Argenia Genovesi)      female  24.0  1      0      SC/PARIS 2167     27.
16  908          0         2       Keane, Mr. Daniel                                  male    35.0  0      0      233734            12.
17  909          0         3       Assaf, Mr. Gerios                                  male    21.0  0      0      2692              7.
18  910          1         3       Ilmakangas, Miss. Ida Livija                       female  27.0  1      0      STON/O2. 3101270  7.
19  911          1         3       Assaf Khalil, Mrs. Mariana (Miriam")"              female  45.0  0      0      2696              7.
20  912          0         1       Rothschild, Mr. Martin                             male    55.0  1      0      PC 17603          59.
21  913          0         3       Olsen, Master. Artur Karl                          male    9.0   0      1      C 17368           3.
22  914          1         1       Flegenheim, Mrs. Alfred (Antoinette)               female  NaN   0      0      PC 17598          31.
23  915          0         1       Williams, Mr. Richard Norris II                    male    21.0  0      1      PC 17597          61.
24  916          1         1       Ryerson, Mrs. Arthur Larned (Emily Maria Borie)    female  48.0  1      3      PC 17608          262.
25  917          0         3       Robins, Mr. Alexander A                            male    50.0  1      0      A/5. 3337         14.
26  918          1         1       Ostby, Miss. Helene Ragnhild                       female  22.0  0      1      113509            61.
27  919          0         3       Daher, Mr. Shedid                                  male    22.5  0      0      2698              7.
28  920          0         1       Brady, Mr. John Bertram                            male    41.0  0      0      113054            30.
29  921          0         3       Samaan, Mr. Elias                                  male    NaN   2      0      2662              21.
30  922          0         2       Louch, Mr. Charles Alexander                       male    50.0  1      0      SC/AH 3085        26.
31  923          0         2       Jefferys, Mr. Clifford Thomas                      male    24.0  2      0      C.A. 31029        31.
32  924          1         3       Dean, Mrs. Bertram (Eva Georgetta Light)           female  33.0  1      2      C.A. 2315         20.
33  925          1         3       Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"  female  NaN   1      2      W./C. 6607        23.
34  926          0         1       Mock, Mr. Philipp Edmund                           male    30.0  1      0      13236             57.
35  927          0         3       Katavelas, Mr. Vassilios (Catavelas Vassilios")"   male    18.5  0      0      2682              7.
36  928          1         3       Roth, Miss. Sarah A                                female  NaN   0      0      342712            8.
37  929          1         3       Cacic, Miss. Manda                                 female  21.0  0      0      315087            8.
38  930          0         3       Sap, Mr. Julius                                    male    25.0  0      0      345768            9.
39  931          0         3       Hee, Mr. Ling                                      male    NaN   0      0      1601              56.

In [31]:

data2['Cabin'].fillna('New Cabin').head(15)

Out[31]:

0 New Cabin
1 New Cabin
2 New Cabin
3 New Cabin
4 New Cabin
5 New Cabin
6 New Cabin
7 New Cabin
8 New Cabin
9 New Cabin
10 New Cabin
11 New Cabin
12 B45
13 New Cabin
14 E31
Name: Cabin, dtype: object
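
The cell above only fills the gaps. A possible follow-up, sketched here on a hypothetical four-row column, fills with the label U mentioned in the text and then one-hot encodes the result with `pd.get_dummies`. The data and the label are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin column with two gaps.
df = pd.DataFrame({"Cabin": ["B45", np.nan, "E31", np.nan]})

# Give missing entries their own category and keep the result.
df["Cabin"] = df["Cabin"].fillna("U")

# One indicator column per category, including the new "U" level.
encoded = pd.get_dummies(df, columns=["Cabin"])

print(list(encoded.columns))
print(int(encoded["Cabin_U"].sum()))  # rows that were originally missing
```

For a feature with many distinct values, such as real cabin numbers, one-hot encoding produces a very wide table, so grouping the values first (for example by deck letter) may be worth considering.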
