Dealing With Missing Data - Jupyter Notebook

The document discusses the importance of handling missing data in datasets, emphasizing that ignoring missing values can lead to biased results in machine learning models. It outlines various techniques for dealing with missing data, including deleting rows, replacing missing values with mean/median/mode, and assigning unique categories for categorical features. The document also provides practical examples using Python's Pandas library to demonstrate these methods.

Uploaded by

kunal.sah.cse

9/6/22, 11:12 AM Dealing with Missing Data - Jupyter Notebook

Dealing with Missing Data

Real-world data often contains many missing values. If you want your model to be
unbiased and accurate, you cannot ignore them. Handling missing values is one of the
most common problems in data cleansing and pre-processing. This article walks through
techniques for handling missing data efficiently.

What is Missing Data?

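Missing data is the absence of an observation in a column. It shows up as markers such as “NA”, “NaN”, “NULL”,
“None” or “Not Applicable”, and sometimes as a placeholder such as “0”. Placeholders such as “0” or
“Not Applicable” are not recognised as missing by isnull() and have to be converted to NaN first.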

Why Does a Dataset Have Missing Values?

Causes include data corruption, failure to record data, lack of information, incomplete results, a person
intentionally withholding data, and system or equipment failure. Any of these can leave gaps in your dataset.

Why Handle Missing Values?

One of the biggest impacts of missing data is that it can bias the results of machine learning models or reduce
their accuracy, so handling missing values is very important.

How to Check for Missing Data?

The first step in handling missing values is to look at the data carefully and find them all. In a Pandas
DataFrame, isnull() returns True where a value is missing (NaN) and False otherwise, while notnull() returns
the opposite. Both return boolean values.
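As a minimal sketch of what these two functions return, the snippet below uses a small hypothetical frame (the `toy` data is made up for illustration and is not the Titanic file used in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0],
    "Cabin": [None, "B45", None],
})

missing_mask = toy.isnull()    # True where a value is missing
present_mask = toy.notnull()   # exact complement of isnull()

print(missing_mask)
print(present_mask)
```

Both masks have the same shape as the original frame, so they can be summed, averaged, or used to filter rows.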

In [2]:

import pandas as pd
import numpy as np

localhost:8888/notebooks/Dealing with Missing Data.ipynb 1/9

In [5]:

data=pd.read_csv("Datasets/titanic.csv")
data.head()

Out[5]:

    PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Ca
0   892          0         3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   N
1   893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   N
2   894          0         2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   N
3   895          0         3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   N
4   896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  N

In [7]:

# Checking NULL Value in dataset

data.isnull().sum()

Out[7]:

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
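Raw counts are easier to judge as a share of the rows. The following sketch, on a hypothetical `toy` frame (an assumption for illustration, not the notebook's data), turns the same isnull() mask into a percentage per column and keeps only the columns that have gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Fare": [7.83, 7.0, np.nan, 8.66],
    "Sex": ["male", "female", "male", "male"],
})

# The mean of a boolean mask is the fraction of True values per column
missing_pct = toy.isnull().mean() * 100

# Keep only columns that actually have missing values, worst first
report = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(report)
```

On the real file the same two lines could be applied to `data` instead of `toy`.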

1. Deleting Rows
This method is commonly used to handle null values. We either delete a row if it has a null value for a
particular feature, or delete a column if more than 70-75% of its values are missing. This method is advised
only when the dataset has enough samples. One has to make sure that deleting the data does not introduce
bias. Removing data also loses information, which can degrade predictions. Note that dropna() without
arguments removes every row that has any missing value; here Cabin alone is missing in 327 rows, so most of
the data is discarded.

In [8]:

data.dropna(inplace=True)

In [9]:

data.isnull().sum()

Out[9]:

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64

Pros:

* Removing rows with missing values can yield a robust model when enough representative data remains

* Deleting a row or column that carries no specific information costs little, since it does not have a high weightage

Cons:

* Loss of information and data

* Works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset
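A more selective version of the deletion rule described in this section could look like the sketch below. The `toy` frame and the 0.70 cut-off are assumptions chosen for illustration; the idea is to drop mostly-empty columns first and only then drop rows that miss a feature you actually need.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Cabin is missing in 3 of 4 rows
toy = pd.DataFrame({
    "Age":   [34.5, np.nan, 62.0, 27.0],
    "Fare":  [7.83, 7.0, 9.69, 8.66],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

threshold = 0.70  # assumed cut-off, in line with the 70-75% rule of thumb

# Step 1: drop columns whose missing fraction exceeds the threshold
col_missing = toy.isnull().mean()
cols_to_drop = col_missing[col_missing > threshold].index
reduced = toy.drop(columns=cols_to_drop)

# Step 2: drop only the rows that miss a value in the feature of interest
cleaned = reduced.dropna(subset=["Age"])

print(list(reduced.columns))
print(len(cleaned), "rows kept out of", len(toy))
```

Compared with a bare dropna(), this keeps rows that are complete in every column that survives step 1.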

2. Replacing With Mean/Median/Mode

This strategy can be applied to a feature that has numeric data, like the age of a person or the ticket fare. We
can calculate the mean, median or mode of the feature and replace the missing values with it. This is an
approximation that distorts the feature's distribution: filling many gaps with one value shrinks its variance.
In return, no rows are lost, which often yields better results than removing rows and columns. Replacing with
any of these three statistics is a statistical approach to handling missing values. If the statistic is computed
on the full dataset before the train/test split, information leaks from the test data into training, so it
should be computed on the training data only. Another way is to estimate the missing value from its
neighbouring values (interpolation). This works better if the data is roughly linear.
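Estimating a gap from its neighbours can be sketched with Series.interpolate. The `readings` series below is hypothetical and deliberately linear; for a feature like Age, where row order carries no meaning, this approach would not be appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps (assumed to be equally spaced)
readings = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# Linear interpolation fills each gap on the straight line between
# the nearest known neighbours
filled = readings.interpolate(method="linear")

print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```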

In [11]:

data['Age'].isnull().sum()

Out[11]:

In [12]:

data2=pd.read_csv("Datasets/titanic.csv")

In [13]:

data2['Age'].isnull().sum()

Out[13]:

In [14]:

age_mean=data2['Age'].mean()

In [19]:

age_mean

Out[19]:

30.272590361445783

In [16]:

data2['Age'].head(20)

Out[16]:

0 34.5
1 47.0
2 62.0
3 27.0
4 22.0
5 14.0
6 30.0
7 26.0
8 18.0
9 21.0
10 NaN
11 46.0
12 23.0
13 63.0
14 47.0
15 24.0
16 35.0
17 21.0
18 27.0
19 45.0
Name: Age, dtype: float64

In [20]:

data2["Age"].replace(np.nan, age_mean).head(20)  # np.NaN was removed in NumPy 2.0; use np.nan

Out[20]:

0 34.50000
1 47.00000
2 62.00000
3 27.00000
4 22.00000
5 14.00000
6 30.00000
7 26.00000
8 18.00000
9 21.00000
10 30.27259
11 46.00000
12 23.00000
13 63.00000
14 47.00000
15 24.00000
16 35.00000
17 21.00000
18 27.00000
19 45.00000
Name: Age, dtype: float64

To replace missing values with the median or mode instead, calculate them as follows:

In [21]:

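# Median of the Age column (NaN values are skipped by default)
age_median = data2["Age"].median()
# mode() returns a Series because ties are possible; take the first value
age_mode = data2["Age"].mode()[0]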
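The sketch below shows one way to fill with the median while computing the statistic on the training rows only, so that nothing leaks from the test rows. The `toy` frame and the positional split are assumptions made for illustration; a real project would use its own split.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; not taken from the Titanic file
toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 22.0, 40.0, np.nan, 47.0]})

# Assumed split: first five rows for training, last two for testing
train = toy.iloc[:5].copy()
test = toy.iloc[5:].copy()

# Statistics come from the training rows only
age_median = train["Age"].median()   # median of 22, 22, 30, 40
age_mode = train["Age"].mode()[0]    # most frequent training value

# The same training statistic fills both parts
train["Age"] = train["Age"].fillna(age_median)
test["Age"] = test["Age"].fillna(age_median)

print(age_median, age_mode)
print(train["Age"].tolist(), test["Age"].tolist())
```

The median is usually preferred over the mean when the feature is skewed, because it is less sensitive to outliers.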

3. Assigning a Unique Category

A categorical feature has a definite number of possible values, such as gender. Since the number of classes is
fixed, we can assign a new class to the missing values. Here, the feature Cabin has missing values (327 in the
check above, while Embarked has none in this file), which can be replaced with a new category, say, U for
‘unknown’. This strategy adds a category to the dataset, which changes its variance. Since the feature is
categorical, we need to apply one-hot encoding to convert it to a numeric form that the algorithm can use.
Let us look at how it can be done in Python:

In [29]:

data2.head(40)

Out[29]:

    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare
0   892          0         3       Kelly, Mr. James                                    male    34.5  0      0      330911            7.
1   893          1         3       Wilkes, Mrs. James (Ellen Needs)                    female  47.0  1      0      363272            7.
2   894          0         2       Myles, Mr. Thomas Francis                           male    62.0  0      0      240276            9.
3   895          0         3       Wirz, Mr. Albert                                    male    27.0  0      0      315154            8.
4   896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)        female  22.0  1      1      3101298           12.
5   897          0         3       Svensson, Mr. Johan Cervin                          male    14.0  0      0      7538              9.
6   898          1         3       Connolly, Miss. Kate                                female  30.0  0      0      330972            7.
7   899          0         2       Caldwell, Mr. Albert Francis                        male    26.0  1      1      248738            29.
8   900          1         3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)           female  18.0  0      0      2657              7.
9   901          0         3       Davies, Mr. John Samuel                             male    21.0  2      0      A/4 48871         24.
10  902          0         3       Ilieff, Mr. Ylio                                    male    NaN   0      0      349220            7.
11  903          0         1       Jones, Mr. Charles Cresson                          male    46.0  0      0      694               26.
12  904          1         1       Snyder, Mrs. John Pillsbury (Nelle Stevenson)       female  23.0  1      0      21228             82.
13  905          0         2       Howard, Mr. Benjamin                                male    63.0  1      0      24065             26.
14  906          1         1       Chaffee, Mrs. Herbert Fuller (Carrie Constance...   female  47.0  1      0      W.E.P. 5734       61.
15  907          1         2       del Carlo, Mrs. Sebastiano (Argenia Genovesi)       female  24.0  1      0      SC/PARIS 2167     27.
16  908          0         2       Keane, Mr. Daniel                                   male    35.0  0      0      233734            12.
17  909          0         3       Assaf, Mr. Gerios                                   male    21.0  0      0      2692              7.
18  910          1         3       Ilmakangas, Miss. Ida Livija                        female  27.0  1      0      STON/O2. 3101270  7.
19  911          1         3       Assaf Khalil, Mrs. Mariana (Miriam")"               female  45.0  0      0      2696              7.
20  912          0         1       Rothschild, Mr. Martin                              male    55.0  1      0      PC 17603          59.
21  913          0         3       Olsen, Master. Artur Karl                           male    9.0   0      1      C 17368           3.
22  914          1         1       Flegenheim, Mrs. Alfred (Antoinette)                female  NaN   0      0      PC 17598          31.
23  915          0         1       Williams, Mr. Richard Norris II                     male    21.0  0      1      PC 17597          61.
24  916          1         1       Ryerson, Mrs. Arthur Larned (Emily Maria Borie)     female  48.0  1      3      PC 17608          262.
25  917          0         3       Robins, Mr. Alexander A                             male    50.0  1      0      A/5. 3337         14.
26  918          1         1       Ostby, Miss. Helene Ragnhild                        female  22.0  0      1      113509            61.
27  919          0         3       Daher, Mr. Shedid                                   male    22.5  0      0      2698              7.
28  920          0         1       Brady, Mr. John Bertram                             male    41.0  0      0      113054            30.
29  921          0         3       Samaan, Mr. Elias                                   male    NaN   2      0      2662              21.
30  922          0         2       Louch, Mr. Charles Alexander                        male    50.0  1      0      SC/AH 3085        26.
31  923          0         2       Jefferys, Mr. Clifford Thomas                       male    24.0  2      0      C.A. 31029        31.
32  924          1         3       Dean, Mrs. Bertram (Eva Georgetta Light)            female  33.0  1      2      C.A. 2315         20.
33  925          1         3       Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"   female  NaN   1      2      W./C. 6607        23.
34  926          0         1       Mock, Mr. Philipp Edmund                            male    30.0  1      0      13236             57.
35  927          0         3       Katavelas, Mr. Vassilios (Catavelas Vassilios")"    male    18.5  0      0      2682              7.
36  928          1         3       Roth, Miss. Sarah A                                 female  NaN   0      0      342712            8.
37  929          1         3       Cacic, Miss. Manda                                  female  21.0  0      0      315087            8.
38  930          0         3       Sap, Mr. Julius                                     male    25.0  0      0      345768            9.
39  931          0         3       Hee, Mr. Ling                                       male    NaN   0      0      1601              56.

In [31]:

data2['Cabin'].fillna('New Cabin').head(15)

Out[31]:

0 New Cabin
1 New Cabin
2 New Cabin
3 New Cabin
4 New Cabin
5 New Cabin
6 New Cabin
7 New Cabin
8 New Cabin
9 New Cabin
10 New Cabin
11 New Cabin
12 B45
13 New Cabin
14 E31
Name: Cabin, dtype: object
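The one-hot encoding step mentioned in this section is not shown in the notebook. A minimal sketch, assuming the placeholder category "U" and a hypothetical `toy` frame with only a Cabin column, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the cabin labels are for illustration only
toy = pd.DataFrame({"Cabin": [np.nan, "B45", np.nan, "E31"]})

# Step 1: give missing values their own category, "U" for unknown
toy["Cabin"] = toy["Cabin"].fillna("U")

# Step 2: one-hot encode, producing one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["Cabin"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new columns, and the `Cabin_U` column marks the rows whose cabin was originally missing. On a column with many distinct labels, such as the real Cabin feature, this produces many sparse columns, so grouping labels before encoding may be worth considering.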
