Tutorial-Identifying and Imputation of Missing Values

The document outlines the process of identifying and imputing missing values in datasets using R programming. It includes steps for importing datasets, installing necessary packages, and methods for replacing or dropping missing values. Additionally, it covers feature engineering, encoding categorical variables, and normalizing data as part of the data cleaning pipeline.

Uploaded by

a2020lehong


Identifying and Imputing Missing Values

Objective:
I. Identify missing values.
II. Replace missing values.
III. Import datasets from the web and from local drives.
IV. Install the packages needed for data manipulation.

Run = Ctrl+Alt+Enter
Pipe operator (%>%) = Ctrl+Shift+M
Install these packages, which are used for data cleaning and manipulation:
tidyr (Tidy Messy Data): https://tidyr.tidyverse.org
# The easiest way to get tidyr is to install the whole tidyverse:
# You may also install the waldo package to use its compare() function.

install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
library(tidyr)
Data sets
Lemonade2016 (CSV file) or starwars (included with the dplyr package)
1. Import the file.
2. Identify missing values, character values, and NA and NaN values.
3. Count the missing values (na, NA, blank space) in each column.
4. Replace missing values with the required numeric value.

Importing a comma-separated (CSV) file from the web.

superbowl <- read.table(
"https://www.portfolioprobe.com/R/blog/dowjones_super.csv",
sep=",", header=TRUE)

superbowl

If your CSV file is small enough, you may simply use base R's read.csv function to import it.

data2 <- read.csv("lemonade2016.csv", header = TRUE)

data2

//////////////////////////////////////////////////////////////////////////////////////////

install.packages("readr")

library(readr)

data2 <- read_csv("Lemonade2016.csv") #this is working very well

Use any tool at your disposal to clean the data; the one that works for you is the best.

If you are unable to import the file, re-save it as CSV UTF-8 (comma-separated values) and try again.

//////////////////////////////////////////////////////////////////////////////////////////////////////////

Identify the missing values. Only NA is recognized as missing; placeholders such as "–" and "na" are not recognized.

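One way to make R treat such placeholders as missing is the na.strings argument of read.csv. The sketch below is only an illustration: the inline CSV text, its column values, and the names csv_text, raw and clean are made up for the example and are not taken from the lemonade file.

```r
# Hypothetical three-row CSV; "-" and "na" stand in for missing entries.
csv_text <- "Lemon,Orange,Location
97,67,Park
-,na,Beach
110,,Park"

# Default import: "-" and "na" are kept as ordinary text,
# so the affected columns are read as character and no NA is reported.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sum(is.na(raw))

# Declare every placeholder that should count as missing.
clean <- read.csv(text = csv_text,
                  na.strings = c("NA", "na", "-", ""),
                  stringsAsFactors = FALSE)
sum(is.na(clean))
str(clean)
```

After the second import the Lemon and Orange columns are numeric again, so the mean-based imputation used later in this tutorial can be applied to them.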
Count the total number of NA (missing) values

> sum(is.na(data2))
[1] 11

Missing values for each column
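A per-column count can be obtained with colSums(is.na(...)). The small data frame below is a hypothetical stand-in for data2 (its values are invented), used only so the sketch runs on its own.

```r
# Hypothetical stand-in for data2.
df <- data.frame(
  Lemon = c(97, NA, 110, NA),
  Orange = c(67, 98, NA, 77),
  Location = c("Park", "Beach", NA, "Park"),
  stringsAsFactors = FALSE
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(df))
na_per_column

# The grand total matches sum(is.na(df)).
sum(na_per_column)
```

With the real data, colSums(is.na(data2)) gives the same kind of named vector, one count per column.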

Impute the NA values using the mean of the column:

data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))

Recheck the number of missing values
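The imputation and the recheck can be tried on a toy column first. The values below are invented for illustration; only the pattern (mean with na.rm = TRUE, then assignment into the NA positions) comes from the tutorial.

```r
# Hypothetical column with two missing entries.
df <- data.frame(Lemon = c(97, NA, 110, 102, NA))

# Mean of the observed values only (97, 110, 102), rounded as in the tutorial.
lemon_mean <- round(mean(df$Lemon, na.rm = TRUE))

# Overwrite just the NA positions.
df$Lemon[is.na(df$Lemon)] <- lemon_mean

# Recheck: no NA should remain in this column.
sum(is.na(df$Lemon))
df$Lemon
```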

The missing values in the Lemon column have been replaced, but the other columns still have their missing values.
Use the same code to replace missing values in other columns, e.g. Location. Note that a mean can only be computed for a numeric column.

You may decide to replace all the missing values at once, but this is very risky [be careful: one rule applied to every column can silently fill in values that make no sense for some of them].
It is a matter of knowing which package provides the functionality, for example the zoo package
[you may try installing and loading the zoo package and using its na.aggregate(x) function].
The imputeTS package also works very well, but only for numeric columns.

install.packages("imputeTS")
library(imputeTS)
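For comparison, the same "all numeric columns at once" idea can be sketched in base R without any extra package. This is an alternative illustration, not the imputeTS API; the data frame and the name numeric_cols are hypothetical.

```r
# Hypothetical data with NAs in numeric and text columns.
df <- data.frame(
  Lemon = c(97, NA, 110),
  Orange = c(NA, 98, 77),
  Location = c("Park", NA, "Beach"),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns only; text columns are left untouched.
numeric_cols <- sapply(df, is.numeric)

# Replace each NA with the mean of its own column.
df[numeric_cols] <- lapply(df[numeric_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# Only the text column should still report a missing value.
colSums(is.na(df))
```

This also shows why bulk replacement needs care: the text column keeps its NA, and every numeric column gets a mean whether or not a mean is a sensible fill value for it.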

If the number of missing values is so small that removing them will not affect the overall analysis, you may
drop them. Drop the missing values in the orange and Location columns.
Remove every row that contains an NA with this code:
# complete.cases() is TRUE for rows with no NA in any column
data2_new <- data2[complete.cases(data2), ]

/////////////////////////////////////////////////////////////////////////////////////////////////////

Remember: every concept must be researched

/////////////////////////////////////////////////////////////////////////////////////////////////////

Cleaning Data sequence 2

Packages and dataset to install and load

library(dplyr)
library(tidyr)
library(skimr)
starwars dataset (included with dplyr)

skimr provides skim(), which prints a summary of a data frame, including the number of missing values in each column.

Let's extract height, mass and gender from the dataset

data <- starwars %>%
  select(height, mass, gender)
data

Split the data after installing and loading rsample (it is also attached by tidymodels)
///////////////////////////////////////////////////////////////////////////////////////////////////////////
library(rsample)
# initial_split() puts 3/4 of the rows in the training set by default
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)

Checking the number of rows in each split
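The row counts can be checked with nrow(data_train) and nrow(data_test). To show what such a split does, here is a base-R sketch of a 75/25 random split; it is not the rsample implementation, and the toy data frame, the seed and the names train_idx, toy_train and toy_test are assumptions made for the example.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the three-column data set.
toy <- data.frame(height = 1:20, mass = 21:40)

# Randomly choose 75% of the row indices for training.
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Check the size of each part; together they cover every row once.
nrow(toy_train)
nrow(toy_test)
```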

You must clean the datasets (as in the first tutorial).

Keep data_test for validation and clean data_train.

Creating a new feature bmi

data_train <- data_train %>%
  mutate(bmi = mass / (height * height))
data_train
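A note on units: starwars records height in centimetres and mass in kilograms, so mass/(height*height) differs from the conventional BMI (kg per square metre) by a constant factor. The sketch below uses one hypothetical individual (172 cm, 77 kg) to show the relationship; the variable names are made up for the example.

```r
# Hypothetical individual: height in cm, mass in kg.
height_cm <- 172
mass_kg <- 77

# Formula as written in the tutorial (height left in cm).
bmi_tutorial <- mass_kg / (height_cm * height_cm)

# Conventional BMI: convert height to metres first.
bmi_metric <- mass_kg / (height_cm / 100)^2

bmi_tutorial
bmi_metric

# The two differ only by a constant factor of 100^2.
bmi_metric / bmi_tutorial
```

Because the factor is constant, the later normalisation step gives the same scaled feature either way; the conversion matters only if the raw bmi values are to be read as real BMI figures.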

To check for missing values

Use skim() to check for missing values.

skim(data_train)
Or use any(is.na()) for a single TRUE/FALSE answer and colSums(is.na()) for the count in each column:

any(is.na(data_train))

colSums(is.na(data_train))

Drop missing values when there are very few of them.
Here, drop the rows where height or gender is missing.
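A minimal base-R sketch of this step is shown below, assuming a small made-up training set. The tidyr call used later in this tutorial, drop_na(height, gender), does the same job inside a pipeline; toy_train, keep and toy_dropped are hypothetical names.

```r
# Hypothetical training rows with NAs in different columns.
toy_train <- data.frame(
  height = c(172, NA, 96, 202),
  mass = c(77, 75, NA, 136),
  gender = c("masculine", "masculine", "masculine", NA),
  stringsAsFactors = FALSE
)

# Keep rows that are complete in height and gender only;
# a missing mass does not cause a row to be dropped.
keep <- complete.cases(toy_train[, c("height", "gender")])
toy_dropped <- toy_train[keep, ]

# height and gender are now complete; mass still has an NA to impute.
colSums(is.na(toy_dropped))
```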

Imputation of missing values for mass and bmi

The general form is:
ifelse(condition, value_if_true, value_if_false)

data_tr_imputed <- data_train %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi))
data_tr_imputed

gender is a categorical variable and must be
encoded [changed to 1 or 0]
data_tr_imputed_encoded <- data_tr_imputed %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)
data_tr_imputed_encoded

Feature Scaling

Creating a function for normalisation

# Standardize a feature: subtract the mean, then divide by the standard deviation
normalize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}
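A quick check of a standardisation function of this kind: after scaling, a feature should have mean 0 and standard deviation 1. The function name z_score and the input vector are hypothetical, chosen so the sketch does not depend on anything defined earlier.

```r
# Same idea as the tutorial's normalisation function, under a separate name.
z_score <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}

x <- c(2, 4, 6, 8)  # made-up feature values
z <- z_score(x)

z
mean(z)  # approximately 0
sd(z)    # approximately 1
```

mean() and sd() return NA when the input contains NA, which is why the missing values are imputed or dropped before the scaling step.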

Complete process pipeline
Putting the whole data cleaning process into one pipeline

Steps

I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.

data_train %>%
  mutate(bmi = mass / (height * height)) %>%
  drop_na(height, gender) %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi)) %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender) %>%
  mutate(across(everything(), normalize))

Using Recipes for Data cleaning pipeline

install.packages("recipes")
library(recipes)

[The recipes package provides functions that do all of the coding above]

data_train %>%
  recipe() %>%
  step_mutate(BMI = mass / (height * height)) %>%
  step_naomit(height, gender) %>%
  # step_impute_mean() is the current name of the deprecated step_meanimpute()
  step_impute_mean(mass, BMI) %>%
  step_dummy(gender) %>%
  step_normalize(everything()) %>%
  prep()

/////////////////////////ENCODING CATEGORICAL DATASET Using Iris/////////////////////////////
iris %>%
  mutate(Species_versicolor = ifelse(Species == "versicolor", 1, 0),
         Species_virginica = ifelse(Species == "virginica", 1, 0)) %>%
  # remove the original Species column
  select(-Species)
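Two indicator columns are enough for three species: a row with 0 in both columns can only be setosa, which acts as the reference level. The base-R sketch below checks this on the built-in iris data; the names encoded and both_zero are made up for the example.

```r
# Build the same two indicator columns with base R.
encoded <- data.frame(
  Species_versicolor = ifelse(iris$Species == "versicolor", 1, 0),
  Species_virginica = ifelse(iris$Species == "virginica", 1, 0)
)

# Each indicator marks the 50 rows of its own species.
colSums(encoded)

# Rows that are 0 in both columns are exactly the setosa rows.
both_zero <- encoded$Species_versicolor == 0 & encoded$Species_virginica == 0
table(iris$Species[both_zero])
```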
