Identifying and Imputing Missing Values
Objectives:
I. Identify missing values.
II. Replace missing values.
III. Import datasets from the web and from local drives.
IV. Install the packages needed for data manipulation.
Run current line or selection = Ctrl+Enter
Pipe operator (%>%) = Ctrl+Shift+M
Install these packages, which are used for data cleaning and manipulation.
tidyr (Tidy Messy Data): https://tidyr.tidyverse.org
# The easiest way to get tidyr is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
library(tidyr)
# Optional: install.packages("waldo") to use its compare() function
Datasets
Lemonade2016 (CSV file) or starwars (included in dplyr, which is installed with the tidyverse)
1. Import the file.
2. Identify missing values, including character values that stand for missing data, as well as NA and NaN values.
3. Count the missing values (na, NA, blank spaces) in each column.
4. Replace the missing values with the required numeric values.
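Steps 2 and 3 can be sketched in base R as follows. The data frame, the list of tokens treated as missing, and the helper name is_missing are all assumptions made for illustration; adjust them to whatever your own file contains.

```r
# Toy data frame (made-up values) with several kinds of "missing" entries
df <- data.frame(
  Location = c("Park", "na", "Beach", " ", NA),
  Lemon    = c("97", "NA", "", "110", "98"),
  stringsAsFactors = FALSE
)

# Tokens treated as missing in this sketch (an assumption; adjust to your data)
missing_tokens <- c("", "na", "NA", "NaN", "-")

# TRUE where a cell is a real NA or, after trimming spaces, one of the tokens
is_missing <- function(x) {
  is.na(x) | trimws(as.character(x)) %in% missing_tokens
}

# Number of missing-like entries in each column
missing_per_column <- sapply(df, function(col) sum(is_missing(col)))
missing_per_column
```

Counting this way before any conversion shows how many entries would be lost if the text tokens were simply read as ordinary values.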
Importing a comma-separated (CSV) file from the web
superbowl <- read.table(
"https://www.portfolioprobe.com/R/blog/dowjones_super.csv",
sep=",", header=TRUE)
superbowl
If your CSV file is small enough, you can simply use base R's read.csv() function to import it.
data2 <- read.csv("Lemonade2016.csv", header = TRUE)
data2
//////////////////////////////////////////////////////////////////////////////////////////
install.packages("readr")
library(readr)
data2 <- read_csv("Lemonade2016.csv") #this is working very well
Use any tool at your disposal to clean the data; the one that works for you is the best.
If you are unable to import the file, resave it as CSV UTF-8 (comma-separated values).
//////////////////////////////////////////////////////////////////////////////////////////////////////////
Identify the missing values. Only NA is recognized as missing; entries such as – and na are not.
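One possible way to have such entries recognized at import time is the na.strings argument of read.csv(). The sketch below reads made-up inline text rather than the real file, and the token list (including a plain hyphen standing in for the dash) is an assumption to adapt to your data.

```r
# Inline CSV text stands in for a file such as Lemonade2016.csv (made-up rows)
csv_text <- "Location,Lemon,Orange
Park,97,67
na,-,74
Beach,110,
Beach,NA,81"

# na.strings lists every token that should be read as NA;
# strip.white = TRUE trims spaces around unquoted fields
demo <- read.csv(text = csv_text, header = TRUE,
                 na.strings = c("NA", "na", "-", ""),
                 strip.white = TRUE)

demo
sum(is.na(demo))  # total number of values now recognized as NA
```

Because the tokens are converted while reading, the Lemon column comes in as numeric instead of character.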
Count the total number of NA missing values
> sum(is.na(data2))
[1] 11
Missing values for each column
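A minimal sketch of the per-column count, using a small made-up data frame in place of data2:

```r
# Small stand-in for data2 (made-up values)
demo <- data.frame(
  Lemon  = c(97, NA, 110, NA),
  Orange = c(67, 74, NA, 81),
  Price  = c(0.25, 0.25, 0.30, 0.30)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs in each column
na_per_column <- colSums(is.na(demo))
na_per_column
```

With the tutorial's own object, the same call would be colSums(is.na(data2)).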
Impute the NA values in a column using the column mean with the code below.
data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))
Recheck the number of missing values.
The missing values in the Lemon column have been replaced, but the other columns still have their missing values.
Use the same code to replace missing values in the other numeric columns. For a non-numeric column (Location, if it holds place names), a mean cannot be computed, so choose another replacement value, for example the most frequent value.
You may decide to replace all the missing values at once, but this is risky: a single rule applied to every column can fill a column with values that make no sense for it.
It is a matter of knowing which package has the functionality you need, for example the zoo package
[you may try installing and loading zoo and using its na.aggregate(x) function].
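As a rough base R sketch of replacing missing values in several columns at once while limiting the risk, the rule below is applied only to numeric columns. The data frame and the helper name impute_mean are hypothetical.

```r
# Made-up data with one character column and two numeric columns
demo <- data.frame(
  Location = c("Park", NA, "Beach", "Beach"),
  Lemon    = c(90, NA, 110, 100),
  Orange   = c(60, 70, NA, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA with the mean of the non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only; other columns are left untouched
numeric_cols <- sapply(demo, is.numeric)
demo[numeric_cols] <- lapply(demo[numeric_cols], impute_mean)
demo
```

The missing Location stays NA here, which is deliberate: it needs its own decision rather than a numeric rule.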
The imputeTS package also works well, but only for numeric columns, because its imputation methods are computed from numeric values.
install.packages("imputeTS")
library(imputeTS)
If the number of missing values is so small that dropping them will not affect the overall analysis, you may drop them. Drop the rows with missing values in the Orange and Location columns.
Remove every row that contains an NA with this code:
data2_new <- data2[complete.cases(data2), ]
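The difference between dropping rows that are missing values in selected columns and dropping every incomplete row can be sketched with made-up data. The column names follow the tutorial, but the values are invented.

```r
# Made-up stand-in for data2
demo <- data.frame(
  Location = c("Park", NA, "Beach", "Beach"),
  Lemon    = c(97, 98, NA, 110),
  Orange   = c(67, 74, 81, NA),
  stringsAsFactors = FALSE
)

# Keep rows where both Orange and Location are present (Lemon may still be NA)
keep <- complete.cases(demo[, c("Orange", "Location")])
demo_subset <- demo[keep, ]

# Keep only rows with no NA in any column
demo_complete <- na.omit(demo)

nrow(demo_subset)
nrow(demo_complete)
```

Restricting the check to selected columns keeps more rows, which matters when the remaining gaps will be imputed rather than dropped.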
/////////////////////////////////////////////////////////////////////////////////////////////////////
Remember: every concept must be researched.
/////////////////////////////////////////////////////////////////////////////////////////////////////
Cleaning Data: Sequence 2
Packages and dataset to install and load
library(dplyr)
library(tidyr)
library(skimr)
The starwars dataset (included in dplyr)
skimr provides skim(), which is used to produce the data summary.
Let's extract height, mass and gender from the dataset.
data <- starwars %>%
  select(height, mass, gender)
data
Split the data after installing and loading these packages:
library(rsample)
library(tidymodels)
///////////////////////////////////////////////////////////////////////////////////////////////////////////
set.seed(123)  # any fixed seed makes the random split reproducible
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
Check the number of rows in each split.
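To illustrate what the split produces and how its sizes can be checked, here is a base R sketch of a random split of about three quarters for training, which is the default proportion of initial_split(). It is not the rsample implementation; the data and object names are made up.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Made-up data frame with 20 rows
demo <- data.frame(id = 1:20, value = rnorm(20))

# Randomly pick 75% of the row numbers for training
n <- nrow(demo)
train_idx <- sample(n, size = floor(0.75 * n))

demo_train <- demo[train_idx, ]
demo_test  <- demo[-train_idx, ]

# Check the number of rows in each part
nrow(demo_train)
nrow(demo_test)
```

With the tutorial's objects, nrow(data_train) and nrow(data_test) give the same kind of check.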
You must clean the datasets (as in the first tutorial).
Keep data_test for validation and clean data_train.
Creating a new feature, bmi
# Note: starwars height is in cm, so divide height by 100 first if you want the conventional kg/m^2.
data_train <- data_train %>%
  mutate(bmi = mass / (height * height))
data_train
Checking for missing values
Use skim() to check for missing values:
skim(data_train)
Or use the base R checks:
any(is.na(data_train))
colSums(is.na(data_train))
Drop the missing values in columns where they are very few: here, the missing height and gender values.
Impute the missing values for mass and bmi.
(Both examples below work.)
Syntax: ifelse(condition, value_if_true, value_if_false)
data_tr_imputed <- data_train %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi))
data_tr_imputed
gender is a categorical variable and must be encoded (changed to 1 or 0).
data_tr_imputed_encoded <- data_tr_imputed %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)
data_tr_imputed_encoded
Feature Scaling
Creating a function for normalisation (subtract the mean, then divide by the standard deviation)
normalize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}
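A quick check of this kind of scaling on made-up numbers: a standardizing function subtracts the mean and divides by the standard deviation, so the result should have mean 0 and standard deviation 1. The vector x is an invented example.

```r
# Standardize a feature: subtract its mean, divide by its standard deviation
normalize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}

x <- c(150, 160, 170, 180, 190)  # made-up heights
z <- normalize(x)

z
mean(z)  # approximately 0
sd(z)    # approximately 1
```

mean() and sd() return NA if the feature still contains missing values, which is one reason imputation comes before scaling in the steps that follow.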
Complete Process Pipeline
Putting the whole data cleaning process into one pipeline
Steps
I. Feature engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature scaling.
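data_train %>%
  mutate(bmi = mass / (height * height)) %>%                           # I. feature engineering
  drop_na(height, gender) %>%                                          # II. drop the few missing rows
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),   # II. impute with the mean
         bmi = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi)) %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%   # III. encoding
  select(-gender) %>%
  mutate(across(everything(), normalize))                              # IV. scaling (replaces mutate_all)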
Using Recipes for the Data Cleaning Pipeline
install.packages("recipes")
library(recipes)
[The recipes package provides functions for all of the steps coded above. In current versions of recipes, step_meanimpute() has been renamed step_impute_mean().]
data_train %>%
  recipe() %>%
  step_mutate(BMI = mass / (height * height)) %>%
  step_naomit(height, gender) %>%
  step_impute_mean(mass, BMI) %>%
  step_dummy(gender) %>%
  step_normalize(everything()) %>%
  prep()
/////////////////////////ENCODING CATEGORICAL DATASET Using Iris/////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////
iris %>%
  mutate(Species_versicolor = ifelse(Species == "versicolor", 1, 0),
         Species_virginica = ifelse(Species == "virginica", 1, 0)) %>%
  select(-Species)  # remove the original Species column
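The same 0/1 encoding can also be produced without writing one ifelse() per level. The sketch below uses base R's model.matrix(), which is a swapped-in technique, not the method used above. The first level (setosa) becomes the baseline and gets no column, and the column names are generated by R.

```r
# Build dummy columns for Species; [, -1] drops the intercept column
dummies <- model.matrix(~ Species, data = iris)[, -1]

# Combine the numeric columns with the dummy columns, leaving Species out
iris_encoded <- cbind(iris[, names(iris) != "Species"], dummies)

head(iris_encoded)
colnames(dummies)  # names generated by model.matrix
```

This scales better when a variable has many levels, at the cost of less control over the column names.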