

CleaningData Chapter 4

The document discusses cleaning 12 months of historical weather data from Boston, USA, beginning in December 2014. The data start out "dirty": column names are values, variables are coded incorrectly, and some values are missing or extreme. The document then outlines techniques in R for cleaning the data: understanding its structure, looking at it, visualizing it, tidying column names, converting types and dates, and finding and handling missing values and errors. It closes with a summary of what was accomplished: inspecting and tidying the data, improving date representations, correcting codings and errors, dealing with missing data, and visualizing the result.

Uploaded by

Mahmoud Trigui

CLEANING DATA IN R

Time to put it all together!

Cleaning Data in R

The challenge

Historical weather data from Boston, USA

12 months beginning Dec 2014

The data are dirty

Column names are values

Variables coded incorrectly

Missing and extreme values

Clean the data!

Understanding the structure of your data

class() - Class of data object

dim() - Dimensions of data

names() - Column names

str() - Preview of data with helpful details

glimpse() - Better version of str() from dplyr

summary() - Summary of data
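
As a minimal sketch of these functions in use, the following builds a small data frame and inspects it with base R. The name weather_demo and its contents are illustrative assumptions, not the actual Boston data. glimpse() is left out because it needs the dplyr package to be installed.

```r
# Hypothetical stand-in for a weather data set (illustrative values only)
weather_demo <- data.frame(
  year    = c(2014, 2014, 2014),
  month   = c(12, 12, 12),
  measure = c("Max.TemperatureF", "Mean.TemperatureF", "Min.TemperatureF"),
  X1      = c(64, 52, 39)
)

class(weather_demo)    # class of the object
dim(weather_demo)      # number of rows, then number of columns
names(weather_demo)    # column names
str(weather_demo)      # compact preview of every column
summary(weather_demo)  # per-column summary statistics
```

Running the structure checks first makes it easier to spot columns whose type or name does not match what they are supposed to hold.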

Looking at your data

head() - View top of dataset

tail() - View bottom of dataset

print() - View entire dataset (not recommended!)
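
A short sketch of head() and tail(), assuming a made-up data frame called daily with one row per day of a 31-day month. Both functions show six rows by default and accept an n argument to change that.

```r
# Hypothetical data: one row per day of a 31-day month (made-up values)
daily <- data.frame(day = 1:31, max_temp_f = seq(30, 60, by = 1))

head(daily)         # first 6 rows by default
tail(daily)         # last 6 rows by default
head(daily, n = 3)  # first 3 rows only
```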

Visualizing your data

hist() - View histogram of a single variable

plot() - View plot of two variables
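
The two plotting functions can be sketched as follows. The data frame temps and its column names are assumptions made for illustration; any two numeric columns would work the same way.

```r
# Hypothetical daily readings (illustrative values)
temps <- data.frame(
  mean_temp_f = c(52, 38, 44, 37, 34, 42, 30, 24, 39),
  mean_dew_f  = c(40, 27, 42, 21, 25, 40, 20, 16, 41)
)

# Distribution of a single variable
hist(temps$mean_temp_f, main = "Mean temperature", xlab = "Degrees F")

# Relationship between two variables
plot(temps$mean_temp_f, temps$mean_dew_f,
     xlab = "Mean temperature (F)", ylab = "Mean dew point (F)")
```

A histogram is a quick way to notice a value far outside the rest of the distribution, and a scatter plot can reveal points that break an otherwise clear relationship.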

Let's practice!

Let's tidy the data

Column names are values


> head(weather)
  X year month           measure X1 X2 X3 X4 X5 X6 X7 X8 X9 ...
1 1 2014    12  Max.TemperatureF 64 42 51 43 42 45 38 29 49 ...
2 2 2014    12 Mean.TemperatureF 52 38 44 37 34 42 30 24 39 ...
3 3 2014    12  Min.TemperatureF 39 33 37 30 26 38 21 18 29 ...
4 4 2014    12    Max.Dew.PointF 46 40 49 24 37 45 36 28 49 ...
5 5 2014    12    MeanDew.PointF 40 27 42 21 25 40 20 16 41 ...
6 6 2014    12     Min.DewpointF 26 17 24 13 12 36 -3  3 28 ...
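
One way to move the day columns into rows is sketched below in base R. This is an illustration under stated assumptions rather than the method used in the course: the data frame wide is a small made-up excerpt, and day_cols is a hypothetical helper vector. Packages such as tidyr offer pivot_longer() for the same job.

```r
# Hypothetical wide excerpt: one column per day of the month
wide <- data.frame(
  measure = c("Max.TemperatureF", "Min.TemperatureF"),
  X1 = c(64, 39),
  X2 = c(42, 33),
  X3 = c(51, 37)
)

day_cols <- c("X1", "X2", "X3")

# Stack the day columns. unlist() walks the columns in order, so each
# day name repeats once per original row and the measures cycle.
long <- data.frame(
  measure = rep(wide$measure, times = length(day_cols)),
  day     = rep(day_cols, each = nrow(wide)),
  value   = unlist(wide[day_cols], use.names = FALSE)
)

long
```

The result has one row per measure and day, with the former column names stored as values in the day column.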

Values are variable names


> head(weather2)
  X year month           measure day value
1 1 2014    12  Max.TemperatureF  X1    64
2 2 2014    12 Mean.TemperatureF  X1    52
3 3 2014    12  Min.TemperatureF  X1    39
4 4 2014    12    Max.Dew.PointF  X1    46
5 5 2014    12    MeanDew.PointF  X1    40
6 6 2014    12     Min.DewpointF  X1    26
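
Here the measure column holds what should be variable names, so each measure needs its own column. The sketch below uses base R's reshape() on a small made-up excerpt named long; it is one possible approach, not necessarily the one used in the course. tidyr's pivot_wider() is a common alternative.

```r
# Hypothetical long excerpt: one row per day and measure
long <- data.frame(
  day     = c("X1", "X1", "X2", "X2"),
  measure = c("Max.TemperatureF", "Min.TemperatureF",
              "Max.TemperatureF", "Min.TemperatureF"),
  value   = c(64, 39, 42, 33)
)

# Spread the measures into columns, keeping one row per day
wide <- reshape(long, idvar = "day", timevar = "measure", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))

wide
```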

Let's practice!

Prepare the data for analysis

Dates with lubridate


# Load the lubridate package
> library(lubridate)
# Experiment with basic lubridate functions
> ymd("2015-08-25")               # year-month-day
[1] "2015-08-25 UTC"
> ymd("2015 August 25")           # year-month-day
[1] "2015-08-25 UTC"
> mdy("August 25, 2015")          # month-day-year
[1] "2015-08-25 UTC"
> hms("13:33:09")                 # hour-minute-second
[1] "13H 33M 9S"
> ymd_hms("2015/08/25 13.33.09")  # year-month-day hour-minute-second
[1] "2015-08-25 13:33:09 UTC"
# Note: recent lubridate versions return a Date from ymd() and mdy(),
# which prints as "2015-08-25" without the UTC suffix
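
For data whose year, month and day sit in separate columns, the pieces can be pasted together and parsed in one step. The sketch below uses base R's as.Date() so that it runs without extra packages; lubridate's ymd() accepts the same pasted strings. The data frame obs and its values are assumptions for illustration.

```r
# Hypothetical tidy excerpt with separate year, month and day columns
obs <- data.frame(
  year  = c(2014, 2014, 2015),
  month = c(12, 12, 1),
  day   = c("X1", "X2", "X15")
)

# Drop the leading "X" and convert the day to a number
obs$day <- as.numeric(sub("^X", "", obs$day))

# Paste the pieces into "year-month-day" strings and parse them
obs$date <- as.Date(paste(obs$year, obs$month, obs$day, sep = "-"))

obs
```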

Type conversions
> as.character(2016)
[1] "2016"
> as.numeric(TRUE)
[1] 1
> as.integer(99)
[1] 99
> as.factor("something")
[1] something
Levels: something
> as.logical(0)
[1] FALSE
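
The same conversion functions apply to whole columns. The sketch below assumes a made-up data frame named readings whose columns were all read in as text. Note that as.logical() does not understand the strings "0" and "1", so the flag column goes through as.integer() first.

```r
# Hypothetical data read in as text (made-up values)
readings <- data.frame(
  max_temp_f = c("64", "42", "51"),
  rained     = c("1", "0", "1"),
  stringsAsFactors = FALSE
)

str(readings)  # both columns are character

# Convert each column to the type it should have
readings$max_temp_f <- as.numeric(readings$max_temp_f)
readings$rained     <- as.logical(as.integer(readings$rained))

str(readings)  # now numeric and logical

# Text that is not a number becomes NA (R also issues a warning),
# so it is worth checking for new NAs after converting
suppressWarnings(as.numeric(c("12", "n/a")))
```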

Let's practice!

Missing, extreme, and unexpected values

Finding missing values


# Create a small dataset
> x <- data.frame(a = c(2, 5, NA, 8),
                  b = c(NA, 34, 9, NA))
# Return logical matrix of TRUEs and FALSEs
> is.na(x)
         a     b
[1,] FALSE  TRUE
[2,] FALSE FALSE
[3,]  TRUE FALSE
[4,] FALSE  TRUE
# Count number of TRUEs
> sum(is.na(x))
[1] 3
# Find indices of missing values in column b
> which(is.na(x$b))
[1] 1 4
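
Once missing values are located, a few more base R functions help decide what to do with them. The sketch below reuses the small dataset from above; whether to drop rows or keep them is a judgment call that depends on the analysis.

```r
# Same small dataset as above
x <- data.frame(a = c(2, 5, NA, 8),
                b = c(NA, 34, 9, NA))

colSums(is.na(x))        # missing values per column
complete.cases(x)        # TRUE only for rows with no NA
x[complete.cases(x), ]   # keep only the complete rows
mean(x$a, na.rm = TRUE)  # ignore NAs when summarising a column
```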

Identifying errors

Context matters!

Plausible ranges

Numeric variables in weather data

Percentages (0-100)

Temperatures (Fahrenheit)

Wind speeds (miles per hour)

Pressures (inches of mercury)

Distances (miles)

Eighths (of cloud cover)
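
Plausible ranges like these translate directly into checks. The sketch below assumes a made-up data frame obs with hypothetical column names max_humidity and mean_vis_miles, and deliberately includes one impossible value in each column.

```r
# Hypothetical columns (made-up values, including two deliberate errors)
obs <- data.frame(
  max_humidity   = c(74, 92, 1000, 85),  # percentage, should be 0-100
  mean_vis_miles = c(10, 8, -1, 9)       # distance, should not be negative
)

# A summary often exposes an impossible minimum or maximum
summary(obs$max_humidity)

# Row indices that fall outside the plausible range
bad_humidity <- which(obs$max_humidity < 0 | obs$max_humidity > 100)
bad_vis      <- which(obs$mean_vis_miles < 0)

obs[bad_humidity, ]  # inspect the offending rows before changing anything
```

The checks only flag suspicious rows. Deciding whether a flagged value is a typo, a placeholder, or a genuine extreme still depends on context.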

Let's practice!

Your data are clean!

Clean weather data


# View head of clean data
> head(weather6)
        date    events cloud_cover max_dew_point_f ...
1 2014-12-01      Rain           6              46 ...
2 2014-12-02 Rain-Snow           7              40 ...
3 2014-12-03      Rain           8              49 ...
4 2014-12-04      None           3              24 ...
5 2014-12-05      Rain           5              37 ...
6 2014-12-06      Rain           8              45 ...
# View tail of clean data
> tail(weather6)
          date events cloud_cover max_dew_point_f ...
361 2015-11-26   None           6              49 ...
362 2015-11-27   None           7              52 ...
363 2015-11-28   Rain           8              50 ...
364 2015-11-29   None           4              33 ...
365 2015-11-30   None           6              26 ...
366 2015-12-01   Rain           7              43 ...

Summary of your accomplishments

Inspected the data

Tidied the data

Improved date representations

Dealt with incorrect variable codings

Found and dealt with missing data

Identified and corrected errors

Visualized the result

Congratulations!
