4.18 Data Wrangling Slides Part1
slides: bit.ly/berd_tidy1
pdf: bit.ly/berd_tidy1_pdf
Load files for today's workshop
Open the slides of this workshop: bit.ly/berd_tidy1
Open the pre-workshop homework
Follow steps 1-5: Download zip folder, open berd_tidyverse_project.Rproj
Open a new R script and run the commands below:
# install.packages("tidyverse")
library(tidyverse)
library(lubridate)
demo_data <- read_csv("data/yrbss_demo.csv")
Allison Horst
Learning objectives
Part 1:
Part 2:
Getting started
What is data wrangling?
"data janitor work"
importing data
cleaning data
changing shape of data
fixing errors and poorly formatted data elements
transforming columns and rows
filtering, subsetting
standardize file paths
keep everything together
a whole folder can be shared and run on another computer
Useful keyboard shortcuts
action                 mac                windows/linux
run code in script     cmd + enter        ctrl + enter
<-                     option + -         alt + -
%>% (covered later)    cmd + shift + m    ctrl + shift + m
Try typing (with shortcut) and running
y <- 5
y
Now, in the console, press the up arrow.
Tibbles
Data frames vs. tibbles
Previously we learned about data frames
A tibble is a data frame but with perks
Import data as a data frame (try this)
Base R functions import data as data frames (read.csv, read.table, etc)
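A minimal sketch of the two import routes. The object names mydata_df and mydata_tib follow the comparison slide later on. The temporary CSV is a hypothetical stand-in for data/yrbss_demo.csv so the code runs without the workshop folder.

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Hypothetical stand-in for data/yrbss_demo.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,bmi",
             "335340,17 years old,27.6",
             "638618,16 years old,29.3"), csv_path)

mydata_df  <- read.csv(csv_path)                     # base R: plain data frame
mydata_tib <- suppressMessages(read_csv(csv_path))   # readr: tibble

class(mydata_df)    # "data.frame"
class(mydata_tib)   # includes "tbl_df" as well as "data.frame"
```

In the workshop project you would pass "data/yrbss_demo.csv" to both functions instead of csv_path.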
# A tibble: 20 x 11
id age sex grade race4 bmi weight_kg text_while_driv…
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 3.35e5 17 y… Fema… 10th White 27.6 66.2 <NA>
2 6.39e5 16 y… Fema… 9th <NA> 29.3 84.8 <NA>
3 9.22e5 14 y… Male 9th White 18.2 57.6 <NA>
4 9.23e5 15 y… Male 9th White 21.4 60.3 <NA>
5 9.24e5 15 y… Male 10th Blac… 19.6 63.5 <NA>
6 9.26e5 16 y… Male 10th All … 22.2 70.3 <NA>
7 9.34e5 16 y… Fema… 10th All … 21.0 45.4 <NA>
8 9.35e5 17 y… Fema… 12th All … 17.5 43.1 <NA>
9 1.10e6 15 y… Male 10th All … 22.5 79.4 <NA>
10 1.11e6 17 y… Fema… 9th Blac… 26.6 68.0 <NA>
11 1.31e6 16 y… Male 10th Hisp… 21.2 67.1 0 days
12 1.31e6 17 y… Male 12th Hisp… 19.5 56.2 1 or 2 days
13 1.31e6 17 y… Male 11th Hisp… 20.6 61.7 1 or 2 days
14 1.31e6 15 y… Fema… 10th Hisp… 27.5 70.3 0 days 11 / 54
Compare & contrast data frame and tibble
Run the code below
data frame
glimpse(mydata_df)
str(mydata_df) # How are glimpse() and str() different?
head(mydata_df)
summary(mydata_df)
class(mydata_df) # What information does class() give?
tibble
glimpse(mydata_tib)
str(mydata_tib)
head(mydata_tib)
summary(mydata_tib)
class(mydata_tib)
Tibble perks
Viewing tibbles:
variable types are given (character, factor, double, integer, boolean, date)
number of rows & columns shown are limited for easier viewing
Other perks:
Tidy Data
What are tidy data?
1. Each variable forms a column
2. Each observation forms a row
3. Each value has its own cell
Untidy data: example 1
untidy_data <- tibble(
  name = c("Ana", "Bob", "Cara"),
  meds = c("advil 600mg 2xday", "tylenol 650mg 4xday", "advil 200mg 3xday")
)
untidy_data
# A tibble: 3 x 2
name meds
<chr> <chr>
1 Ana advil 600mg 2xday
2 Bob tylenol 650mg 4xday
3 Cara advil 200mg 3xday
Tidy data: example 1
You will learn how to do this!
untidy_data %>%
  separate(col = meds, into = c("med_name", "dose_mg", "times_per_day"), sep = " ") %>%
  mutate(times_per_day = as.numeric(str_remove(times_per_day, "xday")),
         dose_mg = as.numeric(str_remove(dose_mg, "mg")))
# A tibble: 3 x 4
name med_name dose_mg times_per_day
<chr> <chr> <dbl> <dbl>
1 Ana advil 600 2
2 Bob tylenol 650 4
3 Cara advil 200 3
Untidy data: example 2
untidy_data2 <- tibble(
  name = c("Ana", "Bob", "Cara"),
  wt_07_01_2018 = c(100, 150, 140),
  wt_08_01_2018 = c(104, 155, 138),
  wt_09_01_2018 = c(NA, 160, 142)
)
untidy_data2
# A tibble: 3 x 4
name wt_07_01_2018 wt_08_01_2018 wt_09_01_2018
<chr> <dbl> <dbl> <dbl>
1 Ana 100 104 NA
2 Bob 150 155 160
3 Cara 140 138 142
Tidy data: example 2
You will learn how to do this!
untidy_data2 %>%
  gather(key = "date", value = "weight", -name) %>%
  mutate(date = str_remove(date, "wt_"),
         date = dmy(date)) # dmy() is a function in the lubridate package
# A tibble: 9 x 3
name date weight
<chr> <date> <dbl>
1 Ana 2018-01-07 100
2 Bob 2018-01-07 150
3 Cara 2018-01-07 140
4 Ana 2018-01-08 104
5 Bob 2018-01-08 155
6 Cara 2018-01-08 138
7 Ana 2018-01-09 NA
8 Bob 2018-01-09 160
9 Cara 2018-01-09 142
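gather() still works, but current tidyr documentation marks it as superseded by pivot_longer(). The sketch below is one possible equivalent of the example above, not the slides' own code. The name tidy_data2 is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

untidy_data2 <- tibble(
  name = c("Ana", "Bob", "Cara"),
  wt_07_01_2018 = c(100, 150, 140),
  wt_08_01_2018 = c(104, 155, 138),
  wt_09_01_2018 = c(NA, 160, 142)
)

tidy_data2 <- untidy_data2 %>%
  # every column except name becomes a (date, weight) pair
  pivot_longer(cols = -name, names_to = "date", values_to = "weight") %>%
  # strip the "wt_" prefix, then parse as day-month-year
  mutate(date = dmy(str_remove(date, "wt_")))

tidy_data2
```

The result has the same 9 rows and 3 columns as the gather() version. The row order differs: pivot_longer() keeps each person's rows together rather than grouping by date.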
How to tidy?
Tools for tidying data
tidyverse functions
tidyverse is a suite of packages that implement tidy methods for data importing,
cleaning, and wrangling
load the tidyverse packages by running the code library(tidyverse)
see pre-workshop homework for code to install tidyverse
subset rows/columns
add new rows/columns
split apart or unite columns
join together different data sets (part 2)
make data long or wide (part 2)
How to use the pipe %>%
The pipe operator %>% strings together commands to be performed sequentially
# A tibble: 3 x 11
id age sex grade race4 bmi weight_kg text_while_driv…
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 335340 17 y… Fema… 10th White 27.6 66.2 <NA>
2 638618 16 y… Fema… 9th <NA> 29.3 84.8 <NA>
3 922382 14 y… Male 9th White 18.2 57.6 <NA>
# … with 3 more variables: smoked_ever <chr>, bullied_past_12mo <lgl>,
# height_m <dbl>
Always first list the tibble that the commands are being applied to
Can use multiple pipes to run multiple commands in sequence
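A small sketch of both points, using a hypothetical three-row tibble demo in place of demo_data (the values are copied from the output above).

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(id  = c(335340, 638618, 922382),
               bmi = c(27.6, 29.3, 18.2))

first_two_call <- head(demo, n = 2)      # ordinary function call
first_two_pipe <- demo %>% head(n = 2)   # same thing: demo becomes the first argument

# Several pipes in sequence, read top to bottom
top_bmi <- demo %>%
  arrange(desc(bmi)) %>%   # sort by bmi, largest first
  head(n = 1)              # then keep the first row
```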
What does the following code do?
About the data
Data from the CDC's Youth Risk Behavior Surveillance System (YRBSS)
the data in yrbss_demo.csv are a subset of data in the R package yrbss, which includes
YRBSS from 1991-2013
Subsetting data
math: >, <, >=, <=
double = for "is equal to": ==
& (and)
| (or)
!= (not equal)
is.na() to filter based on missing values
%in% to filter based on group membership
! in front negates the statement, as in
!is.na(age)
!(grade %in% c("9th","10th"))
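A few of these operators inside filter(), as a sketch. The tibble demo is a hypothetical five-row stand-in for demo_data, and the result names are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record = 1:5,
  grade  = c("9th", "10th", "12th", NA, "11th"),
  bmi    = c(17.2, NA, 20.2, 28.0, 33.1)
)

over20       <- demo %>% filter(bmi > 20)                        # records 3, 4, 5
ninth_tenth  <- demo %>% filter(grade %in% c("9th", "10th"))     # records 1, 2
upper_grades <- demo %>% filter(!(grade %in% c("9th", "10th")))  # records 3, 4, 5
complete_bmi <- demo %>% filter(!is.na(bmi), grade != "9th")     # records 3, 5
```

Note that NA %in% c("9th", "10th") is FALSE rather than NA, so negating it keeps the row with a missing grade (record 4). The comparison grade != "9th" returns NA for that row, and filter() drops it.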
# A tibble: 10,375 x 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 333862 17 years o… Fema… 12th White White 20.2 57.2
2 1095530 15 years o… Male 10th Black or Af… Black or Af… 28.0 85.7
3 1303997 14 years o… Male 9th All other r… Multiple - … 24.5 66.7
4 926649 16 years o… Male 11th All other r… Asian 20.5 70.3
5 506337 18 years o… Male 12th Hispanic/La… Hispanic/La… 33.1 123.
6 1307180 16 years o… Male 10th Hispanic/La… Hispanic/La… 21.8 66.7
7 1312128 15 years o… Fema… 10th White White 22.0 65.8
Compare to base R
Bracket method: need to repeat tibble name
Need to use $
Very nested and confusing to read
Keeps NAs
Pipe method: list tibble name once
No $ needed since uses "non-standard evaluation": filter() knows grade is a column in demo_data
Removes NAs
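The NA difference is easy to miss, so here is a minimal sketch on a hypothetical three-row data frame (not the workshop data).

```r
library(dplyr)

demo <- data.frame(record = 1:3, grade = c("9th", NA, "10th"))

# Bracket method: the NA comparison produces an extra row filled with NAs
bracket_result <- demo[demo$grade == "9th", ]

# Pipe method: rows where the condition is NA are dropped
pipe_result <- demo %>% filter(grade == "9th")

nrow(bracket_result)   # 2
nrow(pipe_result)      # 1
```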
Subset by columns
select() ∼ columns
select columns (variables)
no quotes needed around variable names
can be used to rearrange columns
uses special syntax that is flexible and has many options
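A sketch of the points above on a hypothetical two-row stand-in for demo_data. The result names are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record = c(931897, 333862),
  age    = c("15 years old", "17 years old"),
  sex    = c("Female", "Female"),
  grade  = c("10th", "12th")
)

two_cols    <- demo %>% select(record, grade)        # keep two columns, no quotes
without_sex <- demo %>% select(-sex)                 # drop one column
grade_first <- demo %>% select(grade, everything())  # rearrange: grade moves to the front
```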
# A tibble: 20,000 x 2
record grade
<dbl> <chr>
1 931897 10th
2 333862 12th
3 36253 11th
4 1095530 10th
5 1303997 9th
6 261619 9th
7 926649 11th
8 1309082 12th
9 506337 12th
10 180494 10th
# … with 19,990 more rows
Compare to base R
Base R:
Need brackets
Need quotes around column names
select():
No quotes needed and easier to read.
More flexible, either of following work:
# A tibble: 20,000 x 3
record age sex
<dbl> <chr> <chr>
1 931897 15 years old Female
2 333862 17 years old Female
3 36253 18 years old or older Male
4 1095530 15 years old Male
5 1303997 14 years old Male
6 261619 17 years old Male
7 926649 16 years old Male
8 1309082 17 years old Male
9 506337 18 years old or older Male
10 180494 14 years old Male
# … with 19,990 more rows
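The code for this comparison did not survive in the text, so the following is only a plausible reconstruction on a hypothetical stand-in tibble. It shows one bracket form and two select() forms that pick the same three columns.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record = c(931897, 333862),
  age    = c("15 years old", "17 years old"),
  sex    = c("Female", "Female"),
  grade  = c("10th", "12th")
)

base_cols   <- demo[, c("record", "age", "sex")]   # brackets and quoted names
select_list <- demo %>% select(record, age, sex)   # list the columns
select_span <- demo %>% select(record:sex)         # or give a range of columns
```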
Column selection syntax options
There are many ways to select a set of variable names (columns):
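The list of options is missing from the text here. As a partial sketch, these are some commonly used selection helpers from dplyr and tidyselect, run on a hypothetical one-row tibble with the column names seen in the outputs.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(record = 1, age = "15 years old", race4 = "White",
               race7 = "White", bmi = 17.2, stweight = 54.4)

by_prefix  <- demo %>% select(starts_with("race"))   # race4, race7
by_suffix  <- demo %>% select(ends_with("weight"))   # stweight
by_pattern <- demo %>% select(contains("ace"))       # race4, race7
by_range   <- demo %>% select(record:race4)          # record, age, race4
by_drop    <- demo %>% select(-(record:age))         # everything except record and age
by_type    <- demo %>% select(where(is.numeric))     # record, bmi, stweight
```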
rename() ∼ columns
renames column variables
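A minimal sketch, assuming the goal is the output shown here, where the column record appears as id. The syntax is new_name = old_name; the tibble demo is a hypothetical stand-in for demo_data.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(record = c(931897, 333862), grade = c("10th", "12th"))

renamed <- demo %>% rename(id = record)   # new name on the left, old name on the right
names(renamed)                            # "id" "grade"
```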
# A tibble: 20,000 x 8
id age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years o… Fema… 10th White White 17.2 54.4
2 333862 17 years o… Fema… 12th White White 20.2 57.2
3 36253 18 years o… Male 11th Hispanic/La… Hispanic/La… NA NA
4 1095530 15 years o… Male 10th Black or Af… Black or Af… 28.0 85.7
5 1303997 14 years o… Male 9th All other r… Multiple - … 24.5 66.7
6 261619 17 years o… Male 9th All other r… <NA> NA NA
7 926649 16 years o… Male 11th All other r… Asian 20.5 70.3
8 1309082 17 years o… Male 12th White White 19.3 59.0
9 506337 18 years o… Male 12th Hispanic/La… Hispanic/La… 33.1 123.
10 180494 14 years o… Male 10th Black or Af… Black or Af… NA NA
# … with 19,990 more rows
Practice
# Remember: to save output into the same tibble you would use <-
newdata <- newdata %>% select(-record)
Do the following data wrangling steps in order so that the output from the previous step is
the input for the next step. Save the results in each step as newdata.
1. Import demo_data.csv in the data folder if you haven't already done so.
2. Filter newdata to only keep "Asian" or "Native Hawaiian/other PI" subjects that are in the
9th grade, and save again as newdata.
3. Filter newdata to remove subjects younger than 13, and save as newdata.
5. How many rows does the resulting newdata have? How many columns?
Changing the data
Make new variables
mutate()
Use mutate() to add new columns to a tibble
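A sketch of mutate() on a hypothetical three-row stand-in for demo_data. It assumes stweight is weight in kg, so height in metres follows from BMI = weight / height^2. The column bmi_over_20 is made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record   = c(931897, 333862, 36253),
  bmi      = c(17.2, 20.2, NA),
  stweight = c(54.4, 57.2, NA)
)

newdata <- demo %>%
  mutate(height_m    = sqrt(stweight / bmi),   # assumes stweight is in kg
         bmi_over_20 = bmi > 20)               # logical column; NA stays NA

newdata
```

Existing columns are kept, and the new ones are added on the right.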
# A tibble: 20,000 x 3
record bmi stweight
<dbl> <dbl> <dbl>
1 931897 17.2 54.4
2 333862 20.2 57.2
3 36253 NA NA
4 1095530 28.0 85.7
5 1303997 24.5 66.7
6 261619 NA NA
7 926649 20.5 70.3
8 1309082 19.3 59.0
9 506337 33.1 123.
10 180494 NA NA
mutate() practice
What do the following commands do?
First guess and then try them out.
case_when() with mutate()
Use case_when() to create multi-valued variables that depend on an existing column
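A sketch that reproduces the grouping in the output here. The cutoffs (18.5, 25, 30) are the commonly used BMI categories and are an assumption; the slide's own code is not in the text.

```r
library(dplyr)

# bmi values copied from the output shown; the cutoffs below are assumed
demo <- tibble(bmi = c(17.2, 20.2, NA, 28.0, 24.5))

grouped <- demo %>%
  mutate(bmi_group = case_when(
    bmi < 18.5 ~ "underweight",   # conditions are checked in order
    bmi < 25   ~ "normal",        # reached only if bmi >= 18.5
    bmi < 30   ~ "overweight",
    bmi >= 30  ~ "obese"
  ))

grouped$bmi_group
```

A missing bmi matches none of the conditions, so its bmi_group is NA.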
# A tibble: 6 x 2
bmi bmi_group
<dbl> <chr>
1 17.2 underweight
2 20.2 normal
3 NA <NA>
4 28.0 overweight
5 24.5 normal
separate() and unite()
separate(): when one column has multiple types of information
removes original column by default
unite(): paste columns together using a separator
removes original columns by default
# A tibble: 20,000 x 5
a y o w w2
<chr> <chr> <chr> <chr> <chr>
1 15 years old <NA> <NA>
2 17 years old <NA> <NA>
3 18 years old or older
4 15 years old <NA> <NA>
5 14 years old <NA> <NA>
6 17 years old <NA> <NA>
# A tibble: 20,000 x 1
sexgr
<chr>
1 Female:10th
2 Female:12th
3 Male:11th
4 Male:10th
5 Male:9th
6 Male:9th
7 Male:11th
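A sketch that produces output of this shape on a hypothetical two-row stand-in for demo_data. The column names a, y, o, w, w2 and sexgr are taken from the output; the exact calls used on the slide are not in the text, so these are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  age   = c("15 years old", "18 years old or older"),
  sex   = c("Female", "Male"),
  grade = c("10th", "11th")
)

# Split age on spaces into five columns; short values are padded with NA on the right
age_parts <- demo %>%
  separate(age, into = c("a", "y", "o", "w", "w2"), sep = " ", fill = "right")

# Paste sex and grade into one column; the originals are removed by default
sexgr <- demo %>%
  unite("sexgr", sex, grade, sep = ":")
```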
separate() and unite() practice
What do the following commands do?
More commands to filter rows
Remove rows with missing data
na.omit removes all rows with any missing (NA) values in any column
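A minimal sketch on a hypothetical three-row tibble. tidyr's drop_na() is shown alongside as an option when only certain columns matter; it is not mentioned on the slide.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record = 1:3,
  bmi    = c(17.2, NA, 20.2),
  grade  = c("10th", "11th", NA)
)

complete_rows <- demo %>% na.omit()      # drops any row with an NA in any column
bmi_known     <- demo %>% drop_na(bmi)   # drops rows only where bmi is NA
```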
# A tibble: 12,897 x 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years o… Fema… 10th White White 17.2 54.4
2 333862 17 years o… Fema… 12th White White 20.2 57.2
3 1095530 15 years o… Male 10th Black or Af… Black or Af… 28.0 85.7
4 1303997 14 years o… Male 9th All other r… Multiple - … 24.5 66.7
5 926649 16 years o… Male 11th All other r… Asian 20.5 70.3
6 1309082 17 years o… Male 12th White White 19.3 59.0
7 506337 18 years o… Male 12th Hispanic/La… Hispanic/La… 33.1 123.
8 1307180 16 years o… Male 10th Hispanic/La… Hispanic/La… 21.8 66.7
9 1312128 15 years o… Fema… 10th White White 22.0 65.8
10 770177 16 years o… Fema… 10th White White 32.4 86.2
# … with 12,887 more rows
# A tibble: 4 x 2
name race
<chr> <chr>
1 Ana Hispanic
2 Bob Other
3 Cara White
4 Ana Hispanic
# A tibble: 3 x 2
name race
<chr> <chr>
1 Ana Hispanic
2 Bob Other
3 Cara White
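The heading and code for these two outputs are missing from the text. They look like a duplicated row being removed, and one way to do that is dplyr's distinct(). The sketch below is a guess at the idea, with data_dups as a made-up name.

```r
library(dplyr)

# Values copied from the 4-row output; the object name is hypothetical
data_dups <- tibble(
  name = c("Ana", "Bob", "Cara", "Ana"),
  race = c("Hispanic", "Other", "White", "Hispanic")
)

no_dups <- data_dups %>% distinct()   # keeps the first copy of each repeated row
no_dups
```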
Order rows: arrange()
Use arrange() to order the rows by the values in specified columns
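A sketch of ascending and descending order on a hypothetical four-row tibble. Taking head() afterwards gives the smallest or largest values, which appears to be what the two outputs here show.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(record = 1:4, bmi = c(20.5, 13.2, NA, 53.9))

lowest_bmi  <- demo %>% arrange(bmi) %>% head(n = 2)         # ascending is the default
highest_bmi <- demo %>% arrange(desc(bmi)) %>% head(n = 2)   # desc() reverses the order
```

Rows with a missing bmi are placed last in both cases.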
# A tibble: 3 x 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 635432 13 years… Female 9th Hispanic/Lat… Hispanic/Lat… 13.2 27.7
2 501608 15 years… Male 9th All other ra… Asian 13.2 47.6
3 1097740 16 years… Male 9th Black or Afr… Black or Afr… 13.3 45.4
# A tibble: 3 x 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 324452 16 years old Male 11th Black or Af… Black or Af… 53.9 91.2
2 1310082 18 years ol… Male 11th Black or Af… Black or Af… 53.5 160.
3 328160 18 years ol… Male <NA> Black or Af… Black or Af… 53.4 128.
Practice
Do the following data wrangling steps in order so that the output from the previous step is
the input for the next step. Save the results in each step as newdata.
1. Import demo_data.csv in the data folder if you haven't already done so.
2. Create a variable called grade_num that has the numeric grade number (use
as.numeric()).
5. Create a binary variable called bmi_normal that is equal to 1 when bmi is between 18.5 and
24.9 and 0 when it is outside that range.
Advanced column commands
Mutating multiple columns at once: mutate_*
variants of mutate() that are useful for mutating multiple columns at once
mutate_at(), mutate_if(), mutate_all(), etc.
which columns get mutated depends on a predicate, can be:
a function that returns TRUE/FALSE like is.numeric(), or
variable names through vars()
# mutate_if
demo_data %>% mutate_if(is.numeric, as.character) # as.character() is a function
demo_data %>% mutate_if(is.character, tolower) # tolower() is a function
demo_data %>% mutate_if(is.double, round, digits=0) # arguments to function can go after
# mutate_at
demo_data %>% mutate_at(vars(age:grade), toupper) # toupper() is a function
demo_data %>% mutate_at(vars(bmi,stweight), log)
demo_data %>% mutate_at(vars(contains("race")), str_detect, pattern = "White")
# mutate_all
demo_data %>% mutate_all(as.character)
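Since dplyr 1.0.0 the mutate_*() variants are marked as superseded in favour of across(). They still work, but the sketch below shows rough across() equivalents for a few of the calls above, on a hypothetical two-row stand-in for demo_data.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(
  record   = c(1, 2),
  grade    = c("9th", "10th"),
  bmi      = c(17.2, 20.2),
  stweight = c(54.4, 57.2)
)

as_char <- demo %>% mutate(across(where(is.numeric), as.character))   # like mutate_if
upper   <- demo %>% mutate(across(where(is.character), toupper))      # like mutate_if
logged  <- demo %>% mutate(across(c(bmi, stweight), log))             # like mutate_at
all_chr <- demo %>% mutate(across(everything(), as.character))        # like mutate_all
```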
Selecting & renaming multiple columns
select_*() & rename_*() are variants of select() and rename()
use like mutate_*() options on previous slide
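A short sketch of three of these variants on a hypothetical one-row tibble. Like mutate_*(), they are marked as superseded in recent dplyr versions but are still available.

```r
library(dplyr)

# Hypothetical stand-in for demo_data
demo <- tibble(record = 1, race4 = "White", race7 = "White", bmi = 17.2)

upper_names  <- demo %>% rename_all(toupper)                            # rename every column
race_upper   <- demo %>% rename_at(vars(contains("race")), toupper)     # rename chosen columns
numeric_only <- demo %>% select_if(is.numeric)                          # select by predicate
```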
The pipe operator %>% revisited
a function performed on (usually) a data frame or tibble
the result is a transformed data set as a tibble
Suppose you want to perform a series of operations on a data.frame or tibble mydata using
hypothetical functions f(), g(), h():
Perform f(mydata)
use the output as an argument to g(): g(f(mydata))
use the output as an argument to h(): h(g(f(mydata)))
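The same chain written with pipes, as a sketch. f(), g() and h() are given toy definitions here, and mydata is a numeric vector rather than a tibble, purely so the equivalence can be checked.

```r
library(magrittr)   # provides %>%; also loaded by library(tidyverse)

# Toy definitions of the hypothetical functions
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) sqrt(x)

mydata <- c(1, 7)

nested <- h(g(f(mydata)))                  # read inside out
piped  <- mydata %>% f() %>% g() %>% h()   # read left to right

identical(nested, piped)   # TRUE
```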
Why use the pipe?
makes code more readable
h(f(g(mydata))) can get complicated with multiple arguments
e.g. h(f(g(mydata, na.rm=T), print=FALSE), type = "mean")
Resources - Tidyverse & Data Wrangling
Links
Possible Future Workshop Topics?
reproducible reports in R
tables
ggplot2 visualization
advanced tidyverse: functions, purrr
statistical modeling in R
Contact info:
Jessica Minnier: [email protected]