Data manipulation: Tidy data
Artur Rodrigues
2021-04-19
Handouts
http://bit.ly/datam2021
Introduction
"Happy families are all alike; every unhappy family is unhappy in its own way."
--- Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way."
--- Hadley Wickham
Underlying theory: the Tidy Data paper published in the Journal of Statistical Software,
http://www.jstatsoft.org/v59/i10/paper.
Introduction
consistent way to organise your data in R
requires some upfront work, but that work pays off in the long term
allows you to spend more time on the analytic questions at hand
tidyr package: a member of the core tidyverse.
library(tidyr)
library(dplyr) # supplies mutate(), filter(), count() and the joins used in later slides
Tidy data
Displaying the same data in multiple ways
Tidy data
table1
## # A tibble: 14 x 4
## country date cases deaths
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 37889 217
Tidy data
table2
## # A tibble: 28 x 4
## country date variable count
## <chr> <date> <chr> <int>
## 1 Italy 2020-10-16 cases 8803
## 2 Italy 2020-10-16 deaths 83
## 3 Italy 2020-10-17 cases 10009
## 4 Italy 2020-10-17 deaths 55
## 5 Italy 2020-10-18 cases 10925
## 6 Italy 2020-10-18 deaths 47
## 7 Italy 2020-10-19 cases 11705
## 8 Italy 2020-10-19 deaths 69
## 9 Italy 2020-10-20 cases 9337
## 10 Italy 2020-10-20 deaths 73
## # … with 18 more rows
Tidy data
table3
## # A tibble: 14 x 3
## country date cfr
## <chr> <date> <chr>
## 1 Italy 2020-10-16 83/8803
## 2 Italy 2020-10-17 55/10009
## 3 Italy 2020-10-18 47/10925
## 4 Italy 2020-10-19 69/11705
## 5 Italy 2020-10-20 73/9337
## 6 Portugal 2020-10-16 11/2101
## 7 Portugal 2020-10-17 21/2608
## 8 Portugal 2020-10-18 13/2153
## 9 Portugal 2020-10-19 19/1856
## 10 Portugal 2020-10-20 17/1949
## 11 Spain 2020-10-16 222/15186
## 12 Spain 2020-10-17 NA/NA
## 13 Spain 2020-10-18 NA/NA
## 14 Spain 2020-10-19 217/37889
Tidy data
table4a # cases
## # A tibble: 3 x 6
## country `2020-10-16` `2020-10-17` `2020-10-18` `2020-10-19`
## <chr> <int> <int> <int> <int>
## 1 Italy 8803 10009 10925 11705
## 2 Portug… 2101 2608 2153 1856
## 3 Spain 15186 NA NA 37889
## # … with 1 more variable: `2020-10-20` <int>
table4b # deaths
## # A tibble: 3 x 6
## country `2020-10-16` `2020-10-17` `2020-10-18` `2020-10-19`
## <chr> <int> <int> <int> <int>
## 1 Italy 83 55 47 69
## 2 Portug… 11 21 13 19
## 3 Spain 222 NA NA 217
## # … with 1 more variable: `2020-10-20` <int>
Tidy data
Same underlying data
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
Practical instructions
1. Put each dataset in a tibble.
2. Put each variable in a column.
Why ensure that your data is tidy?
1. Consistent way of storing data: it's easier to learn the tools that work with it
2. It allows R's vectorised nature to shine (e.g. mutate and summary functions)
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with
tidy data
Compute CFR (deaths per case)
table1 %>%
mutate(cfr = deaths/cases)
## # A tibble: 14 x 5
## country date cases deaths cfr
## <chr> <date> <int> <int> <dbl>
## 1 Italy 2020-10-16 8803 83 0.00943
## 2 Italy 2020-10-17 10009 55 0.00550
## 3 Italy 2020-10-18 10925 47 0.00430
## 4 Italy 2020-10-19 11705 69 0.00589
## 5 Italy 2020-10-20 9337 73 0.00782
## 6 Portugal 2020-10-16 2101 11 0.00524
## 7 Portugal 2020-10-17 2608 21 0.00805
## 8 Portugal 2020-10-18 2153 13 0.00604
## 9 Portugal 2020-10-19 1856 19 0.0102
## 10 Portugal 2020-10-20 1949 17 0.00872
## 11 Spain 2020-10-16 15186 222 0.0146
## 12 Spain 2020-10-17 NA NA NA
## 13 Spain 2020-10-18 NA NA NA
## 14 Spain 2020-10-19 37889 217 0.00573
Compute cases per country
table1 %>%
count(country, wt = cases)
## # A tibble: 3 x 2
## country n
## * <chr> <int>
## 1 Italy 50779
## 2 Portugal 10667
## 3 Spain 53075
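The weighted count above can be cross-checked against an explicit grouped sum. The following is a minimal sketch on a hypothetical stand-in tibble (table1_demo, holding a few rows copied from table1). count(wt = ) appears to skip missing weights, as the Spain total on the slide suggests, so the explicit version needs na.rm = TRUE to give the same totals.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for table1: a few rows copied from the slides
table1_demo <- tribble(
  ~country,   ~cases,
  "Portugal", 2101L,
  "Portugal", 2608L,
  "Spain",    15186L,
  "Spain",    NA_integer_,
  "Spain",    37889L
)

# Weighted count, as on the slide
by_count <- table1_demo %>%
  count(country, wt = cases)

# Explicit equivalent: group, then sum while dropping missing values
by_sum <- table1_demo %>%
  group_by(country) %>%
  summarise(n = sum(cases, na.rm = TRUE))
```

Without na.rm = TRUE, the Spain total in by_sum would be NA.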
Visualise changes over time
library(ggplot2)
ggplot(table1, aes(date, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
Exercises
1. Compute the cfr for table2, and for table4a + table4b.
You will need to perform four operations:
Extract the number of deaths per country per day.
Extract the number of cases per country per day.
Divide deaths by cases.
Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
2. Recreate the plot showing change in cases over time using table2 instead of
table1. What do you need to do first?
Tentative solution for table2
Extract the number of deaths per country per day:
table_deaths <- table2 %>%
filter(variable == 'deaths') %>%
select(country, date, count) %>%
rename(deaths=count)
Extract the number of cases per country per day:
table_cases <- table2 %>%
filter(variable == 'cases') %>%
select(country, date, count) %>%
rename(cases=count)
Tentative solution for table2
Divide deaths by cases:
cfr = table_deaths$deaths/table_cases$cases
Store back in the appropriate place:
tidy2 <- table_cases
tidy2$deaths <- table_deaths$deaths
tidy2$cfr <- cfr
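The solution above lines up deaths and cases by row position, which only works while both filtered tables are in the same order. A more defensive variant matches rows on their keys. This is a minimal sketch, assuming a hypothetical stand-in tibble (table2_demo) with values copied from the slides, dates kept as plain strings, and the Portugal rows deliberately swapped.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for table2; the Portugal rows are out of order on purpose
table2_demo <- tribble(
  ~country,   ~date,        ~variable, ~count,
  "Italy",    "2020-10-16", "cases",   8803,
  "Italy",    "2020-10-16", "deaths",  83,
  "Portugal", "2020-10-16", "deaths",  11,
  "Portugal", "2020-10-16", "cases",   2101
)

deaths_demo <- table2_demo %>%
  filter(variable == "deaths") %>%
  select(country, date, deaths = count)

cases_demo <- table2_demo %>%
  filter(variable == "cases") %>%
  select(country, date, cases = count)

# Match on country and date instead of relying on row order
tidy2_demo <- inner_join(cases_demo, deaths_demo, by = c("country", "date")) %>%
  mutate(cfr = deaths / cases)
```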
Tentative solution table2
tidy2
## # A tibble: 14 x 5
## country date cases deaths cfr
## <chr> <date> <int> <int> <dbl>
## 1 Italy 2020-10-16 8803 83 0.00943
## 2 Italy 2020-10-17 10009 55 0.00550
## 3 Italy 2020-10-18 10925 47 0.00430
## 4 Italy 2020-10-19 11705 69 0.00589
## 5 Italy 2020-10-20 9337 73 0.00782
## 6 Portugal 2020-10-16 2101 11 0.00524
## 7 Portugal 2020-10-17 2608 21 0.00805
## 8 Portugal 2020-10-18 2153 13 0.00604
## 9 Portugal 2020-10-19 1856 19 0.0102
## 10 Portugal 2020-10-20 1949 17 0.00872
## 11 Spain 2020-10-16 15186 222 0.0146
## 12 Spain 2020-10-17 NA NA NA
## 13 Spain 2020-10-18 NA NA NA
## 14 Spain 2020-10-19 37889 217 0.00573
Recreate the plot using table2
data_plot <- table2 %>%
filter(variable=='cases') %>%
rename(cases = count)
data_plot
## # A tibble: 14 x 4
## country date variable cases
## <chr> <date> <chr> <int>
## 1 Italy 2020-10-16 cases 8803
## 2 Italy 2020-10-17 cases 10009
## 3 Italy 2020-10-18 cases 10925
## 4 Italy 2020-10-19 cases 11705
## 5 Italy 2020-10-20 cases 9337
## 6 Portugal 2020-10-16 cases 2101
## 7 Portugal 2020-10-17 cases 2608
## 8 Portugal 2020-10-18 cases 2153
## 9 Portugal 2020-10-19 cases 1856
## 10 Portugal 2020-10-20 cases 1949
## 11 Spain 2020-10-16 cases 15186
## 12 Spain 2020-10-17 cases NA
## 13 Spain 2020-10-18 cases NA
## 14 Spain 2020-10-19 cases 37889
Recreate the plot using table2
ggplot(data_plot, aes(date, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
Pivoting
Main reasons for untidy data:
1. Unfamiliarity with the principles of tidy data.
2. Data is often organised to facilitate some use other than analysis (e.g. data
entry).
The first step is always to figure out what the variables and observations are.
The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
2. One observation might be scattered across multiple rows.
To fix these problems: pivot_longer() and pivot_wider().
Longer
When some of the column names are not names of variables, but values of a variable.
table4a
## # A tibble: 3 x 6
## country `2020-10-16` `2020-10-17` `2020-10-18` `2020-10-19`
## <chr> <int> <int> <int> <int>
## 1 Italy 8803 10009 10925 11705
## 2 Portug… 2101 2608 2153 1856
## 3 Spain 15186 NA NA 37889
## # … with 1 more variable: `2020-10-20` <int>
Longer
We need to pivot the offending columns into a new pair of variables.
Three parameters:
The set of columns whose names are values, not variables.
The name of the variable to move the column names to. Here it is date.
The name of the variable to move the column values to. Here it’s cases.
The columns to pivot are specified with dplyr::select() style notation.
Table4a
table4a %>%
pivot_longer(c(`2020-10-16`, `2020-10-17`, `2020-10-18`, `2020-10-19`, `2020-10-20`),
names_to = "date",
values_to = "cases")
## # A tibble: 15 x 3
## country date cases
## <chr> <chr> <int>
## 1 Italy 2020-10-16 8803
## 2 Italy 2020-10-17 10009
## 3 Italy 2020-10-18 10925
## 4 Italy 2020-10-19 11705
## 5 Italy 2020-10-20 9337
## 6 Portugal 2020-10-16 2101
## 7 Portugal 2020-10-17 2608
## 8 Portugal 2020-10-18 2153
## 9 Portugal 2020-10-19 1856
## 10 Portugal 2020-10-20 1949
## 11 Spain 2020-10-16 15186
## 12 Spain 2020-10-17 NA
## 13 Spain 2020-10-18 NA
## 14 Spain 2020-10-19 37889
## 15 Spain 2020-10-20 NA
Table4a
table4a %>%
pivot_longer(!country,
names_to = "date",
values_to = "cases")
## # A tibble: 15 x 3
## country date cases
## <chr> <chr> <int>
## 1 Italy 2020-10-16 8803
## 2 Italy 2020-10-17 10009
## 3 Italy 2020-10-18 10925
## 4 Italy 2020-10-19 11705
## 5 Italy 2020-10-20 9337
## 6 Portugal 2020-10-16 2101
## 7 Portugal 2020-10-17 2608
## 8 Portugal 2020-10-18 2153
## 9 Portugal 2020-10-19 1856
## 10 Portugal 2020-10-20 1949
## 11 Spain 2020-10-16 15186
## 12 Spain 2020-10-17 NA
## 13 Spain 2020-10-18 NA
## 14 Spain 2020-10-19 37889
## 15 Spain 2020-10-20 NA
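In the output above, date comes out as a character column because it was built from column names. One way to get a proper Date column in the same step is the names_transform argument of pivot_longer(). This is a minimal sketch, assuming a reasonably recent tidyr and a hypothetical two-date stand-in (table4a_demo) for table4a.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for table4a with only two of the date columns
table4a_demo <- tribble(
  ~country, ~`2020-10-16`, ~`2020-10-17`,
  "Italy",  8803L,         10009L,
  "Spain",  15186L,        NA_integer_
)

tidy4a_demo <- table4a_demo %>%
  pivot_longer(!country,
               names_to = "date",
               values_to = "cases",
               # convert the former column names from character to Date
               names_transform = list(date = as.Date))
```

The same result can be had by calling mutate(date = as.Date(date)) after pivoting.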
Table4b
table4b %>%
pivot_longer(!country,
names_to = "date",
values_to = "deaths")
## # A tibble: 15 x 3
## country date deaths
## <chr> <chr> <int>
## 1 Italy 2020-10-16 83
## 2 Italy 2020-10-17 55
## 3 Italy 2020-10-18 47
## 4 Italy 2020-10-19 69
## 5 Italy 2020-10-20 73
## 6 Portugal 2020-10-16 11
## 7 Portugal 2020-10-17 21
## 8 Portugal 2020-10-18 13
## 9 Portugal 2020-10-19 19
## 10 Portugal 2020-10-20 17
## 11 Spain 2020-10-16 222
## 12 Spain 2020-10-17 NA
## 13 Spain 2020-10-18 NA
## 14 Spain 2020-10-19 217
## 15 Spain 2020-10-20 NA
Combine in a single tibble
tidy4a <- table4a %>%
pivot_longer(!country,
names_to = "date",
values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(!country,
names_to = "date",
values_to = "deaths")
Combine in a single tibble
full_join(tidy4a, tidy4b)
## Joining, by = c("country", "date")
## # A tibble: 15 x 4
## country date cases deaths
## <chr> <chr> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 37889 217
## 15 Spain 2020-10-20 NA NA
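Once the two tables are joined, the CFR asked for in the earlier exercise is one mutate() away. This is a minimal sketch, assuming hypothetical two-date stand-ins (table4a_demo and table4b_demo) with values copied from the slides. The join keys are spelled out with by so the join does not depend on whichever column names the two tables happen to share.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical stand-ins for table4a (cases) and table4b (deaths)
table4a_demo <- tribble(
  ~country, ~`2020-10-16`, ~`2020-10-17`,
  "Italy",  8803L,         10009L,
  "Spain",  15186L,        NA_integer_
)
table4b_demo <- tribble(
  ~country, ~`2020-10-16`, ~`2020-10-17`,
  "Italy",  83L,           55L,
  "Spain",  222L,          NA_integer_
)

cases_long <- pivot_longer(table4a_demo, !country,
                           names_to = "date", values_to = "cases")
deaths_long <- pivot_longer(table4b_demo, !country,
                            names_to = "date", values_to = "deaths")

# Join on explicit keys, then compute deaths per case
cfr4_demo <- full_join(cases_long, deaths_long, by = c("country", "date")) %>%
  mutate(cfr = deaths / cases)
```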
Wider
pivot_wider() is the opposite of pivot_longer()
You use it when an observation is scattered across multiple rows.
table2
## # A tibble: 28 x 4
## country date variable count
## <chr> <date> <chr> <int>
## 1 Italy 2020-10-16 cases 8803
## 2 Italy 2020-10-16 deaths 83
## 3 Italy 2020-10-17 cases 10009
## 4 Italy 2020-10-17 deaths 55
## 5 Italy 2020-10-18 cases 10925
## 6 Italy 2020-10-18 deaths 47
## 7 Italy 2020-10-19 cases 11705
## 8 Italy 2020-10-19 deaths 69
## 9 Italy 2020-10-20 cases 9337
## 10 Italy 2020-10-20 deaths 73
## # … with 18 more rows
Wider
Two parameters:
The column to take variable names from. Here, it’s variable.
The column to take values from. Here it’s count.
table2 %>%
pivot_wider(names_from = variable, values_from = count)
## # A tibble: 14 x 4
## country date cases deaths
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
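With cases and deaths back in their own columns, the CFR from the earlier exercise no longer needs separate intermediate tables. This is a minimal sketch, assuming a hypothetical stand-in (table2_demo) with a few rows copied from table2 and dates kept as plain strings.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for table2
table2_demo <- tribble(
  ~country,   ~date,        ~variable, ~count,
  "Italy",    "2020-10-16", "cases",   8803,
  "Italy",    "2020-10-16", "deaths",  83,
  "Portugal", "2020-10-16", "cases",   2101,
  "Portugal", "2020-10-16", "deaths",  11
)

# One row per country and date, then deaths per case
cfr2_demo <- table2_demo %>%
  pivot_wider(names_from = variable, values_from = count) %>%
  mutate(cfr = deaths / cases)
```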
Separating and uniting
Separate
separate() pulls apart one column into multiple columns, by splitting wherever a
separator character appears.
table3
## # A tibble: 14 x 3
## country date cfr
## <chr> <date> <chr>
## 1 Italy 2020-10-16 83/8803
## 2 Italy 2020-10-17 55/10009
## 3 Italy 2020-10-18 47/10925
## 4 Italy 2020-10-19 69/11705
## 5 Italy 2020-10-20 73/9337
## 6 Portugal 2020-10-16 11/2101
## 7 Portugal 2020-10-17 21/2608
## 8 Portugal 2020-10-18 13/2153
## 9 Portugal 2020-10-19 19/1856
## 10 Portugal 2020-10-20 17/1949
## 11 Spain 2020-10-16 222/15186
## 12 Spain 2020-10-17 NA/NA
## 13 Spain 2020-10-18 NA/NA
## 14 Spain 2020-10-19 217/37889
Separate
table3 %>%
separate(cfr, into = c("deaths", "cases"))
## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <chr> <chr>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 217 37889
Separate
By default, separate() will split values wherever it sees a non-alphanumeric character
(i.e. a character that isn't a number or letter).
If you wish to use a specific character to separate a column, you can pass the character
to the sep argument.
table3 %>%
separate(cfr, into = c("deaths", "cases"), sep = "/")
## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <chr> <chr>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186
Separate
By default the new deaths and cases columns are character columns; convert = TRUE converts them to integers.
table3 %>%
separate(cfr, into = c("deaths", "cases"), convert = TRUE)
## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 217 37889
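Once the two pieces are integers, the ratio stored as text in table3 can be recomputed as a number. This is a minimal sketch, assuming a hypothetical two-row stand-in (table3_demo) with values copied from the slides. The first piece of each string is deaths and the second is cases, and the "NA/NA" row is expected to become missing values after conversion.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for table3; cfr is stored as the text "deaths/cases"
table3_demo <- tribble(
  ~country, ~date,        ~cfr,
  "Italy",  "2020-10-16", "83/8803",
  "Spain",  "2020-10-17", "NA/NA"
)

cfr3_demo <- table3_demo %>%
  # split on "/" and convert both pieces to integers
  separate(cfr, into = c("deaths", "cases"), sep = "/", convert = TRUE) %>%
  # recompute the ratio as a number
  mutate(cfr = deaths / cases)
```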
Separate
You can also pass a vector of integers to sep.
separate() will interpret the integers as positions to split at.
positive values start at 1 on the far-left of the strings;
negative values start at -1 on the far-right of the strings.
table5 <- table3 %>%
separate(date, into = c("year", "-month-day"), sep = 4)
table5
## # A tibble: 14 x 4
## country year `-month-day` cfr
## <chr> <chr> <chr> <chr>
## 1 Italy 2020 -10-16 83/8803
## 2 Italy 2020 -10-17 55/10009
## 3 Italy 2020 -10-18 47/10925
## 4 Italy 2020 -10-19 69/11705
## 5 Italy 2020 -10-20 73/9337
## 6 Portugal 2020 -10-16 11/2101
## 7 Portugal 2020 -10-17 21/2608
## 8 Portugal 2020 -10-18 13/2153
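The example above splits at a positive position. A negative position counts from the right instead. This is a minimal sketch, assuming a hypothetical tibble of date strings (dates_demo) and a current tidyr version, where sep = -2 should leave the last two characters in the second column.

```r
library(tidyr)
library(tibble)

# Hypothetical input: dates stored as plain strings
dates_demo <- tibble(date = c("2020-10-16", "2020-10-17"))

# Split two characters from the right-hand end of each string
split_demo <- dates_demo %>%
  separate(date, into = c("year_month", "day"), sep = -2)
```

The first column keeps everything before the split point, including the trailing hyphen.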
Unite
table5 %>%
unite(date, year, `-month-day`)
## # A tibble: 14 x 3
## country date cfr
## <chr> <chr> <chr>
## 1 Italy 2020_-10-16 83/8803
## 2 Italy 2020_-10-17 55/10009
## 3 Italy 2020_-10-18 47/10925
## 4 Italy 2020_-10-19 69/11705
## 5 Italy 2020_-10-20 73/9337
## 6 Portugal 2020_-10-16 11/2101
## 7 Portugal 2020_-10-17 21/2608
## 8 Portugal 2020_-10-18 13/2153
## 9 Portugal 2020_-10-19 19/1856
## 10 Portugal 2020_-10-20 17/1949
## 11 Spain 2020_-10-16 222/15186
## 12 Spain 2020_-10-17 NA/NA
## 13 Spain 2020_-10-18 NA/NA
## 14 Spain 2020_-10-19 217/37889
Unite
table5 %>%
unite(date, year, `-month-day`, sep="")
## # A tibble: 14 x 3
## country date cfr
## <chr> <chr> <chr>
## 1 Italy 2020-10-16 83/8803
## 2 Italy 2020-10-17 55/10009
## 3 Italy 2020-10-18 47/10925
## 4 Italy 2020-10-19 69/11705
## 5 Italy 2020-10-20 73/9337
## 6 Portugal 2020-10-16 11/2101
## 7 Portugal 2020-10-17 21/2608
## 8 Portugal 2020-10-18 13/2153
## 9 Portugal 2020-10-19 19/1856
## 10 Portugal 2020-10-20 17/1949
## 11 Spain 2020-10-16 222/15186
## 12 Spain 2020-10-17 NA/NA
## 13 Spain 2020-10-18 NA/NA
## 14 Spain 2020-10-19 217/37889
Exercises
1. What do the extra and fill arguments do in separate()? Experiment with the
various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
2. Both unite() and separate() have a remove argument. What does it do? Why
would you set it to FALSE?
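As a starting point for the first exercise, the sketch below tries one option of each argument on the two toy datasets. It is a hedged illustration, not a full answer: extra = "merge" keeps surplus pieces in the last column, and fill = "right" pads missing pieces with NA on the right. The object names (too_many, too_few, merged, padded) are introduced here for illustration only.

```r
library(tidyr)
library(tibble)

too_many <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
too_few  <- tibble(x = c("a,b,c", "d,e", "f,g,i"))

# Too many pieces: keep the surplus in the last column instead of dropping it
merged <- too_many %>%
  separate(x, c("one", "two", "three"), extra = "merge")

# Too few pieces: pad with NA on the right, without a warning
padded <- too_few %>%
  separate(x, c("one", "two", "three"), fill = "right")
```

The other documented options, extra = "drop" and fill = "left", are worth trying on the same data.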
Missing values
A value can be missing in one of two possible ways:
Explicitly, i.e. flagged with NA.
Implicitly, i.e. simply not present in the data.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
Make implicit values explicit
stocks %>%
pivot_wider(names_from = year, values_from = return)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
Turn explicit missing values implicit
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66
Make implicit values explicit 2
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
Complete missing values
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
Complete missing values
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
Case Study
data_pt
## # A tibble: 238 x 88
## data data_dados confirmados confirmados_ars…
## <chr> <chr> <dbl> <dbl>
## 1 26-0… 26-02-202… 0 0
## 2 27-0… 27-02-202… 0 0
## 3 28-0… 28-02-202… 0 0
## 4 29-0… 29-02-202… 0 0
## 5 01-0… 01-03-202… 0 0
## 6 02-0… 02-03-202… 2 2
## 7 03-0… 03-03-202… 4 2
## 8 04-0… 04-03-202… 6 3
## 9 05-0… 05-03-202… 9 5
## 10 06-0… 06-03-202… 13 8
## # … with 228 more rows, and 84 more variables:
## # confirmados_arscentro <dbl>, confirmados_arslvt <dbl>,
## # confirmados_arsalentejo <dbl>,
## # confirmados_arsalgarve <dbl>, confirmados_acores <dbl>,
## # confirmados_madeira <dbl>, confirmados_estrangeiro <dbl>,
## # confirmados_novos <dbl>, recuperados <dbl>, obitos <dbl>,
## # internados <dbl>, internados_uci <dbl>, lab <dbl>,
## # suspeitos <dbl>, vigilancia <dbl>, n_confirmados <dbl>,
## # cadeias_transmissao <dbl>, transmissao_importada <dbl>,
## # confirmados_0_9_f <dbl>, confirmados_0_9_m <dbl>,
## # confirmados_10_19_f <dbl>, confirmados_10_19_m <dbl>,
## # confirmados_20_29_f <dbl>, confirmados_20_29_m <dbl>,
Case Study
data_pt1 <- data_pt %>%
select(data,
confirmados_0_9_f:confirmados_80_plus_m,
obitos_0_9_f:obitos_80_plus_m)
data_pt1
## # A tibble: 238 x 37
## data confirmados_0_9… confirmados_0_9… confirmados_10_…
## <chr> <dbl> <dbl> <dbl>
## 1 26-0… NA NA NA
## 2 27-0… NA NA NA
## 3 28-0… NA NA NA
## 4 29-0… NA NA NA
## 5 01-0… NA NA NA
## 6 02-0… NA NA NA
## 7 03-0… 0 0 0
## 8 04-0… 0 0 0
## 9 05-0… 0 0 0
## 10 06-0… 0 0 0
## # … with 228 more rows, and 33 more variables:
## # confirmados_10_19_m <dbl>, confirmados_20_29_f <dbl>,
## # confirmados_20_29_m <dbl>, confirmados_30_39_f <dbl>,
## # confirmados_30_39_m <dbl>, confirmados_40_49_f <dbl>,
## # confirmados_40_49_m <dbl>, confirmados_50_59_f <dbl>,
## # confirmados_50_59_m <dbl>, confirmados_60_69_f <dbl>,
## # confirmados_60_69_m <dbl>, confirmados_70_79_f <dbl>,
## # confirmados_70_79_m <dbl>, confirmados_80_plus_f <dbl>,
Case Study
Tidy the dataset
Make a plot of the evolution of deaths over time by age group
Tidying
data_pt1 <- data_pt1 %>%
pivot_longer(
cols = c(confirmados_0_9_f:confirmados_80_plus_m,
obitos_0_9_f:obitos_80_plus_m),
names_to = "key",
values_to = "count",
values_drop_na = TRUE
)
Tidying
data_pt1 <- data_pt1 %>%
separate(key,
c("type",
"age_from",
"age_to",
"gender"),
sep = "_")
data_pt1 <- data_pt1 %>%
unite(age, age_from, age_to, sep="-")
Tidying
data_pt1 <- data_pt1 %>%
pivot_wider(
names_from = "type",
values_from = "count"
)
data_pt1$data <- as.Date(data_pt1$data, format='%d-%m-%Y')
Plot
ggplot(filter(data_pt1, gender=='m'),
aes(data, obitos, col=age)) +
geom_line()
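The plot above shows one gender only. To follow deaths by age group regardless of gender, the tidy table can be summed over gender first. This is a minimal sketch on a hypothetical toy tibble (deaths_demo) that mimics the columns of data_pt1. Its numbers are invented placeholders, not values from the real dataset.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical toy data with the same columns as the tidied data_pt1
# (the obitos values are made up for illustration only)
deaths_demo <- tibble(
  data   = rep(as.Date(c("2020-10-01", "2020-10-02")), each = 4),
  age    = rep(rep(c("70-79", "80-plus"), each = 2), times = 2),
  gender = rep(c("f", "m"), times = 4),
  obitos = c(10, 12, 20, 18, 11, 13, 22, 19)
)

# Sum deaths over gender within each date and age group
deaths_by_age <- deaths_demo %>%
  group_by(data, age) %>%
  summarise(obitos = sum(obitos), .groups = "drop")

# One line per age group
p <- ggplot(deaths_by_age, aes(data, obitos, colour = age)) +
  geom_line()
```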
Non-tidy data
Main reasons to use non-tidy data structures:
Alternative representations may have substantial performance or space
advantages.
Specialised fields have evolved their own conventions.
If data does fit naturally into a rectangular structure composed of observations and
variables, tidy data should be the default choice.
Tidy data is not the only way.
Resources
R for Data Science
Tidy data
Tidy data paper
Non-tidy data