Data manipulation: Tidy data

Artur Rodrigues

2021-04-19
Handouts
http://bit.ly/datam2021

Introduction
"Happy families are all alike; every unhappy family is unhappy in its own way."

--- Leo Tolstoy

"Tidy datasets are all alike, but every messy dataset is messy in its own way."

--- Hadley Wickham

Underlying theory: the Tidy Data paper published in the Journal of Statistical Software,
http://www.jstatsoft.org/v59/i10/paper.

consistent way to organise your data in R

requires some upfront work, but that work pays off in the long term

allows you to spend more time on the analytic questions at hand

tidyr package: a member of the core tidyverse.

library(tidyr)

Tidy data
Displaying the same data in multiple ways

table1

## # A tibble: 14 x 4
## country date cases deaths
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 37889 217
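The slides print table1 without showing how it is built. A minimal sketch that rebuilds it from the printed values, assuming the tibble package is installed (the construction is an assumption, not the author's code):

```r
library(tibble)

# Values copied from the printed output above. How the author actually
# created table1 is not shown, so this construction is an assumption.
table1 <- tribble(
  ~country,   ~date,        ~cases,      ~deaths,
  "Italy",    "2020-10-16",  8803L,       83L,
  "Italy",    "2020-10-17", 10009L,       55L,
  "Italy",    "2020-10-18", 10925L,       47L,
  "Italy",    "2020-10-19", 11705L,       69L,
  "Italy",    "2020-10-20",  9337L,       73L,
  "Portugal", "2020-10-16",  2101L,       11L,
  "Portugal", "2020-10-17",  2608L,       21L,
  "Portugal", "2020-10-18",  2153L,       13L,
  "Portugal", "2020-10-19",  1856L,       19L,
  "Portugal", "2020-10-20",  1949L,       17L,
  "Spain",    "2020-10-16", 15186L,      222L,
  "Spain",    "2020-10-17", NA_integer_, NA_integer_,
  "Spain",    "2020-10-18", NA_integer_, NA_integer_,
  "Spain",    "2020-10-19", 37889L,      217L
)

# Store the date as a real Date, matching the <date> column type shown.
table1$date <- as.Date(table1$date)
```

With this object in place, the later table1 examples can be run as written.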

table2

## # A tibble: 28 x 4
## country date variable count
## <chr> <date> <chr> <int>
## 1 Italy 2020-10-16 cases 8803
## 2 Italy 2020-10-16 deaths 83
## 3 Italy 2020-10-17 cases 10009
## 4 Italy 2020-10-17 deaths 55
## 5 Italy 2020-10-18 cases 10925
## 6 Italy 2020-10-18 deaths 47
## 7 Italy 2020-10-19 cases 11705
## 8 Italy 2020-10-19 deaths 69
## 9 Italy 2020-10-20 cases 9337
## 10 Italy 2020-10-20 deaths 73
## # … with 18 more rows

table3

## # A tibble: 14 x 3
## country date cfr
## <chr> <date> <chr>
## 1 Italy 2020-10-16 83/8803
## 2 Italy 2020-10-17 55/10009
## 3 Italy 2020-10-18 47/10925
## 4 Italy 2020-10-19 69/11705
## 5 Italy 2020-10-20 73/9337
## 6 Portugal 2020-10-16 11/2101
## 7 Portugal 2020-10-17 21/2608
## 8 Portugal 2020-10-18 13/2153
## 9 Portugal 2020-10-19 19/1856
## 10 Portugal 2020-10-20 17/1949
## 11 Spain 2020-10-16 222/15186
## 12 Spain 2020-10-17 NA/NA
## 13 Spain 2020-10-18 NA/NA
## 14 Spain 2020-10-19 217/37889

table4a # cases

## # A tibble: 3 x 6
## country `2020-10-16` `2020-10-17` `2020-10-18` `2020-10-19`
## <chr> <int> <int> <int> <int>
## 1 Italy 8803 10009 10925 11705
## 2 Portug… 2101 2608 2153 1856
## 3 Spain 15186 NA NA 37889
## # … with 1 more variable: `2020-10-20` <int>

table4b # deaths

## # A tibble: 3 x 6
## country `2020-10-16` `2020-10-17` `2020-10-18` `2020-10-19`
## <chr> <int> <int> <int> <int>
## 1 Italy 83 55 47 69
## 2 Portug… 11 21 13 19
## 3 Spain 222 NA NA 217
## # … with 1 more variable: `2020-10-20` <int>

Same underlying data
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Practical instructions
1. Put each dataset in a tibble.
2. Put each variable in a column.

Why ensure that your data is tidy?
1. Consistent way of storing data: it's easier to learn the tools that work with it
2. It allows R's vectorised nature to shine (e.g. mutate and summary functions)

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with
tidy data

Compute CFR (deaths per case)
table1 %>%
mutate(cfr = deaths/cases)

## # A tibble: 14 x 5
## country date cases deaths cfr
## <chr> <date> <int> <int> <dbl>
## 1 Italy 2020-10-16 8803 83 0.00943
## 2 Italy 2020-10-17 10009 55 0.00550
## 3 Italy 2020-10-18 10925 47 0.00430
## 4 Italy 2020-10-19 11705 69 0.00589
## 5 Italy 2020-10-20 9337 73 0.00782
## 6 Portugal 2020-10-16 2101 11 0.00524
## 7 Portugal 2020-10-17 2608 21 0.00805
## 8 Portugal 2020-10-18 2153 13 0.00604
## 9 Portugal 2020-10-19 1856 19 0.0102
## 10 Portugal 2020-10-20 1949 17 0.00872
## 11 Spain 2020-10-16 15186 222 0.0146
## 12 Spain 2020-10-17 NA NA NA
## 13 Spain 2020-10-18 NA NA NA
## 14 Spain 2020-10-19 37889 217 0.00573

Compute cases per country
table1 %>%
count(country, wt = cases)

## # A tibble: 3 x 2
## country n
## * <chr> <int>
## 1 Italy 50779
## 2 Portugal 10667
## 3 Spain 53075
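Spain's total is a number rather than NA because count() with wt sums the weights while ignoring missing values. A small sketch of the equivalent explicit form, using only the Spain rows printed earlier (the object names are illustrative):

```r
library(dplyr)
library(tibble)

# Spain rows of table1, copied from the printed values.
spain <- tibble(
  country = "Spain",
  cases   = c(15186L, NA, NA, 37889L)
)

# count(wt = ) sums the weights and skips missing values.
by_count <- spain %>%
  count(country, wt = cases)

# Explicit version: na.rm = TRUE is what keeps the two NA days
# from turning the whole total into NA.
by_summarise <- spain %>%
  group_by(country) %>%
  summarise(n = sum(cases, na.rm = TRUE))
```

Both results hold 53075 for Spain, matching the output above. The total covers only the days with reported data.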

Visualise changes over time
library(ggplot2)
ggplot(table1, aes(date, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))

Exercises
1. Compute the cfr for table2, and table4a + table4b.

You will need to perform four operations:

Extract the number of deaths per country per day.
Extract the number of cases per country per day.
Divide deaths by cases.
Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

2. Recreate the plot showing change in cases over time using table2 instead of
table1. What do you need to do first?

Tentative solution for table2
Extract the number of deaths per country per day:

table_deaths <- table2 %>%
  filter(variable == 'deaths') %>%
  select(country, date, count) %>%
  rename(deaths = count)

Extract the number of cases per country per day:

table_cases <- table2 %>%
  filter(variable == 'cases') %>%
  select(country, date, count) %>%
  rename(cases = count)

Divide deaths by cases:

cfr = table_deaths$deaths/table_cases$cases

Store back in the appropriate place:

tidy2 <- table_cases

tidy2$deaths <- table_deaths$deaths
tidy2$cfr <- cfr

tidy2
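The solution above pairs deaths and cases by row position, which only works while both tables keep the same row order. A hedged alternative sketch that matches rows by key instead, using a two-day excerpt of table2 (table2_small and tidy2_joined are illustrative names, not from the slides):

```r
library(dplyr)
library(tibble)

# Two-day excerpt of table2 for Italy, copied from the printed values.
table2_small <- tibble(
  country  = "Italy",
  date     = as.Date(rep(c("2020-10-16", "2020-10-17"), each = 2)),
  variable = rep(c("cases", "deaths"), times = 2),
  count    = c(8803L, 83L, 10009L, 55L)
)

table_deaths <- table2_small %>%
  filter(variable == "deaths") %>%
  select(country, date, deaths = count)

table_cases <- table2_small %>%
  filter(variable == "cases") %>%
  select(country, date, cases = count)

# Match rows on (country, date) rather than on position, then divide.
tidy2_joined <- inner_join(table_cases, table_deaths,
                           by = c("country", "date")) %>%
  mutate(cfr = deaths / cases)
```

The join gives the same result even if one of the two intermediate tables is sorted differently.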

Recreate the plot using table2
data_plot <- table2 %>%
filter(variable=='cases') %>%
rename(cases = count)
data_plot

## # A tibble: 14 x 4
## country date variable cases
## <chr> <date> <chr> <int>
## 1 Italy 2020-10-16 cases 8803
## 2 Italy 2020-10-17 cases 10009
## 3 Italy 2020-10-18 cases 10925
## 4 Italy 2020-10-19 cases 11705
## 5 Italy 2020-10-20 cases 9337
## 6 Portugal 2020-10-16 cases 2101
## 7 Portugal 2020-10-17 cases 2608
## 8 Portugal 2020-10-18 cases 2153
## 9 Portugal 2020-10-19 cases 1856
## 10 Portugal 2020-10-20 cases 1949
## 11 Spain 2020-10-16 cases 15186
## 12 Spain 2020-10-17 cases NA
## 13 Spain 2020-10-18 cases NA
## 14 Spain 2020-10-19 cases 37889

ggplot(data_plot, aes(date, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))

Pivoting
Main reasons for untidy data:

1. Unfamiliarity with the principles of tidy data.

2. Data is often organised to facilitate some use other than analysis (e.g. data
entry).

The first step is always to figure out what the variables and observations are.

The second step is to resolve one of two common problems:

1. One variable might be spread across multiple columns.

2. One observation might be scattered across multiple rows.

To fix these problems: pivot_longer() and pivot_wider().

Longer
When some of the column names are not names of variables, but values of a variable.

table4a

We need to pivot the offending columns into a new pair of variables.

Three parameters:

The set of columns whose names are values, not variables.

The name of the variable to move the column names to. Here it is date.

The name of the variable to move the column values to. Here it’s cases.

The columns to pivot are specified with dplyr::select() style notation.

Table4a
table4a %>%
pivot_longer(c(`2020-10-16`, `2020-10-17`, `2020-10-18`, `2020-10-19`, `2020-10-20`),
names_to = "date",
values_to = "cases")

## # A tibble: 15 x 3
## country date cases
## <chr> <chr> <int>
## 1 Italy 2020-10-16 8803
## 2 Italy 2020-10-17 10009
## 3 Italy 2020-10-18 10925
## 4 Italy 2020-10-19 11705
## 5 Italy 2020-10-20 9337
## 6 Portugal 2020-10-16 2101
## 7 Portugal 2020-10-17 2608
## 8 Portugal 2020-10-18 2153
## 9 Portugal 2020-10-19 1856
## 10 Portugal 2020-10-20 1949
## 11 Spain 2020-10-16 15186
## 12 Spain 2020-10-17 NA
## 13 Spain 2020-10-18 NA
## 14 Spain 2020-10-19 37889
## 15 Spain 2020-10-20 NA

table4a %>%
pivot_longer(!country,
names_to = "date",
values_to = "cases")

Table4b
table4b %>%
pivot_longer(!country,
names_to = "date",
values_to = "deaths")

## # A tibble: 15 x 3
## country date deaths
## <chr> <chr> <int>
## 1 Italy 2020-10-16 83
## 2 Italy 2020-10-17 55
## 3 Italy 2020-10-18 47
## 4 Italy 2020-10-19 69
## 5 Italy 2020-10-20 73
## 6 Portugal 2020-10-16 11
## 7 Portugal 2020-10-17 21
## 8 Portugal 2020-10-18 13
## 9 Portugal 2020-10-19 19
## 10 Portugal 2020-10-20 17
## 11 Spain 2020-10-16 222
## 12 Spain 2020-10-17 NA
## 13 Spain 2020-10-18 NA
## 14 Spain 2020-10-19 217
## 15 Spain 2020-10-20 NA

Combine in a single tibble
tidy4a <- table4a %>%
pivot_longer(!country,
names_to = "date",
values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(!country,
names_to = "date",
values_to = "deaths")

full_join(tidy4a, tidy4b)

## Joining, by = c("country", "date")

## # A tibble: 15 x 4
## country date cases deaths
## <chr> <chr> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 37889 217
## 15 Spain 2020-10-20 NA NA
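After pivoting, date is a character column (<chr>) because it came from column names. A short sketch of the same combine step with an explicit join key and a conversion back to Date, on a two-country, two-day excerpt (the *_small names and tidy4 are illustrative):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Excerpts of table4a / table4b, copied from the printed values.
table4a_small <- tibble(country      = c("Italy", "Portugal"),
                        `2020-10-16` = c(8803L, 2101L),
                        `2020-10-17` = c(10009L, 2608L))
table4b_small <- tibble(country      = c("Italy", "Portugal"),
                        `2020-10-16` = c(83L, 11L),
                        `2020-10-17` = c(55L, 21L))

tidy4a <- table4a_small %>%
  pivot_longer(!country, names_to = "date", values_to = "cases")
tidy4b <- table4b_small %>%
  pivot_longer(!country, names_to = "date", values_to = "deaths")

# Name the join keys explicitly, then turn the former column names
# (strings) into proper Date values.
tidy4 <- full_join(tidy4a, tidy4b, by = c("country", "date")) %>%
  mutate(date = as.Date(date))
```

Stating by = c("country", "date") avoids the "Joining, by = ..." message and makes the key choice visible.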

Wider
pivot_wider() is the opposite of pivot_longer()

You use it when an observation is scattered across multiple rows.

table2

Two parameters:

The column to take variable names from. Here, it's variable.

The column to take values from. Here it’s count.

table2 %>%
pivot_wider(names_from = variable, values_from = count)

## # A tibble: 14 x 4
## country date cases deaths
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 8803 83
## 2 Italy 2020-10-17 10009 55
## 3 Italy 2020-10-18 10925 47
## 4 Italy 2020-10-19 11705 69
## 5 Italy 2020-10-20 9337 73
## 6 Portugal 2020-10-16 2101 11
## 7 Portugal 2020-10-17 2608 21
## 8 Portugal 2020-10-18 2153 13
## 9 Portugal 2020-10-19 1856 19
## 10 Portugal 2020-10-20 1949 17
## 11 Spain 2020-10-16 15186 222
## 12 Spain 2020-10-17 NA NA
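
Once table2 is widened, the CFR exercise reduces to a single mutate(). A minimal sketch on a two-day excerpt (table2_small and tidy2_wide are illustrative names):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two-day excerpt of table2 for Italy, copied from the printed values.
table2_small <- tibble(
  country  = "Italy",
  date     = as.Date(rep(c("2020-10-16", "2020-10-17"), each = 2)),
  variable = rep(c("cases", "deaths"), times = 2),
  count    = c(8803L, 83L, 10009L, 55L)
)

# One row per (country, date) with cases and deaths side by side,
# then the ratio is an ordinary column operation.
tidy2_wide <- table2_small %>%
  pivot_wider(names_from = variable, values_from = count) %>%
  mutate(cfr = deaths / cases)
```

This replaces the filter/rename/recombine steps of the tentative solution with one pivot.
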

Separating and uniting

Separate
separate() pulls apart one column into multiple columns, by splitting wherever a
separator character appears.

table3

table3 %>%
separate(cfr, into = c("deaths", "cases"))

## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <chr> <chr>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 217 37889

By default, separate() will split values wherever it sees a non-alphanumeric character
(i.e. a character that isn't a number or letter).

If you wish to use a specific character to separate a column, you can pass the character
to the sep argument.

table3 %>%
separate(cfr, into = c("deaths", "cases"), sep = "/")

## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <chr> <chr>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186

By default the new deaths and cases columns are character columns. Setting convert = TRUE asks separate() to convert them to better types.

table3 %>%
  separate(cfr, into = c("deaths", "cases"), convert = TRUE)

## # A tibble: 14 x 4
## country date deaths cases
## <chr> <date> <int> <int>
## 1 Italy 2020-10-16 83 8803
## 2 Italy 2020-10-17 55 10009
## 3 Italy 2020-10-18 47 10925
## 4 Italy 2020-10-19 69 11705
## 5 Italy 2020-10-20 73 9337
## 6 Portugal 2020-10-16 11 2101
## 7 Portugal 2020-10-17 21 2608
## 8 Portugal 2020-10-18 13 2153
## 9 Portugal 2020-10-19 19 1856
## 10 Portugal 2020-10-20 17 1949
## 11 Spain 2020-10-16 222 15186
## 12 Spain 2020-10-17 NA NA
## 13 Spain 2020-10-18 NA NA
## 14 Spain 2020-10-19 217 37889
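
With integer columns the ratio can be computed directly. A small sketch on two rows of table3, including one "NA/NA" row (table3_small and tidy3 are illustrative names):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two rows of table3, copied from the printed values.
table3_small <- tibble(
  country = c("Italy", "Spain"),
  date    = as.Date(c("2020-10-16", "2020-10-17")),
  cfr     = c("83/8803", "NA/NA")
)

tidy3 <- table3_small %>%
  # deaths come first because the string is written as deaths/cases;
  # convert = TRUE turns the "NA" pieces into real missing values.
  separate(cfr, into = c("deaths", "cases"), sep = "/", convert = TRUE) %>%
  mutate(cfr = deaths / cases)
```

The order of names in into must follow the order of the pieces in the string, otherwise the two counts end up swapped.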

You can also pass a vector of integers to sep.

separate() will interpret the integers as positions to split at.

positive values start at 1 on the far-left of the strings;

negative values start at -1 on the far-right of the strings.

table5 <- table3 %>%
  separate(date, into = c("year", "-month-day"), sep = 4)
table5

## # A tibble: 14 x 4
## country year `-month-day` cfr
## <chr> <chr> <chr> <chr>
## 1 Italy 2020 -10-16 83/8803
## 2 Italy 2020 -10-17 55/10009
## 3 Italy 2020 -10-18 47/10925
## 4 Italy 2020 -10-19 69/11705
## 5 Italy 2020 -10-20 73/9337
## 6 Portugal 2020 -10-16 11/2101
## 7 Portugal 2020 -10-17 21/2608
## 8 Portugal 2020 -10-18 13/2153

Unite
table5 %>%
unite(date, year, `-month-day`)

## # A tibble: 14 x 3
## country date cfr
## <chr> <chr> <chr>
## 1 Italy 2020_-10-16 83/8803
## 2 Italy 2020_-10-17 55/10009
## 3 Italy 2020_-10-18 47/10925
## 4 Italy 2020_-10-19 69/11705
## 5 Italy 2020_-10-20 73/9337
## 6 Portugal 2020_-10-16 11/2101
## 7 Portugal 2020_-10-17 21/2608
## 8 Portugal 2020_-10-18 13/2153
## 9 Portugal 2020_-10-19 19/1856
## 10 Portugal 2020_-10-20 17/1949
## 11 Spain 2020_-10-16 222/15186
## 12 Spain 2020_-10-17 NA/NA
## 13 Spain 2020_-10-18 NA/NA
## 14 Spain 2020_-10-19 217/37889

table5 %>%
unite(date, year, `-month-day`, sep="")

## # A tibble: 14 x 3
## country date cfr
## <chr> <chr> <chr>
## 1 Italy 2020-10-16 83/8803
## 2 Italy 2020-10-17 55/10009
## 3 Italy 2020-10-18 47/10925
## 4 Italy 2020-10-19 69/11705
## 5 Italy 2020-10-20 73/9337
## 6 Portugal 2020-10-16 11/2101
## 7 Portugal 2020-10-17 21/2608
## 8 Portugal 2020-10-18 13/2153
## 9 Portugal 2020-10-19 19/1856
## 10 Portugal 2020-10-20 17/1949
## 11 Spain 2020-10-16 222/15186
## 12 Spain 2020-10-17 NA/NA
## 13 Spain 2020-10-18 NA/NA
## 14 Spain 2020-10-19 217/37889

Exercises
1. What do the extra and fill arguments do in separate()? Experiment with the
various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"))

2. Both unite() and separate() have a remove argument. What does it do? Why
would you set it to FALSE?
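
A tentative starting point for both exercises, using the two toy datasets above. The object names are illustrative, and only a few of the available options are shown:

```r
library(tidyr)
library(tibble)

too_many <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
too_few  <- tibble(x = c("a,b,c", "d,e", "f,g,i"))
cols     <- c("one", "two", "three")

# extra = "merge": split at most length(into) times, so the surplus
# piece stays attached to the last column.
merged <- too_many %>% separate(x, cols, extra = "merge")

# extra = "drop": discard the surplus piece without a warning.
dropped <- too_many %>% separate(x, cols, extra = "drop")

# fill = "right": pad rows that have too few pieces with NA on the right.
padded <- too_few %>% separate(x, cols, fill = "right")

# remove = FALSE: keep the original column next to the new ones.
kept <- too_many %>% separate(x, cols, extra = "drop", remove = FALSE)
```

Keeping the input column with remove = FALSE is handy when the split needs to be checked against the original strings.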

Missing values
A value can be missing in one of two possible ways:

Explicitly, i.e. flagged with NA.

Implicitly, i.e. simply not present in the data.

stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

Make implicit values explicit
stocks %>%
  pivot_wider(names_from = year, values_from = return)

## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66

Turn explicit missing values implicit
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)

## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66

Make implicit values explicit 2
stocks %>%
complete(year, qtr)

## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66

Complete missing values
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)

treatment %>%
fill(person)

## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
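
fill() carries the last seen value downwards by default. A short sketch of the .direction argument for the opposite layout, where the label sits on the last row of each block (the scores data is hypothetical, reusing the names from the example above):

```r
library(tidyr)
library(tibble)

# Hypothetical layout: the person label appears on the LAST row
# of each block instead of the first.
scores <- tribble(
  ~person,            ~response,
  NA_character_,       7,
  NA_character_,      10,
  "Derrick Whitmore",  9,
  "Katherine Burke",   4
)

# .direction = "up" copies each value onto the missing rows above it.
filled_up <- scores %>%
  fill(person, .direction = "up")
```

The default is .direction = "down", which is what the treatment example relies on.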

Case Study
data_pt

## # A tibble: 238 x 88
## data data_dados confirmados confirmados_ars…
## <chr> <chr> <dbl> <dbl>
## 1 26-0… 26-02-202… 0 0
## 2 27-0… 27-02-202… 0 0
## 3 28-0… 28-02-202… 0 0
## 4 29-0… 29-02-202… 0 0
## 5 01-0… 01-03-202… 0 0
## 6 02-0… 02-03-202… 2 2
## 7 03-0… 03-03-202… 4 2
## 8 04-0… 04-03-202… 6 3
## 9 05-0… 05-03-202… 9 5
## 10 06-0… 06-03-202… 13 8
## # … with 228 more rows, and 84 more variables:
## # confirmados_arscentro <dbl>, confirmados_arslvt <dbl>,
## # confirmados_arsalentejo <dbl>,
## # confirmados_arsalgarve <dbl>, confirmados_acores <dbl>,
## # confirmados_madeira <dbl>, confirmados_estrangeiro <dbl>,
## # confirmados_novos <dbl>, recuperados <dbl>, obitos <dbl>,
## # internados <dbl>, internados_uci <dbl>, lab <dbl>,
## # suspeitos <dbl>, vigilancia <dbl>, n_confirmados <dbl>,
## # cadeias_transmissao <dbl>, transmissao_importada <dbl>,
## # confirmados_0_9_f <dbl>, confirmados_0_9_m <dbl>,
## # confirmados_10_19_f <dbl>, confirmados_10_19_m <dbl>,
## # confirmados_20_29_f <dbl>, confirmados_20_29_m <dbl>,

data_pt1 <- data_pt %>%
select(data,
confirmados_0_9_f:confirmados_80_plus_m,
obitos_0_9_f:obitos_80_plus_m)
data_pt1

## # A tibble: 238 x 37
## data confirmados_0_9… confirmados_0_9… confirmados_10_…
## <chr> <dbl> <dbl> <dbl>
## 1 26-0… NA NA NA
## 2 27-0… NA NA NA
## 3 28-0… NA NA NA
## 4 29-0… NA NA NA
## 5 01-0… NA NA NA
## 6 02-0… NA NA NA
## 7 03-0… 0 0 0
## 8 04-0… 0 0 0
## 9 05-0… 0 0 0
## 10 06-0… 0 0 0
## # … with 228 more rows, and 33 more variables:
## # confirmados_10_19_m <dbl>, confirmados_20_29_f <dbl>,
## # confirmados_20_29_m <dbl>, confirmados_30_39_f <dbl>,
## # confirmados_30_39_m <dbl>, confirmados_40_49_f <dbl>,
## # confirmados_40_49_m <dbl>, confirmados_50_59_f <dbl>,
## # confirmados_50_59_m <dbl>, confirmados_60_69_f <dbl>,
## # confirmados_60_69_m <dbl>, confirmados_70_79_f <dbl>,
## # confirmados_70_79_m <dbl>, confirmados_80_plus_f <dbl>,

Tidy the dataset
Make a plot of the evolution of deaths over time by age group

Tidying
data_pt1 <- data_pt1 %>%
pivot_longer(
cols = c(confirmados_0_9_f:confirmados_80_plus_m,
obitos_0_9_f:obitos_80_plus_m),
names_to = "key",
values_to = "count",
values_drop_na = TRUE
)

data_pt1 <- data_pt1 %>%
separate(key,
c("type",
"age_from",
"age_to",
"gender"),
sep = "_")

data_pt1 <- data_pt1 %>%
  unite(age, age_from, age_to, sep = "-")

data_pt1 <- data_pt1 %>%
pivot_wider(
names_from = "type",
values_from = "count"
)

data_pt1$data <- as.Date(data_pt1$data, format='%d-%m-%Y')
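
The pivot_longer(), separate() and pivot_wider() steps can also be folded into a single pivot_longer() call. A hedged sketch on a toy stand-in for data_pt1: the column naming scheme follows the slides, but the counts are made up and toy_pt / toy_tidy are illustrative names:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for data_pt1: same naming scheme, invented counts.
toy_pt <- tibble(
  data                  = c("03-03-2020", "04-03-2020"),
  confirmados_0_9_f     = c(0, 1),
  confirmados_80_plus_m = c(2, 3),
  obitos_0_9_f          = c(0, 0),
  obitos_80_plus_m      = c(0, 1)
)

toy_tidy <- toy_pt %>%
  pivot_longer(
    cols = !data,
    # ".value" keeps the first name piece (confirmados / obitos) as
    # column names, so no pivot_wider() step is needed afterwards.
    names_to = c(".value", "age_from", "age_to", "gender"),
    names_sep = "_",
    values_drop_na = TRUE
  ) %>%
  unite(age, age_from, age_to, sep = "-") %>%
  mutate(data = as.Date(data, format = "%d-%m-%Y"))
```

The result has one row per date, age group and gender, with confirmados and obitos as separate columns, which is the same shape the three-step version produces.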

Plot
ggplot(filter(data_pt1, gender=='m'),
aes(data, obitos, col=age)) +
geom_line()
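
The plot above shows one gender only. If the goal is deaths by age group regardless of gender, one option is to sum over gender first. A sketch on made-up values in the tidied layout (toy_tidy and by_age are illustrative names):

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Made-up values in the same layout as the tidied data_pt1.
toy_tidy <- tibble(
  data   = as.Date(rep(c("2020-03-03", "2020-03-04"), each = 4)),
  age    = rep(rep(c("0-9", "80-plus"), each = 2), times = 2),
  gender = rep(c("f", "m"), times = 4),
  obitos = c(0, 0, 1, 2, 0, 0, 3, 4)
)

# Collapse the two genders into one count per date and age group.
by_age <- toy_tidy %>%
  group_by(data, age) %>%
  summarise(obitos = sum(obitos, na.rm = TRUE), .groups = "drop")

# One line per age group.
p <- ggplot(by_age, aes(data, obitos, colour = age)) +
  geom_line()
```

Printing p draws the chart. Whether to sum or to facet by gender depends on the question being asked.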

Non-tidy data
Main reasons to use non-tidy data structures:

Alternative representations may have substantial performance or space advantages.

Specialised fields have evolved their own conventions.

If data does fit naturally into a rectangular structure composed of observations and
variables, tidy data should be the default choice.

Tidy data is not the only way.

Resources
R for Data Science

Tidy data

Tidy data paper

Non-tidy data
