20CS2058 / BASICS OF DATA ANALYTICS - R PROGRAMMING AND TABLEAU
Module 2
Tidy Data with tidyr
Introduction
• You can represent the same underlying data in multiple ways.
• Each dataset below shows the same values of four variables
(country, year, population, and cases), but each dataset
organizes the values in a different way:
• table1
• table2
• table3
• table4a
• table4b
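To look at the five tables yourself, a minimal sketch follows. It assumes the tidyr package is installed; table1 through table4b ship with it as example datasets. The shapes object is only an illustrative name.

```r
library(tidyr)  # table1, table2, table3, table4a and table4b are bundled with tidyr

table1   # one row per country and year
table2   # cases and population stacked in a single count column
table3   # cases and population packed into one rate string
table4a  # cases, one column per year
table4b  # population, one column per year

# Compare the shapes (rows, columns): the same data, arranged differently
shapes <- sapply(
  list(table1 = table1, table2 = table2, table3 = table3,
       table4a = table4a, table4b = table4b),
  dim
)
shapes
```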
Interrelated rules
• There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
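One way to check rule 2 in practice is sketched below. It assumes dplyr and tidyr are installed, and it treats one country in one year as one observation; the names rows_table1 and rows_table2 are illustrative.

```r
library(dplyr)
library(tidyr)

# Rule 2: each observation (one country in one year) should have its own row.
# Count how many rows each country-year pair occupies.
rows_table1 <- table1 %>% count(country, year)
rows_table2 <- table2 %>% count(country, year)

rows_table1$n  # 1 for every pair: table1 satisfies the rule
rows_table2$n  # 2 for every pair: each observation is split over two rows
```

A table that passes this check is not automatically tidy (table3 also has one row per country and year), so the other two rules still need to be checked.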
Example1
# Compute rate per 10,000
library(tidyverse)  # provides %>%, mutate(), and the table1 dataset
table1 %>%
  mutate(rate = cases / population * 10000)
Example2
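# Compute cases per year: wt = cases sums the cases column within each year
table1 %>%
  count(year, wt = cases)
# Without wt, count() tallies the rows for each (year, cases) combination
table1 %>%
  count(year, cases)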
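The weighted count above can also be written as an explicit group-and-sum, which may make it clearer what wt does. This is a sketch assuming dplyr and tidyr are installed; cases_by_year is an illustrative name.

```r
library(dplyr)
library(tidyr)

# count(year, wt = cases) is shorthand for grouping by year and summing cases
cases_by_year <- table1 %>%
  group_by(year) %>%
  summarise(n = sum(cases))

cases_by_year
```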
Visualization
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "red") +
  geom_point(aes(color = country))
Gathering
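• Use gathering when some column names are not names of variables
but values of a variable (here, the years 1999 and 2000).
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)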
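In current tidyr releases gather() is superseded by pivot_longer(). The same step might look like the sketch below, assuming tidyr 1.0.0 or later and dplyr are installed; tidy4 is an illustrative name, and the explicit by argument is an addition not shown on the slide.

```r
library(dplyr)
library(tidyr)

# pivot_longer() equivalent of gather(): names_to plays the role of key,
# values_to the role of value
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Naming the join columns explicitly avoids the "Joining with by = ..." message
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year"))
tidy4
```

Note that year comes out as a character column in both versions, because it is built from column names.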
Spreading
• Spreading is the opposite of gathering.
• You use it when an observation is scattered across
multiple rows.
• Example:
spread(table2, key = type, value = count)
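Like gather(), spread() is superseded in current tidyr releases; pivot_wider() is the replacement. A sketch of the same call, assuming tidyr 1.0.0 or later is installed (wide2 is an illustrative name):

```r
library(tidyr)

# pivot_wider() equivalent of spread(): names_from plays the role of key,
# values_from the role of value
wide2 <- pivot_wider(table2, names_from = type, values_from = count)
wide2
```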
gather() vs spread()
• gather() makes wide tables narrower and longer;
spread() makes long tables shorter and wider.
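The change in shape can be checked with dim(). The sketch below assumes tidyr is installed; long4a and wide_from_2 are illustrative names. In these small examples only the row count changes, because just two year columns are gathered and just two types are spread; with more of either, the column count would change as well.

```r
library(tidyr)

# gather(): table4a has 3 rows, the gathered table has 6 (longer)
long4a <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")
dim(table4a)
dim(long4a)

# spread(): table2 has 12 rows, the spread table has 6 (shorter)
wide_from_2 <- spread(table2, key = type, value = count)
dim(table2)
dim(wide_from_2)
```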
Separate()
• separate() pulls apart one column into multiple columns, by
splitting wherever a separator character appears.
Example
table3 %>%
  separate(rate, into = c("cases", "population"))
Rewrite the preceding code
• If you wish to use a specific character to separate a
column, you can pass the character to the sep
argument of separate().
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
Convert to better types
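• By default separate() leaves the new columns as character;
convert = TRUE asks it to try converting them to better types.
table3 %>%
  separate(
    rate,
    into = c("cases", "population"),
    convert = TRUE
  )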
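To see what convert = TRUE changes, the column classes can be compared with and without it. A minimal sketch, assuming tidyr is installed; as_text and as_int are illustrative names.

```r
library(tidyr)

as_text <- separate(table3, rate, into = c("cases", "population"))
as_int  <- separate(table3, rate, into = c("cases", "population"), convert = TRUE)

class(as_text$cases)      # "character": the pieces of a string stay strings
class(as_int$cases)       # "integer" after convert = TRUE
class(as_int$population)  # "integer" as well
```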
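• You can also pass a vector of integers to sep; separate()
interprets them as positions to split at.
• When using integers to separate strings, the length of
sep should be one less than the number of names in
into.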
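A common illustration of an integer sep, assumed here because the slide's own example is not reproduced in the text, splits the year column of table3 after its second character. The name table3_split is illustrative.

```r
library(tidyr)

# sep = 2 splits after the second character: "1999" becomes "19" and "99".
# One split position (a length-1 sep) yields the two columns named in into.
table3_split <- separate(table3, year, into = c("century", "year"), sep = 2)
table3_split
```

This form of the data is not very useful on its own, but it produces century and year columns that unite() can later rejoin.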
Unite
• unite() is the inverse of separate(): it combines multiple
columns into a single column.
• We can use unite() to rejoin the century and year columns
that we created in the last example.
Example
• The default will place an underscore (_)
between the values from different columns.
• To join the values with no separator, specify sep = "":
table5 %>%
  unite(new, century, year, sep = "")
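The two behaviours can be compared side by side. A sketch assuming tidyr is installed; table5 ships with tidyr, and with_underscore and without_sep are illustrative names.

```r
library(tidyr)

with_underscore <- unite(table5, new, century, year)            # default sep = "_"
without_sep     <- unite(table5, new, century, year, sep = "")  # no separator

with_underscore$new  # values such as "19_99"
without_sep$new      # values such as "1999"
```

The new column is character in both cases, so it would still need converting (for example with as.integer()) before being used as a number.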