DSA2101
Essential Data Analytics Tools: Data Visualization
Yuting Huang
AY24/25
Week 12 Exploring data through visualization
Midterm Exam: Feedback
The grades and feedback are available on Canvas:
▶ Median = 23.28, mean = 22.54, std = 6.45.
▶ Some of you did really well: > 27.5/30.
▶ If you scored below 12.75, please schedule a meeting with
your TA to review the midterm.
Final Exam: Date and time
The final exam is worth 40% of your grade.
▶ Time: May 6th 9-11am
▶ Venue: MPSH1A
▶ Open book, open notes; the exam runs on Examplify with internet access blocked.
▶ R packages required: readxl, stringr, lubridate,
tidyverse.
▶ Data files will be available on Canvas 15 minutes before the
exam.
Final Exam: Question format
The exam consists of
▶ Part I: Multiple choice + Fill-in-the-blank questions (20
marks).
▶ Answer questions directly on Examplify. No submission of
R code is needed.
▶ Part II: Coding questions (20 marks).
▶ Answer questions in a single Rmd file and submit it on both
Examplify and Canvas.
Submission requirements
At the end of the exam at 11am:
▶ Copy and paste the entire Rmd to an Examplify text box.
Submit the exam on Examplify.
▶ Then upload your Rmd to Canvas before 11:15am.
▶ Ensure that the code submissions on Examplify and
Canvas are identical, with the exception of indentation and
alignment. Any discrepancy will be flagged and penalized.
▶ After successful submissions, please keep your Examplify
green confirmation window and the Canvas submission
page open for invigilators to verify.
Explore data through visualization
Visualization is an integral part of exploratory data analysis
(EDA).
▶ It is a highly iterative process. We should expect to:
▶ Generate questions about our data.
▶ Search for answers by visualizing, transforming, and
modeling our data.
▶ Use what we learn to refine our questions and/or generate
new questions.
coronavirus data
Let’s work with a data set on the daily summary of Covid-19
cases, deaths, and recovery for Asian countries and cities.
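library(tidyverse)
library(lubridate)  # for ymd(); attached automatically by tidyverse >= 2.0.0
theme_set(theme_minimal())
# Read the data, drop the row-index column X, and parse date as a Date
coronavirus <- read.csv("../data/wk12_coronavirus.csv") %>%
  select(-X) %>%
  mutate(date = ymd(date))
glimpse(coronavirus)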
## Rows: 154,305
## Columns: 4
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan",
## $ date <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-
## $ type <chr> "confirmed", "death", "recovery", "confirmed"
## $ cases <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
The data contain information on Covid-19 cases for 45 Asian
countries and cities.
▶ Let’s examine confirmed cases for selected countries in
2020.
selected_countries <- c("Singapore", "Malaysia", "Indonesia")
df1 <- coronavirus %>%
  filter(country %in% selected_countries,
         type == "confirmed", date <= "2020-12-31")
Naturally, we can visualize the daily confirmed cases with line
charts.
df1_text <- df1 %>% filter(date == "2020-12-31")
ggplot(df1, aes(x = date, y = cases, color = country)) +
  geom_line() +
  geom_text(data = df1_text, aes(label = country),
            hjust = "left", nudge_x = 2, size = 3) +
  labs(x = "", y = "Confirmed cases") +
  theme(legend.position = "none") +
  scale_x_date(limits = as.Date(c("2020-01-01", "2021-01-31")))
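[Figure: line chart of daily confirmed cases in 2020, one line each for Indonesia, Malaysia, and Singapore, labelled with the country name at the right end; x-axis: date (Jan 2020 to Jan 2021), y-axis: confirmed cases (0 to 8000).]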
The layer facet_wrap() creates small multiples (i.e., faceted
plots) based on a categorical variable.
▶ Each subplot shows a subset of the data.
▶ By default, it also keeps the scales of the axes fixed, for
easier comparison.
ggplot(df1, aes(x = date, y = cases, color = country)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(x = "", y = "Confirmed cases") +
  theme(legend.position = "none") +
  scale_x_date(limits = as.Date(c("2020-01-01", "2021-01-01")),
               date_breaks = "6 months", date_labels = "%Y-%b")
[Figure: small multiples with one panel each for Indonesia, Malaysia, and Singapore, all sharing the same axes; x-axis: date (labelled 2020-Jun and 2020-Dec), y-axis: confirmed cases (0 to 8000).]
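The fixed-scale default described above can be compared with free scales in a minimal sketch. The data frame toy and its columns are hypothetical and only serve as an illustration; scales = "free_y" is the facet_wrap() argument that lets each panel pick its own y-axis range.

```r
library(ggplot2)

# Hypothetical counts for three groups with very different magnitudes.
toy <- data.frame(
  day   = rep(1:10, times = 3),
  group = rep(c("A", "B", "C"), each = 10),
  cases = c(1:10, (1:10) * 10, (1:10) * 100)
)

base <- ggplot(toy, aes(x = day, y = cases)) +
  geom_line()

# Default: every panel shares the same x and y scales,
# which makes magnitudes directly comparable across panels.
p_fixed <- base + facet_wrap(~ group)

# Alternative: each panel gets its own y-axis range,
# which reveals the shape of small series but hides differences in level.
p_free <- base + facet_wrap(~ group, scales = "free_y")

p_fixed
p_free
```

Fixed scales are usually the safer choice when the goal is comparison across panels; free scales may help when one group dwarfs the others, as Indonesia does here.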
We can also visualize the data using a tile chart (heat
map).
▶ A numeric variable is mapped to a continuous fill scale.
ggplot(df1, aes(x = date, y = country)) +
  geom_tile(aes(fill = cases/1000)) +
  scale_fill_gradient(low = "white", high = "maroon") +
  labs(title = "Confirmed cases in 2020",
       fill = "Cases (thousands)", x = "", y = "") +
  theme(legend.position = "top")
[Figure: tile chart titled "Confirmed cases in 2020"; x-axis: date (Apr 2020 to Jan 2021), y-axis: country (Singapore, Malaysia, Indonesia); fill: cases in thousands (0 to 8), from white to maroon; legend on top.]
Alternatively, we can aggregate daily counts to monthly totals,
and visualize the overall trends.
df2 <- df1 %>%
  mutate(month = month(date, label = TRUE, abbr = TRUE)) %>%
  group_by(country, month) %>%
  summarize(cases = sum(cases), .groups = "drop")
glimpse(df2)
## Rows: 36
## Columns: 3
## $ country <chr> "Indonesia", "Indonesia", "Indonesia", "Indon
## $ month <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep,
## $ cases <int> 0, 0, 1528, 8590, 16355, 29912, 51991, 66420,
ggplot(df2, aes(x = month, y = country)) +
  geom_tile(aes(fill = cases/1000)) +
  scale_fill_gradient(low = "white", high = "maroon") +
  labs(title = "Confirmed cases in 2020",
       fill = "Cases (thousands)", x = "", y = "") +
  theme(legend.position = "top")
[Figure: tile chart titled "Confirmed cases in 2020"; x-axis: month (Jan to Dec), y-axis: country (Singapore, Malaysia, Indonesia); fill: monthly cases in thousands (0 to 200), from white to maroon; legend on top.]
ggplot(df2, aes(x = month, y = country)) +
  geom_tile(aes(fill = cases/1000), show.legend = FALSE) +
  geom_text(aes(label = cases), size = 2.5) +
  scale_fill_gradient(low = "white", high = "maroon") +
  labs(x = "", y = "", title = "Monthly confirmed cases in 2020")
[Figure: tile chart titled "Monthly confirmed cases in 2020", with the count printed in each tile:]
Country        Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Singapore       13      89     824   15243   18715    9023    8298    4607     953     250     203     381
Malaysia         8      21    2737    3236    1817     820     337     364    1884   20324   34149   47313
Indonesia        0       0    1528    8590   16355   29912   51991   66420  112212  123080  128795  204315
Colors
Colors can be thought of as a three-dimensional concept
consisting of
▶ Hue: Red, green, blue, ...
▶ Saturation: The purity of light, e.g., dull versus vivid.
▶ Brightness: The amount of light present, e.g., light versus
dark.
Apart from making graphs prettier and more pleasant to look
at, colors add solid functionality to visual representations.
1. Distinguish between different categorical groups.
2. Distinguish between the magnitude of continuous values.
Types of color palettes
These functionalities roughly correspond to different types of
color palettes.
▶ Qualitative color palettes for categorical data.
▶ To highlight distinction across groups.
▶ Sequential color palettes for continuous data.
▶ Use increasing intensity or saturation to indicate larger values.
▶ Diverging color palettes for data with a central neutral value.
▶ To put equal emphasis on extreme values at both ends of the data range.
▶ The value in the middle is represented by lighter colors.
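A diverging palette can be sketched with scale_fill_gradient2() from ggplot2, which takes low, mid, and high colors plus a midpoint. The data frame toy, its regions, and its values are hypothetical, chosen only so that the data have a natural neutral value of 0.

```r
library(ggplot2)

# Hypothetical month-on-month changes, centred around 0.
toy <- data.frame(
  month  = factor(rep(month.abb[1:6], times = 2), levels = month.abb[1:6]),
  region = rep(c("North", "South"), each = 6),
  change = c(-30, -10, 0, 15, 40, 20, 25, 5, 0, -5, -20, -35)
)

# Negative values shade toward blue, positive values toward maroon,
# and values near the midpoint (0) appear in the lighter middle color.
p_div <- ggplot(toy, aes(x = month, y = region, fill = change)) +
  geom_tile() +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "maroon",
                       midpoint = 0) +
  labs(x = "", y = "", fill = "Change")
p_div
```

Setting midpoint explicitly matters when the neutral value is not 0, for example a ratio centred at 1.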
Base R colors
To gain control over colors, we first need to define colors or
color palettes.
▶ Base R comes with 657 predefined colors.
▶ We can call them by names: col = "steelblue"
▶ The default color palette in R (using version 4.4.2):
palette()
## [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BB
## [8] "gray62"
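The named colors mentioned above can be inspected directly in base R. This is a small sketch using colors(), grep(), and palette(); the exact palette values may differ across R versions.

```r
# All color names known to base R.
all_colors <- colors()
length(all_colors)             # number of predefined color names

# Check that a name is valid before using it, e.g. col = "steelblue".
"steelblue" %in% all_colors

# Search for names that contain "blue", handy when looking for a shade.
head(grep("blue", all_colors, value = TRUE))

# The current default palette.
palette()
```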
HEX color codes
The palette() command returns colors in 6-digit HEX
(hexadecimal) codes.
▶ The six digits encode the color: two digits each for red, green, and blue.
▶ Black is #000000 and white is #FFFFFF.
▶ We can add two digits at the end to encode the degree of
opacity.
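To see how a HEX code breaks down, and how the two extra opacity digits are appended, here is a minimal base R sketch. It uses col2rgb(), rgb(), and adjustcolor() from grDevices; the choice of "maroon" and of 50% opacity is only an illustration.

```r
# Decompose a HEX code into its red, green, and blue components (0 to 255).
col2rgb("#DF536B")

# Convert a named color to its 6-digit HEX code.
maroon_hex <- rgb(t(col2rgb("maroon")), maxColorValue = 255)
maroon_hex

# Add opacity: adjustcolor() returns an 8-digit code,
# where the last two digits encode the alpha (opacity) level.
maroon_half <- adjustcolor("maroon", alpha.f = 0.5)
maroon_half
nchar(maroon_half)  # "#" plus 8 hex digits
```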
RGB color codes
Colors can also be represented using RGB (red-green-blue)
codes.
▶ An additive color model used for screens.
▶ Each code is specified with three parameters, each defining the intensity of one channel (red, green, or blue) as an integer between 0 and 255.
▶ rgb(0, 0, 0) is black.
▶ rgb(255, 255, 255) is white.
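In R itself, the rgb() function expects intensities on a 0 to 1 scale by default, so the 0 to 255 convention above needs maxColorValue = 255. A short sketch:

```r
# Build colors from 0-255 intensities; maxColorValue sets the scale.
black <- rgb(0, 0, 0, maxColorValue = 255)
white <- rgb(255, 255, 255, maxColorValue = 255)
c(black, white)   # "#000000" "#FFFFFF"

# The same colors on rgb()'s default 0-1 scale.
rgb(0, 0, 0)
rgb(1, 1, 1)

# Going the other way: RGB components of a named color.
col2rgb("steelblue")
```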
Using color packages
Most visualization packages, like ggplot2, provide their own
color palettes.
There is also a large number of R packages that supply
additional color support.
▶ ggthemes provides useful palettes such as Tableau color
palettes.
▶ viridis provides palettes that remain distinguishable to readers
with the most common forms of color blindness.
▶ RColorBrewer provides vibrant color palettes that are
also widely used in the R community.
ggthemes palettes
There are various color palettes available in ggthemes.
▶ Here’s one of those used in Tableau.
Classic 10
#1f77b4 #ff7f0e #2ca02c #d62728
#9467bd #8c564b #e377c2 #7f7f7f
#bcbd22 #17becf
Color blindness and the viridis palettes
A sizable proportion of the population can distinguish fewer
colors than others.
▶ Here’s how the base R palette would appear under different
forms of color blindness.
[Figure: the base R palette and the viridis palette side by side, each shown under normal vision, deuteranope, protanope, and desaturated views.]
RColorBrewer palettes
Sequential: YlOrRd, YlOrBr, YlGnBu, YlGn, Reds, RdPu, Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu, BuPu, BuGn, Blues
Qualitative: Set3, Set2, Set1, Pastel2, Pastel1, Paired, Dark2, Accent
Diverging: Spectral, RdYlGn, RdYlBu, RdGy, RdBu, PuOr, PRGn, PiYG, BrBG
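The palettes listed above can be queried programmatically. The sketch below assumes the RColorBrewer package is installed; brewer.pal() returns HEX codes for a named palette, and brewer.pal.info records whether each palette is sequential ("seq"), qualitative ("qual"), or diverging ("div").

```r
library(RColorBrewer)

# Five colors from the sequential "Blues" palette, as HEX codes.
blues <- brewer.pal(n = 5, name = "Blues")
blues

# How many palettes fall into each category.
table(brewer.pal.info$category)

# Names of the diverging palettes.
rownames(brewer.pal.info)[brewer.pal.info$category == "div"]

# display.brewer.all() draws every palette in one chart.
```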
Custom colors
▶ Specify a single color for a geom:
▶ Set color or fill to a specific color outside of aes().
▶ Assign colors by a variable in the data:
▶ Map color or fill to the variable of interest inside aes().
▶ Set custom color palettes, for example:
▶ scale_*_manual() for a custom set of colors.
▶ Additional color packages such as viridis, RColorBrewer,
and ggthemes.
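The three options above can be sketched side by side. The data frame toy and its columns are hypothetical; the point is only where the color specification sits relative to aes().

```r
library(ggplot2)

# Hypothetical data with two groups.
toy <- data.frame(
  x     = rep(1:5, times = 2),
  y     = c(1, 3, 2, 5, 4, 2, 2, 4, 3, 6),
  group = rep(c("a", "b"), each = 5)
)

# 1. Set a single color: the value sits outside aes().
p_set <- ggplot(toy, aes(x = x, y = y, group = group)) +
  geom_line(color = "steelblue")

# 2. Map color to a variable: the variable sits inside aes(),
#    and ggplot2 picks default colors and adds a legend.
p_map <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_line()

# 3. Replace the default colors with a custom set.
#    Naming the values ties each color to a specific level.
p_manual <- p_map +
  scale_color_manual(values = c(a = "maroon", b = "gray"))

p_set
p_map
p_manual
```

A common slip is writing aes(color = "steelblue"), which treats the string as a data value and maps it to a default color rather than drawing in steel blue.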
Example
Let us continue with the coronavirus data set, now focusing
on cases in Singapore in 2020.
df_sg <- coronavirus %>%
  mutate(date = ymd(date)) %>%
  filter(country == "Singapore", date <= "2020-12-31",
         type %in% c("confirmed", "recovery"))
head(df_sg)
## country date type cases
## 1 Singapore 2020-01-22 confirmed 0
## 2 Singapore 2020-01-22 recovery 0
## 3 Singapore 2020-01-23 confirmed 1
## 4 Singapore 2020-01-23 recovery 0
## 5 Singapore 2020-01-24 confirmed 2
## 6 Singapore 2020-01-24 recovery 0
p1 <- ggplot(df_sg, aes(x = date, y = cases, color = type)) +
  geom_line(lwd = 1) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(x = "", y = "Cases", color = "",
       title = "Confirmed and recovered cases in Singapore, 2020") +
  theme(legend.position = "top")
p1
[Figure: line chart titled "Confirmed and recovered cases in Singapore, 2020"; one line each for confirmed and recovery; x-axis: month (Feb to Jan), y-axis: cases (0 to 1000); legend on top.]
Manually specified colors
▶ Here we specify a discrete color scale with
scale_color_manual():
p1 +
  scale_color_manual(values = c("maroon", "gray"))
[Figure: the same line chart, with confirmed in maroon and recovery in gray.]
▶ The same data with geom_area(). Notice that type is now
mapped to the fill aesthetic.
p2 <- ggplot(df_sg, aes(x = date, y = cases, fill = type)) +
  geom_area() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  # color = "" has no effect here because type is mapped to fill;
  # use fill = "" instead to drop the legend title
  labs(x = "", y = "Confirmed cases", color = "",
       title = "Confirmed cases in Singapore, 2020") +
  theme(legend.position = "top") +
  scale_fill_manual(values = c("maroon", "gray"))
p2
[Figure: stacked area chart titled "Confirmed cases in Singapore, 2020"; fill: type (confirmed in maroon, recovery in gray), legend titled "type" on top; x-axis: month (Feb to Jan), y-axis: confirmed cases (0 to 2000).]
For RColorBrewer palettes, we use scale_fill_brewer():
▶ The palette argument specifies the name of the palette.
library(RColorBrewer)
p2 + scale_fill_brewer(palette = "Set3")
[Figure: the same stacked area chart, filled with the RColorBrewer Set3 palette.]
▶ We can also apply a viridis palette with scale_fill_viridis().
library(viridis)
p2 + scale_fill_viridis(option = "viridis", discrete = TRUE)
[Figure: the same stacked area chart, filled with the viridis palette.]
More on accessibility
Apart from inclusive colors, we can also include Alt Text
(alternative text).
▶ The goal is to make our visuals more accessible to everyone.
▶ Used in HTML pages, often displayed in place of or below
the figure.
Source: Mary Cesal.
Examples
[Figures: two slides of examples; images not included.]
▶ In R Markdown, one way to include Alt Text is through the
fig.cap chunk option.
▶ The text will be displayed below the figure.
[Figure: line chart titled "Daily confirmed cases Singapore, 2020"; x-axis: month (Feb to Jan), y-axis: cases.]
Line chart that shows daily confirmed COVID-19 cases in
Singapore in 2020, with a prominent peak in late April;
daily cases drop sharply after August and remain low for the rest of
the year.
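As a rough sketch of how such a caption might be attached, the chunk body below sets fig.cap through a `#|` comment line at the top of the chunk, which assumes knitr 1.35 or later. The classic alternative is to write fig.cap = "..." in the chunk header. The data frame toy is hypothetical.

```r
#| fig.cap: "Line chart of hypothetical daily cases that rise steadily to a peak in mid-April and then fall back."
library(ggplot2)

# Hypothetical daily counts, for illustration only.
toy <- data.frame(
  date  = as.Date("2020-04-01") + 0:27,
  cases = c(seq(10, 140, by = 10), seq(140, 10, by = -10))
)

p_alt <- ggplot(toy, aes(x = date, y = cases)) +
  geom_line() +
  labs(x = "", y = "Cases", title = "Daily cases (toy data)")
p_alt
```

Recent knitr versions also appear to offer a separate fig.alt option for HTML output, which may be closer to true alt text because it is read by screen readers rather than displayed; check the knitr documentation for your version before relying on it.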
Summary on ggplot2 functions
▶ Here are some of the geom functions we learned so far.
ggplot +
▶ geom_point, geom_smooth, geom_hline (/vline), geom_histogram, geom_density
▶ geom_line, geom_text (/label), geom_col (/bar), geom_polygon (/sf), geom_tile, geom_area
Visualization principles
Now that we are equipped with the visualization functions, we
revisit some of the principles covered in Week 9.
1. Include the baseline (typically 0).
2. Pie charts (best to avoid them).
3. Partial transparency and jittering (to handle overplotting).
4. Color coding (sequential palette for a continuous variable;
diverging palette for data with a meaningful midpoint).
5. Small multiples (same scale on the axes).
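Principle 3 in the list above can be illustrated with the mpg data set that ships with ggplot2, where many cars share the same integer mileage values. This is a minimal sketch; the jitter widths and the alpha level are arbitrary choices.

```r
library(ggplot2)

# Overplotted: many points sit exactly on top of each other.
p_raw <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Jittering adds a little random noise to each position, and
# partial transparency (alpha) makes dense regions appear darker.
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.4)

p_raw
p_jitter
```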
In Week 10, we discussed three graphs in the wild:
Your turn
The following data sets are available on Canvas:
▶ wk12_streamingUS.csv
▶ wk12_cereal_consumption.xlsx
▶ wk12_time_use.csv
Identify the geoms used and re-create the plots in ggplot2.
Explore the data and think about possible ways to
revise/improve the visualizations.
Plans in Week 13
Week 13:
▶ Review session on Monday. There will be no lecture on
Wednesday.
▶ Tutorials as usual.
▶ Wrap up your group project and submit it by the extended
due date: Saturday, April 19 11:59pm.
▶ Only one submission is required for each group.