DSA2101
Essential Data Analytics Tools: Data Visualization
Yuting Huang
AY24/25
Week 11 Introduction to ggplot2
Re-cap: Choosing the right plot
There are many geoms available in the ggplot2 package.
The choice of which one to use largely depends on two questions:
▶ What are you trying to communicate?
▶ What type of variable(s) do you want to show?
Source: Adapted from John F. Ouyang.
Prerequisites
▶ ggplot2 is included in tidyverse.
library(tidyverse)
Artwork by Allison Horst
Outline
1. Aesthetics and geometrical objects
▶ Scatterplot
▶ Smoother line
▶ Histogram and density plot
▶ Line plot
▶ Text annotations
▶ Bar plot
▶ Maps
2. Miscellaneous tasks
▶ Themes
▶ Layouts
▶ Common layers
Bar plot
We use a bar plot to visualize categorical variables.
▶ geom_col() creates bars where the height directly represents
values in the data.
▶ geom_bar() creates bars based on the count of observations in
each group – the counts are obtained through an internal
aggregation.
Some of the aesthetics that these geom functions use are:
▶ x (required)
▶ y (not required by geom_bar())
▶ color
▶ fill
▶ width
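As a small illustration of these aesthetics, the sketch below uses a made-up data frame (toy, not the murders data). It maps fill to a variable inside aes() and sets color and width as constants outside it. layer_data() is used only to inspect what the layer computed.

```r
library(ggplot2)

# Hypothetical toy data: one row per observation (not murders.csv)
toy <- data.frame(region = c("South", "South", "West", "Northeast", "West", "South"))

p_bar <- ggplot(toy, aes(x = region, fill = region)) +  # fill mapped to a variable
  geom_bar(color = "black", width = 0.6)                # outline color and bar width set as constants

# Inspect what the layer computed: one row per bar, with its count
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count")]
```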
Bar plot: geom_col()
Let’s continue working on the murders.csv data set.
▶ Here’s a bar chart on the number of states in each region.
murders <- read.csv("../data/murders.csv")
state_by_region <- murders %>% count(region)
ggplot(state_by_region, aes(x = region, y = n)) +
  geom_col()
[Figure: bar chart of n by region (North Central, Northeast, South, West)]
Bar plot: geom_bar()
Alternatively, we can use geom_bar() to visualize the data.
▶ The function automatically counts the number of observations for
each x value – there’s no need to summarize the data beforehand.
ggplot(murders, aes(x = region)) +
  geom_bar()
[Figure: bar chart of count by region (North Central, Northeast, South, West)]
▶ By default, geom_bar() uses stat = "count", which counts the
number of observations in each group.
▶ To use the values directly from the data, set stat = "identity".
ggplot(state_by_region, aes(x = region, y = n)) +
  geom_bar(stat = "identity")
[Figure: bar chart of n by region (North Central, Northeast, South, West)]
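The three routes shown so far should give the same bar heights. The sketch below checks this on a made-up data frame (toy stands in for murders.csv) by comparing the values each layer computes.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for murders.csv
toy <- data.frame(region = c("South", "West", "South", "Northeast", "West", "South"))

# Route 1: summarise first, then draw the heights as given
toy_counts <- toy %>% count(region)
p_col <- ggplot(toy_counts, aes(x = region, y = n)) + geom_col()

# Route 2: let geom_bar() count internally (stat = "count")
p_bar <- ggplot(toy, aes(x = region)) + geom_bar()

# Route 3: geom_bar() with pre-computed heights (stat = "identity")
p_identity <- ggplot(toy_counts, aes(x = region, y = n)) +
  geom_bar(stat = "identity")

# Bar heights as computed by each layer
heights_col      <- layer_data(p_col)$y
heights_bar      <- layer_data(p_bar)$count
heights_identity <- layer_data(p_identity)$y
```

In practice, geom_bar() is convenient for raw data and geom_col() for data that are already summarised.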
Maps: geom_polygon()
Plotting geo-spatial data is a common visualization task.
▶ The simplest way to draw maps is to use geom_polygon().
▶ We will need the latitude and longitude of the boundaries for
different regions.
▶ For US states, we can obtain the data from the maps package.
# install.packages("maps")
library(maps)
us_states <- map_data("state")
Maps: geom_polygon()
In the object us_states, we have the following variables:
▶ lat and long specify the latitude and longitude of the vertices of
a polygon.
▶ group provides a unique id for each polygon; a state drawn as
several separate pieces has several groups.
▶ order provides the drawing order of boundary points.
glimpse(us_states)
## Rows: 15,537
## Columns: 6
## $ long <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.570
## $ lat <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665, 3
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## $ region <chr> "alabama", "alabama", "alabama", "alabama", "alabam
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
Let’s first visualize the data using geom_point().
▶ Each row in the data is plotted as a single point – forming a
scatterplot that effectively shows the outline of every state.
ggplot(data = us_states, aes(x = long, y = lat)) +
  geom_point(size = 0.25)
[Figure: scatterplot of lat against long; the points trace the outline of each state]
Maps: geom_polygon()
Now let’s turn this scatter plot into a map using geom_polygon().
▶ group specifies how to connect the coordinates into polygons.
▶ The points are connected in the order the rows appear in the data;
the order column records that sequence.
ggplot(data = us_states, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "white", fill = "lightblue")
[Figure: map of US states, filled light blue with white borders]
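To see what group and row order do, here is a minimal sketch on made-up data: two unit squares in a data frame called squares (hypothetical, laid out like us_states). It assumes that geom_polygon() joins points in row order, so shuffled rows can be repaired by sorting on the order column.

```r
library(ggplot2)
library(dplyr)

# Hypothetical data: two unit squares, four corners each
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = c(1, 1, 1, 1,  2, 2, 2, 2),
  order = 1:8
)

# group = group keeps the two squares separate; without it,
# all eight points would be joined into a single shape
p_ok <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "white", fill = "lightblue")

# Shuffling the rows changes the sequence in which corners are connected ...
set.seed(1)
shuffled <- squares[sample(nrow(squares)), ]

# ... and sorting by the order column restores the original outline
restored <- shuffled %>% arrange(order)
```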
▶ Add data to fill each state according to its population. We shall
continue using the gun murders data, murders.csv.
▶ The first step is to merge the two data sets.
murders <- read.csv("../data/murders.csv")
df <- murders %>%
  mutate(state = tolower(state)) %>%
  left_join(us_states, by = c("state" = "region"))
glimpse(df)
## Rows: 15,539
## Columns: 10
## $ state <chr> "alabama", "alabama", "alabama", "alabama", "alaba
## $ abb <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A
## $ region <chr> "South", "South", "South", "South", "South", "Sout
## $ population <int> 4779736, 4779736, 4779736, 4779736, 4779736, 47797
## $ total <int> 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,
## $ long <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.57
## $ lat <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665,
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
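A left join keeps every row of the left table, even when the map data have no match. The sketch below shows this on two made-up miniature tables (toy_murders and toy_shapes are hypothetical, with invented values). The same behavior would be consistent with df having two more rows than us_states, although that is not verified here.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (values are invented)
toy_murders <- data.frame(
  state      = c("Alabama", "Hawaii"),
  population = c(100, 200)
)
toy_shapes <- data.frame(
  region = c("alabama", "alabama", "alabama"),
  long   = c(-87.5, -87.4, -87.6),
  lat    = c(30.4, 30.3, 30.3)
)

joined <- toy_murders %>%
  mutate(state = tolower(state)) %>%               # match the lowercase names in the map data
  left_join(toy_shapes, by = c("state" = "region"))

# States without boundary data are kept as one row with NA coordinates
unmatched <- joined %>% filter(is.na(long))

# anti_join() lists the unmatched states directly
no_shape <- toy_murders %>%
  mutate(state = tolower(state)) %>%
  anti_join(toy_shapes, by = c("state" = "region"))
```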
Maps: geom_polygon()
ggplot(df, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(color = "white") +
  theme(legend.position = "top")
[Figure: map of US states filled by region (North Central, Northeast, South, West), legend on top]
Maps: geom_polygon()
ggplot(df, aes(x = long, y = lat, group = group, fill = population / 1e6)) +
  geom_polygon(color = "white") +
  theme(legend.position = "top")
[Figure: map of US states filled by population/1e+06 on a continuous scale (legend ticks at 10, 20, 30), legend on top]
▶ Once a map is created, we often need to modify its color scheme.
In this example, we do so with scale_fill_continuous().
▶ Also, theme_void() replaces the default theme with an empty one
(no axes, grid lines, or background).
ggplot(df, aes(x = long, y = lat, group = group, fill = population / 1e6)) +
  geom_polygon(color = "white") +
  scale_fill_continuous(name = "Population (millions)",
                        low = "lightgray", high = "steelblue") +
  theme_void() +
  theme(legend.position = "top")
[Figure: the same population map with a light gray to steel blue gradient, legend titled "Population (millions)" with ticks at 10, 20, 30]
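Other fill scales can be swapped in the same way. The sketch below uses made-up data (squares, with an invented value column) to compare scale_fill_gradient(), which spells out the two-color gradient, with scale_fill_viridis_c() as one possible alternative palette.

```r
library(ggplot2)

# Hypothetical data: two unit squares with a made-up numeric value each
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  value = rep(c(5, 30), each = 4)
)

p_base <- ggplot(squares, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(color = "white") +
  theme_void() +
  theme(legend.position = "top")

# Two-color gradient, spelled out with scale_fill_gradient()
p_gradient <- p_base +
  scale_fill_gradient(name = "Value", low = "lightgray", high = "steelblue")

# An alternative continuous palette from the viridis family
p_viridis <- p_base +
  scale_fill_viridis_c(name = "Value")

# One fill color per distinct value
fill_colors <- unique(layer_data(p_gradient)$fill)
```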
More on maps
Other maps available through map_data():
▶ Countries: usa, france, italy, nz
▶ Within the US: county, state
▶ World: world, world2
Other packages and methods for drawing maps exist, but you will need
to find the geographic data that define the map boundaries yourself.
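As a sketch of how one of the other maps can be used, the code below draws the nz map with the same geom_polygon() recipe. It assumes the maps package is installed; coord_quickmap() is an optional addition, not part of the earlier examples.

```r
library(ggplot2)
# install.packages("maps")
library(maps)

# Same column layout as map_data("state")
nz <- map_data("nz")

p_nz <- ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "white", fill = "lightblue") +
  coord_quickmap() +   # optional: keeps an approximately correct aspect ratio
  theme_void()
```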
Singapore planning regions
The file sg_masterplan2019.rds contains Singapore’s planning area
boundaries as of 2019.
▶ The original data come from the Urban Redevelopment
Authority.
▶ We will need the sf package before loading the data.
# install.packages("sf")
library(sf)
sg_map <- readRDS("../data/sg_masterplan2019.rds")
class(sg_map)
## [1] "sf" "tbl_df" "tbl" "data.frame"
glimpse(sg_map)
## Rows: 55
## Columns: 2
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((103.9321 1...., MULTIPO
## $ town <chr> "BEDOK", "BOON LAY", "BUKIT BATOK", "BUKIT MERAH", "
Simple features (sf) is a common storage and access model for
geographic features with spatial geometries.
▶ At its most basic, an sf object is a data frame that contains a
special geometry column holding the spatial aspects of the features.
▶ In sg_map, these are the coordinates that describe the town
boundaries.
Artwork by Allison Horst
Maps: geom_sf()
We will use a special geom, geom_sf(), to visualize sf objects.
▶ The function uses a unique aesthetic: geometry.
ggplot(sg_map) +
  geom_sf(aes(geometry = geometry),
          fill = "lightgray", color = "white") +
  theme_void()
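To try geom_sf() without the data file, the sketch below builds a tiny sf object by hand. The towns, values, and the helper unit_square() are all made up, and no coordinate reference system is set. It also maps fill to a column, in the same way as the population map earlier.

```r
library(ggplot2)
# install.packages("sf")
library(sf)

# Hypothetical helper: a closed unit square with lower-left corner (x0, y0)
unit_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Made-up sf object: two "towns" with an invented value each
toy_sf <- st_sf(
  town     = c("A", "B"),
  value    = c(10, 25),
  geometry = st_sfc(unit_square(0, 0), unit_square(1, 0))
)

# geom_sf() picks up the geometry column; fill is mapped to a data column
p_sf <- ggplot(toy_sf) +
  geom_sf(aes(geometry = geometry, fill = value), color = "white") +
  theme_void()
```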
Second summary on ggplot2
Summary of some of the geoms we learned this week:
▶ Bar plots: ggplot() + geom_col() or geom_bar()
▶ Maps: ggplot() + geom_polygon() or geom_sf()
ggplot2 themes
The default ggplot2 theme, theme_gray(), draws plots on a light gray
background. The designers give several reasons for this choice:
1. White grid lines are visible, yet easy to tune out, keeping the
data prominent.
2. The gray background gives the plot a typographic color similar to
that of the surrounding text, so the graphic does not jump out.
3. It creates a continuous field of color, which ensures that the plot
is perceived as a single visual entity.
You may agree or disagree with these points. If you would like to
alter some of these elements, you can do so by selecting a different
theme for your plot.
Themes
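A minimal sketch of switching themes, using a made-up data frame (toy). The complete themes shown are a few of those built into ggplot2; the elements adjusted in theme() are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data (not murders.csv)
toy <- data.frame(region = c("South", "West", "South", "Northeast"))
p <- ggplot(toy, aes(x = region)) + geom_bar()

p_gray  <- p + theme_gray()    # the default theme, written out explicitly
p_bw    <- p + theme_bw()      # white background with a panel border
p_light <- p + theme_light()   # white background with light gray lines

# Individual elements can be adjusted on top of a complete theme
p_custom <- p +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),         # drop minor grid lines
        axis.title = element_text(face = "bold"))   # bold axis titles
```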
Layouts
▶ To combine separate ggplots into one, we can use patchwork, an
extension to ggplot2.
library(patchwork)
p1 <- ggplot(murders, aes(x = region)) + geom_bar()
p2 <- p1 + theme_minimal() + labs(title = "Minimal")
p3 <- p1 + theme_classic() + labs(title = "Classic")
p1 + p2 + p3
[Figure: three bar charts of count by region, side by side: the default theme (untitled), "Minimal", and "Classic"]
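Beyond p1 + p2 + p3, patchwork offers operators for rows and columns. The sketch below rebuilds the three plots from a made-up data frame (toy) and shows a few arrangements; the shared title text is invented.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data (not murders.csv)
toy <- data.frame(region = c("South", "West", "South", "Northeast"))
p1 <- ggplot(toy, aes(x = region)) + geom_bar()
p2 <- p1 + theme_minimal() + labs(title = "Minimal")
p3 <- p1 + theme_classic() + labs(title = "Classic")

side_by_side <- p1 | p2          # same row
stacked      <- p1 / p2          # one above the other
nested       <- (p1 | p2) / p3   # two on top, one full-width below

# plot_layout() controls the grid; plot_annotation() adds a shared title
annotated <- p1 + p2 + p3 +
  plot_layout(ncol = 2) +
  plot_annotation(title = "Same data, three themes")
```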
Layered grammar of graphics
Our initial template can be extended to:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
                  stat = <STAT>, position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>
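One way to fill in every slot of the template is sketched below. The data frame toy and its columns are made up, and the particular coordinate, facet, scale, and theme functions are just one possible choice for each slot.

```r
library(ggplot2)

# Hypothetical toy data (not from the lecture files)
toy <- data.frame(
  region = c("South", "South", "West", "West", "South", "West"),
  size   = c("small", "large", "small", "small", "large", "large"),
  year   = c(2010, 2010, 2010, 2011, 2011, 2011)
)

p_full <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = region, fill = size),
           stat = "count", position = "dodge") +   # <GEOM_FUNCTION> with stat and position
  coord_flip() +                                   # <COORDINATE_FUNCTION>
  facet_wrap(~ year) +                             # <FACET_FUNCTION>
  scale_fill_brewer(palette = "Set2") +            # <SCALE_FUNCTION>
  theme_minimal()                                  # <THEME_FUNCTION>

# Every observation ends up in exactly one bar
full_data <- layer_data(p_full)
```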
Figure generation pipelines
Most visualization is done for the purpose of communication.
▶ Who’s your audience?
▶ What’s the insight you’d like to convey?
Visualizations should be “autogenerated” as part of our data
analysis pipeline (which should also be automated).
▶ Ready for printing and sharing, without manual post-processing
needed.
▶ Easy to tweak and regenerate, since revising a graph is a
frequent need in data analysis.
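A minimal sketch of an autogenerated figure: the plot is built and written to disk by code alone, so re-running the script regenerates it. The data frame toy, the file name, and the figure size are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data (not murders.csv)
toy <- data.frame(region = c("South", "West", "South", "Northeast"))
p <- ggplot(toy, aes(x = region)) + geom_bar()

# Hypothetical output location; a real pipeline would use a project folder
out_file <- file.path(tempdir(), "region_counts.pdf")

# A fixed size keeps the figure identical across re-runs
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in")
```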
Additional reading: Ch. 28 & 29 in Fundamentals of Data
Visualization (via Canvas).
ggplot2 and extensions
ggplot2 is a system for declaratively creating graphics, included in
tidyverse.
▶ Besides the functions we covered in lecture, explore more at
▶ https://ggplot2-book.org/
▶ Also, the ggplot2 extensions:
▶ https://exts.ggplot2.tidyverse.org/gallery/