Week11 Slides

The document provides an introduction to ggplot2, a data visualization package in R, focusing on various geometrical objects such as bar plots and maps. It explains how to choose the appropriate plot type based on the data and communication goals, and covers the use of aesthetics and themes in visualizations. Additionally, it discusses the integration of ggplot2 with other packages for enhanced mapping capabilities and the importance of effective visual communication in data analysis.


DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 11 Introduction to ggplot2

Re-cap: Choosing the right plot
There are many geoms available in the ggplot2 package.
The choice of which one to use largely depends on two questions:
▶ What are you trying to communicate?
▶ What type of variable(s) do you want to show?

Source: Adapted from John F. Ouyang.

Prerequisites
▶ ggplot2 is included in tidyverse.

library(tidyverse)

Artwork by Allison Horst

Outline

1. Aesthetics and geometrical objects


▶ Scatterplot
▶ Smoother line
▶ Histogram and density plot
▶ Line plot
▶ Text annotations
▶ Bar plot
▶ Maps
2. Miscellaneous tasks
▶ Themes
▶ Layouts
▶ Common layers

Bar plot

We use a bar plot to visualize categorical variables.


▶ geom_col() creates bars where the height directly represents
values in the data.
▶ geom_bar() creates bars based on the count of observations in
each group – the counts are obtained through an internal
aggregation.

Some of the aesthetics that these geom functions use are:


▶ x (required)
▶ y (not required by geom_bar())
▶ color
▶ fill
▶ width
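
As a minimal sketch of how the optional aesthetics listed above behave, the following sets color, fill, and width as fixed values on geom_col(). The sales data frame and its columns are made up for illustration and are not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data, not the murders data set used in the slides
sales <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  units = c(12, 7, 9)
)

p <- ggplot(sales, aes(x = fruit, y = units)) +
  # color sets the bar outline, fill the interior, and width the bar width
  # (relative to the unit spacing between categories)
  geom_col(color = "black", fill = "steelblue", width = 0.6)
p
```

Placing these arguments outside aes() applies one fixed value to every bar; placing them inside aes() would map them to a variable instead.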

Bar plot: geom_col()
Let’s continue working on the murders.csv data set.
▶ Here’s a bar chart on the number of states in each region.

murders <- read.csv("../data/murders.csv")

state_by_region <- murders %>% count(region)
ggplot(state_by_region, aes(x = region, y = n)) +
  geom_col()

[Figure: bar chart with region (North Central, Northeast, South, West) on the x-axis and n on the y-axis.]

Bar plot: geom_bar()
Alternatively, we can use geom_bar() to visualize the data.
▶ The function automatically counts the number of observations for
each x value – there’s no need to summarize the data beforehand.

ggplot(murders, aes(x = region)) +
  geom_bar()

[Figure: bar chart with region (North Central, Northeast, South, West) on the x-axis and count on the y-axis.]

▶ By default, geom_bar() uses stat = "count", which means to
count the number of observations in each group.
▶ To use the values directly from the data, set stat = "identity".

ggplot(state_by_region, aes(x = region, y = n)) +
  geom_bar(stat = "identity")

[Figure: bar chart with region (North Central, Northeast, South, West) on the x-axis and n on the y-axis.]

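
One way to convince yourself that geom_col() and geom_bar(stat = "identity") draw the same bars is to compare the data each layer computes. This is a small sketch on a made-up counts data frame; layer_data() is a ggplot2 helper that returns the data a layer uses for drawing.

```r
library(ggplot2)

# Hypothetical pre-summarized counts (made-up numbers)
counts <- data.frame(group = c("a", "b", "c"), n = c(3, 5, 2))

p_col <- ggplot(counts, aes(x = group, y = n)) + geom_col()
p_bar <- ggplot(counts, aes(x = group, y = n)) + geom_bar(stat = "identity")

# Compare the bar heights that each layer computes
heights_col <- as.numeric(layer_data(p_col)$y)
heights_bar <- as.numeric(layer_data(p_bar)$y)
identical(heights_col, heights_bar)
```

In practice geom_col() is the shorter spelling when the data are already summarized.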
Maps: geom_polygon()

Plotting geo-spatial data is a common visualization task.


▶ The simplest way to draw maps is to use geom_polygon().
▶ We will need the latitude and longitude of the boundaries for
different regions.
▶ For US states, we can obtain the data from the maps package.

# install.packages("maps")
library(maps)
us_states <- map_data("state")

Maps: geom_polygon()
In the object us_states, we have the following variables:
▶ lat and long specify the latitude and longitude of the vertices of
a polygon.
▶ group provides a unique id for each polygon; a region made up of
several separate pieces has several groups.
▶ order provides the drawing order of the boundary points.

glimpse(us_states)

## Rows: 15,537
## Columns: 6
## $ long <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.570
## $ lat <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665, 3
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## $ region <chr> "alabama", "alabama", "alabama", "alabama", "alabam
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,

Let’s first visualize the data using geom_point().
▶ Each row in the data is plotted as a single point – forming a
scatterplot that effectively shows the outline of every state.

ggplot(data = us_states, aes(x = long, y = lat)) +
  geom_point(size = 0.25)

[Figure: scatterplot of lat against long; the points trace the outline of every state.]

Maps: geom_polygon()
Now let’s turn this scatter plot into a map using geom_polygon().
▶ group specifies which coordinates belong to the same polygon.
▶ Within each group, points are connected in the order in which the
rows appear, so the data must stay sorted by the order column.

ggplot(data = us_states, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "white", fill = "lightblue")

[Figure: map of the US states as light blue polygons with white borders; long on the x-axis, lat on the y-axis.]

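
To see what the group aesthetic does in isolation, here is a small sketch with two hand-made squares. The squares data frame, its coordinates, and the id column are invented for illustration.

```r
library(ggplot2)

# Hypothetical coordinates: two unit squares, one id per polygon
squares <- data.frame(
  x  = c(0, 1, 1, 0,   2, 3, 3, 2),
  y  = c(0, 0, 1, 1,   0, 0, 1, 1),
  id = rep(c("left", "right"), each = 4)
)

# With group = id, each square is drawn as its own closed polygon
p_grouped <- ggplot(squares, aes(x = x, y = y, group = id)) +
  geom_polygon(color = "white", fill = "lightblue")

# Without a grouping variable, all eight vertices are joined into one shape
p_ungrouped <- ggplot(squares, aes(x = x, y = y)) +
  geom_polygon(color = "white", fill = "lightblue")

p_grouped
p_ungrouped
```

The same thing happens with map data: dropping group = group would connect the last vertex of one state to the first vertex of the next.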
▶ Add data to fill each state according to its population. We shall
continue using the gun murders data, murders.csv.
▶ The first step is to merge the two data sets.

murders <- read.csv("../data/murders.csv")

df <- murders %>%
  mutate(state = tolower(state)) %>%
  left_join(us_states, by = c("state" = "region"))
glimpse(df)

## Rows: 15,539
## Columns: 10
## $ state <chr> "alabama", "alabama", "alabama", "alabama", "alaba
## $ abb <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A
## $ region <chr> "South", "South", "South", "South", "South", "Sout
## $ population <int> 4779736, 4779736, 4779736, 4779736, 4779736, 47797
## $ total <int> 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,
## $ long <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.57
## $ lat <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665,
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
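
The merged data has 15,539 rows while us_states has 15,537. A likely explanation is that left_join() keeps rows of murders that have no outline in the map data, filling the map columns with NA. The sketch below shows how one might check for such rows with anti_join(); the two miniature tables are made up and only mimic the structure of the real ones.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
stats <- data.frame(state = c("Alabama", "Alaska", "Texas"),
                    population = c(4.8, 0.7, 25.1))
outline <- data.frame(region = c("alabama", "alabama", "texas", "texas"),
                      long = c(-87.5, -87.6, -94.0, -94.1),
                      lat  = c(30.4, 30.3, 33.5, 33.6))

stats_lower <- stats %>% mutate(state = tolower(state))

# Rows of stats that have no matching outline
unmatched <- anti_join(stats_lower, outline, by = c("state" = "region"))
unmatched$state

# left_join() keeps the unmatched row, with NA in the outline columns
joined <- left_join(stats_lower, outline, by = c("state" = "region"))
nrow(joined)
sum(is.na(joined$long))
```

Running the same anti_join() on the real murders and us_states tables would list exactly which states are missing from the map.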
Maps: geom_polygon()

ggplot(df, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(color = "white") +
  theme(legend.position = "top")

[Figure: US map with states filled by region (North Central, Northeast, South, West); legend at the top; long on the x-axis, lat on the y-axis.]

Maps: geom_polygon()

ggplot(df, aes(x = long, y = lat, group = group, fill = population / 1e6)) +
  geom_polygon(color = "white") +
  theme(legend.position = "top")

[Figure: US map with states filled by population/1e+06 on a continuous scale; legend at the top; long on the x-axis, lat on the y-axis.]

▶ Once a map is created, we often need to modify its color scheme,
here with scale_fill_continuous().
▶ Also, theme_void() removes the axes, grid lines, and background.

ggplot(df, aes(x = long, y = lat, group = group, fill = population / 1e6)) +
  geom_polygon(color = "white") +
  scale_fill_continuous(name = "Population (millions)",
                        low = "lightgray", high = "steelblue") +
  theme_void() +
  theme(legend.position = "top")

[Figure: US map without axes; states shaded from light gray to steel blue by population; legend "Population (millions)" at the top.]

More on maps

Other maps available through map_data():


▶ Countries: usa, france, italy, nz
▶ Within the US: county, state
▶ World: world, world2
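
The other maps follow the same pattern as the state map. As a sketch, the following draws the nz map; it assumes the maps package is installed, and coord_quickmap() is an addition not used elsewhere in these slides (it applies an approximate aspect-ratio correction).

```r
library(ggplot2)
# install.packages("maps")
library(maps)

nz <- map_data("nz")

p_nz <- ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "white", fill = "lightblue") +
  coord_quickmap() +  # approximate aspect-ratio correction (an addition)
  theme_void()
p_nz
```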

There are other packages and methods for maps. But you will need to
do your research to look for geographic information that defines the
map boundaries.

Singapore planning regions
The file sg_masterplan2019.rds contains Singapore’s planning area
boundaries in 2019.
▶ The original data come from the Urban Redevelopment
Authority.
▶ We will need the sf package before loading the data.

# install.packages("sf")
library(sf)
sg_map <- readRDS("../data/sg_masterplan2019.rds")
class(sg_map)

## [1] "sf" "tbl_df" "tbl" "data.frame"

glimpse(sg_map)

## Rows: 55
## Columns: 2
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((103.9321 1...., MULTIPO
## $ town <chr> "BEDOK", "BOON LAY", "BUKIT BATOK", "BUKIT MERAH", "
Simple features (sf) is a common storage and access model for
geographic features with spatial geometries.
▶ At its most basic, an sf object is a data frame that contains a
special geometry column holding the spatial aspects of the features.
▶ In sg_map, the geometry column holds the coordinates that describe
the town boundaries.
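
To make the structure of an sf object concrete, here is a minimal sketch that builds one by hand. The town name and coordinates are invented, and the coordinate reference system (EPSG 4326, i.e. longitude and latitude in degrees) is an assumption rather than something stated about sg_map.

```r
library(sf)

# Hypothetical square "town"; the coordinates are invented.
# A polygon ring must close: the first and last points are identical.
ring <- rbind(c(103.80, 1.30), c(103.85, 1.30), c(103.85, 1.35),
              c(103.80, 1.35), c(103.80, 1.30))

toy_map <- st_sf(
  town = "TOY TOWN",
  geometry = st_sfc(st_polygon(list(ring)), crs = 4326)  # assumed CRS
)

class(toy_map)
st_geometry_type(toy_map)
```

Apart from the geometry column, toy_map behaves like an ordinary data frame, so the usual dplyr verbs can be applied to it.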

Artwork by Allison Horst


Maps: geom_sf()
We will use a special geom, geom_sf(), to visualize sf objects.
▶ The function uses a special aesthetic: geometry.

ggplot(sg_map) +
  geom_sf(aes(geometry = geometry),
          fill = "lightgray", color = "white") +
  theme_void()

Second summary on ggplot2

Summary of some of the geoms we learned this week:

ggplot + geom_col (/bar) + geom_polygon (/sf)

ggplot2 themes

By default, a ggplot2 graph has a light gray background.

The designers give several reasons for this choice:

1. White grid lines are visible, yet easy to tune out, keeping the
data prominent.
2. The gray background gives the plot a typographic color similar to
that of the surrounding text, so the plot does not jump out of the page.
3. It creates a continuous field of color which ensures that the plot
is perceived as a single visual entity.

You may agree or disagree with these points. To alter some of these
elements, select a different theme for your plot.

Themes
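
As a brief sketch of what switching themes looks like in code, the following applies a few of the complete themes that ship with ggplot2 to the same plot, then adjusts individual elements with theme(). The data frame is made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data
df <- data.frame(x = c("a", "b", "c"), y = c(2, 4, 3))
p <- ggplot(df, aes(x = x, y = y)) + geom_col()

p_default <- p                    # theme_gray(): gray panel, white grid lines
p_bw      <- p + theme_bw()       # white panel with a border
p_minimal <- p + theme_minimal()  # no panel background or border

# Individual elements can also be changed with theme()
p_custom <- p + theme(
  panel.background = element_rect(fill = "white"),
  panel.grid.major = element_line(color = "gray90")
)
p_custom
```

A complete theme such as theme_bw() replaces all theme settings at once, so any theme() tweaks should be added after it.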

Layouts
▶ To combine separate ggplots into one, we can use patchwork, an
extension to ggplot2.

library(patchwork)
p1 <- ggplot(murders, aes(x = region)) + geom_bar()
p2 <- p1 + theme_minimal() + labs(title = "Minimal")
p3 <- p1 + theme_classic() + labs(title = "Classic")
p1 + p2 + p3

[Figure: three bar charts of count by region side by side: the default theme, "Minimal", and "Classic".]

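
Beyond placing plots side by side with +, patchwork offers a few more layout tools. The sketch below, on made-up data, uses the | and / operators together with plot_layout() and plot_annotation(); the plot titles are chosen only for illustration.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data
df <- data.frame(x = c("a", "b", "c"), y = c(2, 4, 3))
p1 <- ggplot(df, aes(x = x, y = y)) + geom_col() + labs(title = "Default")
p2 <- p1 + theme_minimal() + labs(title = "Minimal")
p3 <- p1 + theme_classic() + labs(title = "Classic")

# | places plots side by side, / stacks them vertically
stacked <- (p1 | p2) / p3

# plot_layout() controls the grid; plot_annotation() adds an overall title
combined <- p1 + p2 + p3 +
  plot_layout(ncol = 3) +
  plot_annotation(title = "Three themes, one plot")
combined
```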
Layered grammar of graphics

Our initial template can be extended to:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
                  stat = <STAT>, position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>

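
One possible way to fill in every slot of the template is sketched below. The data frame and the particular choices (dodged bars, flipped coordinates, manual fill colors) are illustrative assumptions, not a recommended recipe.

```r
library(ggplot2)

# Hypothetical toy data: one row per observation
df <- data.frame(
  group = rep(c("a", "b"), times = 4),
  kind  = rep(c("x", "y"), each = 4),
  panel = rep(c("p", "p", "q", "q"), times = 2)
)

p <- ggplot(data = df) +                                # <DATA>
  geom_bar(mapping = aes(x = group, fill = kind),       # <GEOM_FUNCTION>, <MAPPINGS>
           stat = "count", position = "dodge") +        # <STAT>, <POSITION>
  coord_flip() +                                        # <COORDINATE_FUNCTION>
  facet_wrap(~ panel) +                                 # <FACET_FUNCTION>
  scale_fill_manual(values = c(x = "steelblue",
                               y = "lightgray")) +      # <SCALE_FUNCTION>
  theme_minimal()                                       # <THEME_FUNCTION>
p
```

Most plots need only a few of these layers; the rest fall back to sensible defaults.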
Figure generation pipelines

Most visualization is done for the purpose of communication.


▶ Who’s your audience?
▶ What’s the insight you’d like to convey?

Visualizations should be “autogenerated” as part of our data
analysis pipeline (which should also be automated).
▶ Ready for printing and sharing, with no manual post-processing
needed.
▶ Easy to tweak and regenerate, a need that arises frequently in
data analysis.
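
A minimal sketch of such an automated step is shown below: a function builds the figure and ggsave() writes it to disk at a fixed size and resolution, so the file can be regenerated whenever the data or the code changes. The helper name make_figure(), the data, and the output location are all hypothetical.

```r
library(ggplot2)

# Hypothetical helper: builds the figure from whatever data it is given
make_figure <- function(data) {
  ggplot(data, aes(x = category, y = value)) +
    geom_col() +
    theme_minimal()
}

# Hypothetical data standing in for the output of an analysis step
df <- data.frame(category = c("a", "b", "c"), value = c(2, 4, 3))

# Write the figure to a file with explicit dimensions (in inches) and dpi
out_file <- file.path(tempdir(), "figure.png")
ggsave(out_file, plot = make_figure(df), width = 6, height = 4, dpi = 300)
```

Fixing the width, height, and dpi in code keeps the output reproducible and avoids resizing the figure by hand afterwards.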

Additional reading: Ch. 28 & 29 in Fundamentals of Data
Visualization (via Canvas).

ggplot2 and extensions

ggplot2 is a system for declaratively creating graphics, included in
the tidyverse.
▶ Besides the functions we covered in lecture, explore more at
https://ggplot2-book.org/
▶ Also, the ggplot2 extensions:
https://exts.ggplot2.tidyverse.org/gallery/
