Introduction to Data Science and Engineering
- Data visualization
RB Luo / 羅銳邦
Department of Computer Science
University of Hong Kong
Some materials courtesy of
Rafael A. Irizarry, and are modified
from the original version.
dplyr was updated from 1.0.10 to 1.1.0 on Jan 31st,
2023, which caused some behavioral differences from
the course materials
Question contributed by classmate Nayoung Lim (class 2023)
"""
As I was trying the examples from your lecture note, I spotted an unexpected error.
The error message says that multiple summaries (returning more than 1 row) are
deprecated in dplyr version 1.1.0. I would like you to confirm (1) whether multiple summaries
are completely replaced and no longer usable, and if so, (2) how to use reframe()
instead to generate the same result as the example, as suggested by the error message.
P.S) The code I was trying is from ‘Multiple summaries’ chapter, pg.18 from Lec04-
tidyverse.
"""
My reply
"""
Very good question!
To give you the answer first, please use the following code to install the previous version
of tidyverse, which contains the previous version (1.0.10) of dplyr. When installing, it may ask
you a few questions; just select "update all" should there be such a question.
install.packages("remotes")   # remotes provides install_version(); devtools also works, since it depends on remotes
require(remotes)
install_version("tidyverse", version = "1.3.2", repos = "http://cran.r-project.org")
tidyverse has a very active community that is not conservative about obsoleting old
ways when there are better ones. But as a coder, and especially if you wish the analysis
pipeline you wrote to be usable by others with the same expectations you had on your
side, an important skill is to pin the versions of the libraries that you depend on.
In your case, you got a warning, not an error. And after the warning, check temp: it contains
what you need. The warning message only tells you that returning multiple rows from
summarize is deprecated in the new version. It has not been removed, and the behavior will
last for many subsequent versions until the developers can make sure that removing it
won't cause serious problems to the libraries that depend on it.
"""
Why visualize?
Look at the table for 10 seconds, what do you see?
Look at a sensible visualization of the table for 10 seconds, what do you see?
How quickly can you determine which states have the largest populations? Which states have the smallest?
How large is a typical state? Is there a relationship between population size and total murders? How do
murder rates vary across regions of the country?
Visualization in the early days
John Snow Cholera map
https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples
Influential visualization nowadays
https://coronavirus.jhu.edu/map.html
This is an example of how data
visualization can lead to discoveries
which would otherwise be missed if we
simply subjected the data to a battery
of data analysis tools or procedures.
Data visualization is the strongest tool
of what we call exploratory data
analysis (EDA). John W. Tukey,
considered the father of EDA, once
said,
“The greatest value of a picture is
when it forces us to notice what we
never expected to see.”
Many widely used data analysis tools
were initiated by discoveries made via
EDA. EDA is perhaps the most
important part of data analysis, yet it is
one that is often overlooked.
https://en.wikipedia.org/wiki/Demographics_of_Hong_Kong
Animation
(Moving targets are more visible)
Hans Rosling, The best stats you've ever seen, TED2006
https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen
Visualization prevents garbage in,
garbage out
It is also important to note that mistakes, biases, systematic errors
and other unexpected problems often lead to data that should be
handled with care. Failure to discover these problems can give rise
to flawed analyses and false discoveries. As an example, consider
that measurement devices sometimes fail and that most data
analysis procedures are not designed to detect these. Yet these data
analysis procedures will still give you an answer. The fact that it can
be difficult or impossible to notice an error just from the reported
results makes data visualization particularly important.
Data visualization
• We will use the ggplot2 package to code
Load ggplot2 on its own, or simply load the whole tidyverse, which includes it
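A minimal sketch of the two loading options (assuming the packages are installed):
library(ggplot2)     # load ggplot2 only
# or simply
library(tidyverse)   # loads ggplot2 together with dplyr, tidyr, readr, ...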
• Facts
• gg means "grammar of graphics"
• By learning a handful of ggplot2 building blocks and its grammar,
you will be able to create hundreds of different plots
• Assuming that our data is tidy, ggplot2 simplifies plotting code and
the learning of grammar for a variety of plots
• There are other libraries in R that also create graphics
• Such as grid and lattice
• ggplot2 generates a plot with multiple layers, allowing adding
components or elements to a figure one at a time
• ggplot2 functions and arguments are hard to memorize, so we recommend
you keep the ggplot2 cheat sheet handy
• https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
• Search for “ggplot2 cheat sheet”.
The components of a graph
• The first step in learning ggplot2 is to be able to break a graph
apart into components. Let’s break down the plot example and
introduce some of the ggplot2 terminology. The main three
components to note are
• Data: The US murders data table is being summarized. We refer to
this as the data component
• Geometry: The plot above is a scatterplot. This is referred to as the
geometry component. Other possible geometries are barplot,
histogram, smooth densities, qqplot, and boxplot. We will learn
more about these in the Data Visualization part of the book
• Aesthetic mapping: The plot uses several visual cues to represent
the information provided by the dataset. The two most important
cues in this plot are the point positions on the x-axis and y-axis,
which represent population size and the total number of murders,
respectively. Each point represents a different observation, and we
map data about these observations to visual cues like x- and y-
scale. Color is another visual cue that we map to region. We refer
to this as the aesthetic mapping component. How we define the
mapping depends on what geometry we are using
• Other components include
• Coordinate system, position adjustment, labels and legend,
theme, facets
The components of a graph
• We also note that
• The points are labeled with the state abbreviations
• The range of the x-axis and y-axis appears to be defined by
the range of the data. They are both on log-scales
• There are labels, a title, a legend, and we use the style of
The Economist magazine
• Now we construct the plot piece by piece
• We start by loading the dataset
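A minimal sketch of this setup step, assuming the US murders data comes from the dslabs package as in earlier lectures:
library(tidyverse)
library(dslabs)
data(murders)   # the dataset summarized in the plot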
ggplot objects
• The first step in creating a ggplot2 graph is
to define a ggplot object. We do this with
the function ggplot, which initializes the
graph. If we read the help file for this
function, we see that the first argument is
used to specify what data is associated
with this object, thus the pipe can be used
• It renders a plot, in this case a blank slate
since no geometry has been defined. The
only style choice we see is a grey
background
• If we assign the ggplot object to a variable,
evaluate or print that variable to show the plot
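A minimal sketch of the two equivalent ways to define the object (murders loaded as above):
p <- ggplot(data = murders)
# or, using the pipe
p <- murders |> ggplot()
p   # evaluating/printing p renders a blank slate with a grey background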
Geometries
• In ggplot2 we create graphs by adding layers. Layers can define geometries,
compute summary statistics, define what scales to use, or even change styles. To
add layers, we use the symbol +. In general, a line of code will look like this:
DATA |> ggplot() + LAYER 1 + LAYER 2 + … + LAYER N
Available geometries
See these examples? Give them a try
https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
Aesthetic mappings
Aesthetic mappings describe how properties of the data connect with features of the graph,
such as distance along an axis, size, or color. The aes function connects data with what we see on
the graph by defining aesthetic mappings and will be one of the functions you use most often
when plotting. The outcome of the aes function is often used as the argument of a geometry
function. This example produces a scatterplot of total murders versus population in millions
We can drop the x = and y = if we wanted to since these are the first and second expected arguments.
Instead of defining our plot from scratch, we can also add a layer to the p object that was defined before
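A minimal sketch of the scatterplot described above (murders and p as defined earlier):
murders |> ggplot() +
  geom_point(aes(x = population/10^6, y = total))
# x = and y = can be dropped; we can also add the layer to the p object defined before
p + geom_point(aes(population/10^6, total))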
Aesthetic mappings
• The scale and labels are defined by default
• aes also uses the variable names from the
object component: we can use population and
total without having to call them as
murders$population and murders$total
• The behavior of recognizing the variables
from the data component is quite specific
to aes. With most functions, if you try to
access the values of population or total
outside of aes you receive an error
Layers
• A second layer in the plot we wish to make
involves adding a label to each point to
identify the state. The geom_label and
geom_text functions permit us to add text to
the plot with and without a rectangle behind
the text, respectively
• Because each point (each state in this case)
has a label, we need an aesthetic mapping to
make the connection between points and
labels. The code looks like this
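A minimal sketch of the label layer, assuming the abb column holds the state abbreviations:
p + geom_point(aes(population/10^6, total)) +
  geom_text(aes(population/10^6, total, label = abb))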
Tinkering with arguments
Each geometry function has many
arguments other than aes and data.
They tend to be specific to the
function. For example, in the plot we
wish to make, the points are larger
than the default size. In the help file
we see that size is an aesthetic and we
can do this
size is not associated with individual observations; it applies
to all data points
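A minimal sketch with a larger point size (the value 3 is illustrative):
p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb))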
Nudge the label
Now because the points are larger it is
hard to see the labels. We can use the
nudge_x argument, which moves the
text slightly to the right or to the left
Like size, nudge_x is not associated with individual observations; it
applies to all labels
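A minimal sketch of the nudge (the value 1.5 is illustrative):
p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb), nudge_x = 1.5)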
Use global instead of local aesthetic mappings
In the previous example, we defined the mapping aes(population/10^6, total) twice,
once in each geometry. We can avoid this by using a global aesthetic mapping
The global mapping applies to both geom_point and geom_text;
geom_point does not use the label aesthetic, so the label is
simply ignored there
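A minimal sketch of the global mapping described above (note that p is redefined here):
p <- murders |> ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 1.5)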
Use both global and local aesthetic mappings
If necessary, we can override the global mapping
by defining a new mapping within each layer.
These local definitions override the global
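A sketch of a local override, following the style of the textbook example (the text and its position are illustrative):
p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))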
Scales
Our desired scales are in log-scale. This is not the
default, so this change needs to be added through
a scales layer. The scale_x_continuous function
lets us control the behavior of scales
Because we are in the log-scale now, the nudge
must be made smaller
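A minimal sketch of the scale layers (the smaller nudge value is illustrative):
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")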
Reference: other available controls are in the cheatsheet
https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
Scales
This particular transformation is so common that
ggplot2 provides the specialized functions
scale_x_log10 and scale_y_log10
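A sketch of the same layers using the specialized functions:
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10()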
Labels and titles
Plots without labels and titles are like babies
without names; it shouldn't happen
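A minimal sketch using labs (the label wording is illustrative):
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)",
       y = "Total number of murders (log scale)",
       title = "US Gun Murders in 2010")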
Labels and titles alternatives
Some prefer using xlab, ylab, ggtitle instead of a
single labs for better "layerization"
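A sketch of the same labels expressed as separate layers:
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")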
Categories as colors
We can change the color of the points using the
col argument in the geom_point function
This is not what we want, but just a
demonstration of how to change a layer to a
single color
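A sketch following the textbook's flow: redefine p as everything except the points layer, then add points in a single color (the color is illustrative):
p <- murders |> ggplot(aes(population/10^6, total, label = abb)) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
p + geom_point(size = 3, color = "blue")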
Categories as colors
A nice default behavior of ggplot2 is that if we
assign a categorical variable to color, it
automatically assigns a different color to each
category and also adds a legend
Since the choice of color is determined by a
feature of each observation, this is an aesthetic
mapping. To map each point to a color, we need
to use aes
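A minimal sketch of the mapping (p as redefined above):
p + geom_point(aes(col = region), size = 3)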
Categories as colors
These global aes are used in all following geometries.
While col = region is specific to this geometry,
aes needs to be the first argument
ggplot2 automatically adds a
legend that maps color to region.
To avoid adding this legend we set
the geom_point argument
show.legend = FALSE
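A sketch of the same layer with the legend suppressed:
p + geom_point(aes(col = region), size = 3, show.legend = FALSE)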
Annotation, shapes, and adjustments
We often want to add shapes or annotation to
figures that are not derived directly from the
aesthetic mapping like labels, boxes, shaded
areas, and lines
Here we want to add a line that represents the
average murder rate r for the entire country.
When using normal scale, the line is defined as y =
rx, where r is the slope. In the log-scale, the line
turns into log(y) = log(r) + log(x), where slope is 1
and intercept is log(r)
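A minimal sketch: compute the country-wide rate r and add the line (population on the x-axis is in millions, hence the 10^6; the slope defaults to 1):
r <- murders |>
  summarize(rate = sum(total) / sum(population) * 10^6) |>
  pull(rate)
p + geom_point(aes(col = region), size = 3) +
  geom_abline(intercept = log10(r))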
Use a dashed line
Use a dashed line or another line type (lty), and a
lighter color, for a trend line or other
annotations
geom_abline (ab stands for y=ax+b) is not in the
cheatsheet, so make your own cheatsheet
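A sketch of the tweaked version; the line is added first so the points are drawn on top, and p is updated to keep the change (line type and color are illustrative):
p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3)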
Small tweaks
"region" is from the name of the column in the
original table, how to change it back to "Region"?
Well I myself like lowercasing everything (you
know what I am talking about if you like Taylor
Swift), but please make sure your style adds value
to you instead of causing you trouble. In a formal
working environment, we should still change
"region" to "Region"
Small tweaks
It's working
Asking ChatGPT gives an alternative that also works
Using theme
The power of ggplot2 is augmented further due to
the availability of add-on packages
The style of a ggplot2 graph can be changed using
the theme functions. Several themes are included
as part of the ggplot2 package
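A minimal sketch, assuming the ggthemes add-on package, which provides the Economist style referred to earlier:
library(ggthemes)
p + theme_economist()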
Using theme
A complete list of themes in ggthemes:
https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
When there is a will, there is a way,
in R at least
Problem:
The labels are overlapping
Question:
How to rearrange the labels
in R so they don't overlap?
Use geom_text_repel instead of geom_text
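A minimal sketch, assuming the ggrepel add-on package:
library(ggrepel)
# swap geom_text(nudge_x = ...) for geom_text_repel(), which pushes overlapping labels apart
murders |> ggplot(aes(population/10^6, total, label = abb)) +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10()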
Putting it all together
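A sketch of the full plot assembled from the pieces above (label wording, sizes, and styles are illustrative):
library(ggthemes)
library(ggrepel)

r <- murders |>
  summarize(rate = sum(total) / sum(population) * 10^6) |>
  pull(rate)

murders |>
  ggplot(aes(population/10^6, total, label = abb)) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_economist()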
Grids of plots
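One common approach to arranging several plots side by side, assuming the gridExtra add-on package (the plots themselves are illustrative):
library(gridExtra)
p1 <- murders |> ggplot(aes(population/10^6)) + geom_histogram(binwidth = 5)
p2 <- murders |> ggplot(aes(population/10^6, total)) + geom_point()
grid.arrange(p1, p2, ncol = 2)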
Brief summary
• In the previous slides, we learned to use ggplot2, but with
just one example
• The functions and usages are not for you to memorize,
you will rely on the cheatsheet and Google heavily. Later
with more plots generated, you will have a collection of
working code for many types and styles of plots, which
speeds you up
• The final exam won’t demand a perfectly working copy of
code. Writing code from scratch is minimized in the final
exam. I care about correct understanding and good ideas
• Next, let’s learn visualizing data distributions
Visualizing data distributions
The median salary in Hong Kong in 2020 was HKD 20k for
males and HKD 16k for females
Compared to the information above, what can you
additionally obtain from the plot (split by M/F) on the
left?
The most basic statistical summary of a list of
objects or numbers is its distribution. Once the
data has been summarized as a distribution,
there are several data visualization
techniques to effectively relay this
information
In the following slides, we will discuss
properties of a variety of distributions and
how to visualize distributions using a
motivating example of student heights
(Source: Census and Statistics Department of Hong Kong;
* foreign domestic workers excluded)
Variable types
• The two main variable types are categorical/numeric
• A categorical variable can be ordinal or nominal
• Ordinal: {very bad, bad, neutral, good, very good}
• Nominal: {Northeast, South, North Central, West}
• A numerical variable can be discrete or continuous
• Discrete: takes whole-number (countable) values, e.g., population size
• Continuous: needs decimals for precision, e.g., heights
We will use both the murders and heights datasets for following examples
Here are some details about the heights dataset
In case you want to work on metric units
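A minimal sketch of inspecting the heights data and converting to metric units, assuming the dslabs heights dataset with height in inches:
library(dslabs)
data(heights)
head(heights)                                  # columns: sex, height (inches)
heights |> mutate(height_cm = height * 2.54)   # in case you want metric units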
A simple count of categorical data using barplot
Distribution of categorical data
With categorical data, the proportion of each unique category
describes the distribution
n is produced by count()
Use geom_col instead of geom_bar, check
https://stackoverflow.com/a/59009108
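A minimal sketch of both versions (counts with geom_bar, precomputed proportions with geom_col):
# simple count of a categorical variable
heights |> ggplot(aes(sex)) + geom_bar()

# distribution as proportions: count() produces the n column
heights |>
  count(sex) |>
  mutate(proportion = n / sum(n)) |>
  ggplot(aes(sex, proportion)) +
  geom_col()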
Easy?
The two examples show the number or
proportion of each category. We
usually use barplots to display a few
numbers. Although they do not
provide much more insight than a
frequency table itself, it is a first
example of how we convert a vector
into a plot that succinctly summarizes
all the information in the vector. When
the data is numerical, the task of
displaying distributions is more
challenging
Histogram
• Numerical data that are not categorical also have distributions.
However, in general, when data is not categorical, reporting the
frequency of each entry, as we did for categorical data using barplot,
is not an effective summary since most entries are unique. For
example, in our case study, while several students reported a height of
68 inches, only one student reported a height of 68.50393 inches and
only one student reported a height 68.89763 inches
• A histogram divides the span of the data into non-overlapping bins of the
same size. Then, for each bin, we count the number of values that fall
in that interval. The histogram plots these counts as bars with the
base of each bar defined by its interval
Histogram
Here is the histogram for the height data splitting the range of values
into one-inch intervals:
(49.5,50.5],(50.5,51.5],(51.5,52.5],(52.5,53.5],...,(82.5,83.5]
As you can see in the figure above, a histogram is similar to a barplot, but it differs in that
the x-axis is numerical, not categorical
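A minimal sketch of the one-inch-bin histogram (the colors are illustrative):
heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_histogram(binwidth = 1, fill = "blue", col = "black")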
There is a reason that we don’t see the world in B/W
Color is another language
https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/
What’s the best binwidth?
Histogram
From the histogram, we can immediately learn some
important properties about our data
• The range of the data is from 50 to 84 with the majority
(more than 95%) between 63 and 75 inches
• The heights are close to symmetric around 69 inches
What information do we lose? Note that all values in each
interval are treated the same when computing bin heights.
So, for example, the histogram does not distinguish
between 64, 64.1, and 64.2 inches. Given that these
differences are almost unnoticeable to the eye, the
practical implications are negligible and we were able to
summarize the data to just 23 numbers
Smoothed density
Smooth density plots relay the same information as a histogram but are
aesthetically more appealing
In this plot, we no longer have sharp edges at the interval boundaries and many of the local peaks have been
removed. Also, the scale of the y-axis changed from counts to density. To fully understand smooth densities, we
have to understand estimates, a topic we don't cover until later. Here we simply describe them as making the
histograms prettier by drawing a curve that goes through the top of the histogram bars and then removing the
bars. The values shown on the y-axis are chosen so that the area under the curve adds up to 1. This implies that
for any interval, the area under the curve for that interval gives us an approximation of what proportion of the
data is in the interval
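A minimal sketch of the smooth density version (the fill color is illustrative):
heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_density(fill = "blue")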
Smoothed density
An advantage of smooth densities over histograms for visualization
purposes is that densities make it easier to compare two distributions. This
is in large part because the jagged edges of the histogram add clutter. Here
is an example comparing male and female heights
Line color, fill color, and transparency
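A minimal sketch of the comparison, with line color (col) and fill color (fill) mapped to sex and the transparency set by alpha:
heights |>
  ggplot(aes(height, col = sex, fill = sex)) +
  geom_density(alpha = 0.2)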
Smoothed density
Normal distribution
Histograms and density plots provide excellent summaries of a distribution. But can we
summarize even further? We often see the average and standard deviation (just two
numbers) used as summary statistics. To understand what these summaries are and why
they are so widely used, we need to understand the normal distribution
The normal distribution is also known as the bell curve or the Gaussian distribution.
Here is what the normal distribution looks like
The plot shows a normal distribution with
average 0 and SD 1
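A minimal sketch of drawing the standard normal curve (average 0, SD 1) with ggplot2:
data.frame(x = seq(-4, 4, length.out = 200)) |>
  ggplot(aes(x = x, y = dnorm(x))) +   # dnorm() gives the normal density
  geom_line()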
Normal distribution
The normal distribution is one of the most famous mathematical concepts in history. A reason for this is
that the distribution of many datasets can be approximated with normal distributions. These include
gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental
measurement errors. Later in "Introduction to statistics with R", we will learn why. But how can the same
distribution approximate datasets with completely different ranges of values, for example heights and
weights?
The normal distribution can be adapted to different datasets by adjusting just two numbers, referred to as
the average (or mean) and the standard deviation (SD)
Note that because only two numbers are needed to adapt the normal distribution to a dataset, if our data
distribution can be approximated by a normal distribution, all the information needed to describe the
distribution can be encoded in just those two numbers
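A minimal sketch of computing the two numbers for male heights (dslabs heights assumed):
heights |>
  filter(sex == "Male") |>
  summarize(average = mean(height), SD = sd(height))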
Normal distribution
To investigate whether the heights dataset follows a normal distribution, what can we do with
visualization? Overlay the heights distribution with a normal distribution
The distribution of heights seems to be a normal
distribution, so what?
We will learn more later in the statistics section
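A sketch of the overlay, using the normal density with the sample's own average and SD as the dashed curve (a hypothetical construction, not necessarily the slide's exact code):
params <- heights |>
  filter(sex == "Male") |>
  summarize(mean = mean(height), sd = sd(height))

heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_density(fill = "blue", alpha = 0.3) +
  stat_function(fun = dnorm, lty = 2,
                args = list(mean = params$mean, sd = params$sd))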
Boxplot
To understand boxplots, we need to define some terms that are
commonly used in exploratory data analysis
The percentiles are the values for which p = 0.01, 0.02, ..., 0.99 of
the data are less than or equal to that value, respectively. We call,
for example, the case of p = 0.10 the 10th percentile, which gives us
a number below which 10% of the data falls. The most famous
percentile is the 50th, also known as the median. Other special
cases that receive names are the quartiles, obtained when setting
p = 0.25, 0.50, and 0.75, which are used by the boxplot
To motivate boxplots, we will go back to the US murder data.
Suppose we want to summarize the murder rate distribution. Using
the data visualization skills we have learned, we can quickly see that
the normal approximation does not apply here; more information is
needed to describe the distribution of the data
Boxplot
The boxplot provides a five-number summary composed of the range along with the quartiles (the 25th,
50th, and 75th percentiles). The R implementation of boxplots ignores outliers when computing the
range and instead plots them as independent points. The boxplot shows these numbers as a "box" with
"whiskers"
How to define outliers?
Interquartile range (IQR), also called the middle 50%, is a measure of statistical dispersion, equal
to the difference between the 75th and 25th percentiles.
Outliers here are defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
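A minimal sketch of a boxplot of the murder rate (the empty x value is just a placeholder label):
murders |>
  mutate(rate = total / population * 10^5) |>
  ggplot(aes(x = "", y = rate)) +
  geom_boxplot()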
Boxplot
From the boxplot, we know that the
distribution is not symmetric, and
there are a few outliers
Boxplot
Q-Q plot
A Q-Q plot, short for “quantile-quantile” plot, is used to
assess whether or not a set of data potentially came from
some theoretical distribution
In most cases, this type of plot is used to determine
whether or not a set of data follows a normal distribution
If the data is normally distributed, the points in a Q-Q plot
will lie on a straight diagonal line
Q-Q Plots Explained
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0
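A minimal sketch of a Q-Q plot against the standard normal, using scaled (standard-unit) heights:
heights |>
  filter(sex == "Male") |>
  ggplot(aes(sample = scale(height))) +
  geom_qq() +
  geom_abline()   # identity line; points near it suggest approximate normality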
Stratification
In the previous examples, the murders dataset was grouped by regions, the heights
dataset was grouped by gender
In data analysis we often divide observations into groups based on the values of
one or more variables associated with those observations. We call this procedure
stratification and refer to the resulting groups as strata. Stratification is common in
data visualization because we are often interested in how the distribution of
variables differs across different subgroups
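A minimal sketch of a stratified comparison, grouping murder rates by region:
murders |>
  mutate(rate = total / population * 10^5) |>
  ggplot(aes(region, rate, fill = region)) +
  geom_boxplot()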
So much for visualizing data distributions
Next, we will work on some more examples, and use more
ggplot geometries and tricks