Combined
Course Overview
• Course Overview
• Course objectives
• Course Expectations
• Introduction to ggplot2
• Objectives
• Going beyond Excel
• Why ggplot2?
• Getting help
• Geom functions
• Colors
• Facets
• Resource list
• Acknowledgements
• Objectives
• The Data
• Scatter plots
• Basic Scatter
• Coordinate Systems
• Perform and plot PCA data using iris.
• What is PCA?
• Perform PCA
• Plot PCA
• Plot customization
• Themes
• Modifying legends
• References
• Libraries
• The Data
• Variable Types
• Bar Plot
• stat = count
• stat = identity
• Using factors
• Histogram
• Box plot
• What is a heatmap?
• What is a dendrogram?
• Applications of dendrograms
• Import data
• Distance calculation
• Cluster generation
• Scaling
• Objectives
• Acknowledgements
Getting the Data
Practice Questions
• Question 1
• Question 2
• Question 3
• Question 4
• Question 5
• Question 6
• Question 1
• Question 2
• Question 3
• Question 4
• Question 5
• Question 6
Practice plotting using ggplot2: Lesson 3
• Question 1
• Question 2
• Question 3
• Question 4
• Question 5
Lesson 4: Stat Transformations: Bar plots, box plots, and histograms
• Load the mtcars dataset using the code below. This is a dataset that comes with R.
• Question 1
• Question 2
• Question 3
• Question 4
• Question 1
• Question 2
• Question 3
• Question 4
• Question 5
Additional Resources
Course Overview
Course objectives
1. Learn how to generate basic plot types in ggplot2
2. Understand how basic plot types can be customized to generate more complex plots
Note
While this course may be useful to learners with intermediate R experience who would like to learn more regarding
ggplot2, the pace of the course will be set assuming a beginner level of understanding.
Course Expectations
This course will include a series of six, 1-1.25 hour lessons over a period of three weeks. Each
lesson will be followed by a 45 minute help session in which students can ask questions and /
or get individual help with their data.
This series will start with lesson 2. In the help session afterwards we will help those having
trouble with DNAnexus accounts.
Introduction to ggplot2
Objectives
1. Learn how to import spreadsheet data.
By the end of the course, students should be able to create simple, pretty, and effective figures.
#View data
data
## # A tibble: 8 × 4
## `Sample Name` Treatment `Number of Transcripts` `Total Counts`
## <chr> <chr> <dbl> <dbl>
## 1 GSM1275863 Dexamethasone 10768 18783120
## 2 GSM1275867 Dexamethasone 10051 15144524
These data include total transcript read counts summed by sample and the total number of
transcripts recovered by sample that had at least 100 reads. These data derive from a bulk
RNAseq experiment described by Himes et al. (2014) (https://pubmed.ncbi.nlm.nih.gov/24926665/).
In the experiment, the authors "characterized transcriptomic changes in four
primary human ASM cell lines that were treated with dexamethasone," a common therapy for
asthma. Each cell line included a treated and untreated negative control resulting in a total
sample size of 8.
Spaces cause problems for data wrangling in R, but we can change our load parameters to
repair our column names.
## New names:
## • `Sample Name` -> `Sample.Name`
## • `Number of Transcripts` -> `Number.of.Transcripts`
## • `Total Counts` -> `Total.Counts`
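The renaming shown above follows R's rules for syntactic names. As a minimal sketch (not necessarily the loading code used in this lesson), base R's make.names() applies the same kind of repair to the three original column names:

```r
# Original column names from the spreadsheet (they contain spaces)
old_names <- c("Sample Name", "Number of Transcripts", "Total Counts")

# make.names() replaces characters that are not valid in R names with "."
new_names <- make.names(old_names)
new_names
```

Some import functions apply this repair for you; read.csv(), for example, does so by default through its check.names argument.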
We could plot this data in Excel. If we did, we would get something like this:
This isn't too bad, but it took an unnecessary amount of time, and there weren't a lot of options
for customization.
RECOMMENDATION
You should save metadata or other tabular data as either comma separated files (.csv) or tab-delimited files (.txt,
.tsv). Using these file extensions will make it easier to use the data with bioinformatic programs. There are multiple
functions available to read in delimited data in R. We will see a few of these over the next few weeks.
Why ggplot2?
Outside of base R plotting, one of the most popular packages used to generate graphics in R is
ggplot2, which is associated with a family of packages collectively known as the tidyverse.
ggplot2 allows the user to create informative plots quickly by using a 'grammar of graphics'
implementation, which is described as "a coherent system for describing and building graphs"
(R4DS). We will see this in action shortly. The power of this package is that plots are built in
layers, and a few changes to the code can result in very different outcomes. This makes it easy to
reuse parts of the code for very different figures. ggplot2 is incredibly versatile and can create
most types of plots, especially when you consider the numerous extension packages
(https://exts.ggplot2.tidyverse.org/gallery/) that further extend its capabilities.
Being a part of the tidyverse collection, ggplot2 works best with data organized so that
individual observations are in rows and variables are in columns ("tidy data"
(https://r4ds.had.co.nz/tidy-data.html)).
Getting help
The R community is extensive and getting help is now easier than ever with a simple web
search. If you can't figure out how to plot something, give a quick web search a try. Great
resources include internet tutorials, R bookdowns, and stackoverflow. You should also use the
help features within RStudio to get help on specific functions or to find vignettes. Try entering
ggplot2 in the help search bar in the lower right panel under the Help tab.
Note
Though it was created for ChatGPT, you may find this resource from DataCamp
(https://www.datacamp.com/cheat-sheet/chatgpt-cheat-sheet-data-science) useful for prompting appropriate responses.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The main components include the data we want to plot, geom function(s), and mapping
aesthetics. Notice the + symbol following the ggplot() function. This symbol will precede
each additional layer of code for the plot, and it is important that it is placed at the end of the
line. More on geom functions and mapping aesthetics to come.
What is the relationship between total transcript sums per sample and the number of recovered
transcripts per sample?
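A minimal sketch of a plot that addresses this question, following the template above. The data frame built here is a hypothetical stand-in with made-up values (the lesson's data object is the imported spreadsheet); only the column names match the lesson.

```r
library(ggplot2)

# Hypothetical stand-in for the imported data: the values are made up,
# only the column names match the lesson's data frame
data <- data.frame(
  Sample.Name = paste0("Sample", 1:8),
  Treatment = rep(c("Dexamethasone", "none"), times = 4),
  Number.of.Transcripts = c(10100, 11300, 10400, 11600, 10700, 11900, 10200, 11500),
  Total.Counts = c(15.1, 20.2, 16.4, 21.8, 18.0, 23.5, 15.6, 21.0) * 1e6
)

# Scatter plot: one point per sample
scatter <- ggplot(data = data) +
  geom_point(mapping = aes(x = Number.of.Transcripts, y = Total.Counts))
scatter
```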
We can easily see that there is a relationship between the number of transcripts per sample and
the total transcripts recovered per sample. ggplot2 default parameters are great for
exploratory data analysis. But, with only a few tweaks, we can make some beautiful, publishable
figures.
The data we called was from the data frame data, which we created above. Next, we provided
a geom function (geom_point()), which created a scatter plot. This scatter plot required
mapping information, which we provided for the x and y axes. More on this in a moment.
Geom functions
A geom is the geometrical object that a plot uses to represent data. People often
describe plots by the type of geom that the plot uses. --- R4DS
(https://r4ds.had.co.nz/data-visualisation.html#geometric-objects)
There are multiple geom functions that change the basic plot type or the plot representation. We
can create scatter plots (geom_point()), line plots (geom_line(), geom_path()), bar plots
(geom_bar(), geom_col()), lines fitted to data (geom_smooth()), heat maps
(geom_tile()), geographic maps (geom_polygon()), etc.
ggplot2 provides over 40 geoms, and extension packages provide even more (see
https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). The best way to get a
comprehensive overview is the ggplot2 cheatsheet, which you can find at
http://rstudio.com/resources/cheatsheets. --- R4DS
(https://r4ds.had.co.nz/data-visualisation.html)
You can also see a number of options pop up when you type geom into the console, or you can
look up the ggplot2 documentation in the help tab.
We can see how easy it is to change the way the data is plotted. Let's plot the same data using
geom_line().
ggplot(data=data) +
geom_line(aes(x=Number.of.Transcripts, y = Total.Counts))
The geom functions require a mapping argument. The mapping argument includes the aes()
function, which "describes how variables in the data are mapped to visual properties
(aesthetics) of geoms" (ggplot2 R Documentation). If not included it will be inherited from the
ggplot() function.
Let's return to our plot above. Is there a relationship between treatment ("dex") and the number
of transcripts or total counts?
There is potentially a relationship. ASM cells treated with dexamethasone in general have lower
total numbers of transcripts and lower total counts.
Notice how we changed the color of our points to represent a variable, in this case 'Treatment'. To do this,
we set color equal to 'Treatment' within the aes() function. This mapped our aesthetic, color, to
a variable we were interested in exploring. Aesthetics that are not mapped to our variables are
placed outside of the aes() function. These aesthetics are manually assigned and do not
undergo the same scaling process as those within aes().
For example
#use the color purple across all points (NOT mapped to a variable)
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
shape=Treatment), color="purple")
We can also see from this that 'Treatment' could be mapped to other aesthetics. In the above
example, we see it mapped to shape rather than color. By default, ggplot2 will only map six
shapes at a time, and if your number of categories goes beyond 6, the remaining groups will go
unmapped. This is by design because it is hard to discriminate between more than six shapes
at any given moment. This is a clue from ggplot2 that you should choose a different aesthetic to
map to your variable. However, if you choose to ignore this functionality, you can manually
assign more than six shapes (https://r-graphics.org/recipe-scatter-shapes).
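As a minimal sketch of that manual override, the example below uses the mpg data set that ships with ggplot2 (not the lesson's data), whose class column has seven categories. The shape codes 0 to 6 are an arbitrary choice for illustration.

```r
library(ggplot2)

# mpg$class has seven categories, one more than the default shape palette
n_classes <- length(unique(mpg$class))

# Supply one shape code per category so that no group goes unmapped
shape_plot <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:(n_classes - 1))
shape_plot
```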
We could have just as easily mapped it to alpha, which adds a gradient to the point visibility by
category, or we could map it to size. There are multiple options, so feel free to explore a little
with your plots.
The assignment of color, shape, or alpha to our variable was automatic, with a unique aesthetic level representing
each category (i.e., 'Dexamethasone', 'none') within our variable. You will also notice that ggplot2 automatically
created a legend to explain the levels of the aesthetic mapped. We can change aesthetic parameters - what colors
are used, for example - by adding additional layers to the plot.
scatter_plot<-ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
color=Treatment))
scatter_plot
We can add additional layers directly to our object. We will see how this works by defining some
colors for our 'Treatment' variable.
Colors
ggplot2 will automatically assign colors to the categories in our data. Colors are assigned to
the fill and color aesthetics in aes(). We can change the default colors by providing an
additional layer to our figure. To change the color, we use the scale_color functions:
scale_color_manual(), scale_color_brewer(), scale_color_grey(), etc. We can
also change the name of the color labels in the legend using the labels argument of these
functions:
scatter_plot +
scale_color_manual(values=c("red","black"),
labels=c('treated','untreated'))
scatter_plot +
scale_color_grey()
scatter_plot +
scale_color_brewer(palette = "Paired")
Similarly, if we want to change the fill, we would use the scale_fill options. To apply
scale_fill to shape, we will have to alter the shapes, as only some shapes take a fill
argument.
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,fill=Treatment),
shape=21,size=2) + #increase size and change points
scale_fill_manual(values=c("purple", "yellow"))
There are a number of ways to specify the color argument including by name, number, and hex
code. Here (https://www.r-graph-gallery.com/ggplot2-color.html) is a great resource from the R
Graph Gallery (https://www.r-graph-gallery.com/index.html) for assigning colors in R.
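The short base R sketch below illustrates that a color name and a hex code can refer to the same color: col2rgb() converts either form to red, green, and blue values. Numbers index into the current palette(), whose contents depend on the R version, so the result of that lookup is printed but not assumed.

```r
# The named color "purple" and the hex code "#A020F0" describe the same color
by_name <- col2rgb("purple")
by_hex  <- col2rgb("#A020F0")
same_color <- all(by_name == by_hex)
same_color

# A number picks a color from the current palette
by_number <- palette()[2]
by_number
```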
There are also a number of complementary packages in R that expand our color options. One
of my favorites is viridis, which provides colorblind friendly palettes. randomcoloR is a
great package if you need a large number of unique colors.
Paletteer contains a comprehensive set of color palettes, if you want to load the palettes
from multiple packages all at once. See the Github page (https://github.com/EmilHvitfeldt/
paletteer) for details.
Facets
A way to add variables to a plot beyond mapping them to an aesthetic is to use facets or
subplots. There are two primary functions to add facets, facet_wrap() and facet_grid().
If faceting by a single variable, use facet_wrap(). If multiple variables, use facet_grid().
The first argument of either function is a formula, with variables separated by a ~ (See below).
Variables must be discrete (not continuous).
#plot (scale_fill_viridis() is provided by the viridis package)
library(viridis)
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,fill=Sample.Name),
shape=21,size=2) + #increase size and change points
scale_fill_viridis(discrete=TRUE, option="viridis") +
facet_wrap(~Treatment)
Note the help options with ?facet_wrap(). How would we make our plot facets vertical rather
than horizontal?
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,fill=Sample.Name),
shape=21,size=2) + #increase size and change points
scale_fill_viridis(discrete=TRUE, option="viridis") +
facet_wrap(~Treatment, ncol=1)
Be sure to take a look at facet_grid(). facet_grid() would allow us to map even more
variables in our data.
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,fill=Sample.Name),
shape=21,size=2) + #increase size and change points
scale_fill_viridis(discrete=TRUE, option="viridis") +
facet_grid(Sample.Name~Treatment)
data("Titanic")
Titanic <- as.data.frame(Titanic)
head(Titanic)
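One way to put facet_grid() to work on these data is sketched below. The choice of variables for the axes, fill, and facets is only an illustration; the columns Class, Sex, Age, Survived, and Freq come from converting the built-in Titanic table to a data frame.

```r
library(ggplot2)

# Convert the built-in 4-way contingency table to a data frame
titanic_df <- as.data.frame(Titanic)

# Bars show passenger counts by class; rows are faceted by Sex, columns by Age
titanic_plot <- ggplot(data = titanic_df) +
  geom_col(aes(x = Class, y = Freq, fill = Survived)) +
  facet_grid(Sex ~ Age)
titanic_plot
```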
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
) +
<FACET_FUNCTION>
Note that there are a lot of invisible (default) layers that often go into each ggplot2 plot, and there
are ways to customize these layers. See this chapter
(https://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics) from R for Data
Science for more information on the grammar of graphics.
#to make our code more effective, we can put shared aesthetics in the
#ggplot function
ggplot(data=data, aes(x=Number.of.Transcripts,
y = Total.Counts, color= Treatment)) +
geom_point() +
geom_smooth(method='lm')
Here's a teaser.
ggplot(data=data) +
geom_point(aes(x=Number.of.Transcripts, y = Total.Counts,
fill=Treatment),
shape=21,size=2) +
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
#can change labels of fill levels along with colors
xlab("Recovered transcripts per sample") + #add x label
ylab("Total sequences per sample") +#add y label
guides(fill = guide_legend(title="Treatment")) + #label the legend
scale_y_continuous(trans="log10") + #log transform the y axis
theme_bw()
ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)
Resource list
1. ggplot2 cheatsheet
2. The R Graph Gallery (https://www.r-graph-gallery.com/)
3. The R Graphics Cookbook (https://r-graphics.org/recipe-quick-bar)
4. Ggplot2 extensions (https://exts.ggplot2.tidyverse.org/gallery/)
Acknowledgements
Material from this lesson was adapted from Chapter 3 of R for Data Science
(https://r4ds.had.co.nz/data-visualisation.html) and from a 2021 workshop entitled Introduction to Tidy
Transcriptomics (https://stemangiola.github.io/bioc2021_tidytranscriptomics/articles/tidytranscriptomics.html)
by Maria Doyle and Stefano Mangiola.
Objectives
1. Learn to customize your ggplot with labels, axes, text annotations, and themes.
2. Learn how to make and modify scatter plots to make fairly different overall plot
representations.
The primary purpose of this lesson is to learn how to customize our ggplot2 plots. We will do
this by focusing on different types of scatter plots.
The Data
In this lesson we will use two different sets of data. First, we will use data available with your
base R installation, the iris data set, which is stored in object iris. These data include
measurements from the petals and sepals of different Iris species including Iris setosa,
versicolor, and virginica. See ?iris for more information about these data.
Second, we will use some more complicated bioinformatics data related to the RNAseq project
introduced in Lesson 2.
First, let's load our libraries using the library function, library():
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
Now, let's load the RNA data that we will use toward the end of this lesson. We will use the
function read.delim() to load tab delimited RNASeq data and the function readLines() to
load the list of top genes.
dexp_sigtrnsc<-read.delim("../data/sig_dexp_results.txt",
as.is=TRUE)
topgenes<-readLines("../data/topgenes.txt")
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
) +
<FACET_FUNCTION>
However, there are additional components, highlighted in bold, that can be added to our core
components to enable us to generate even more diverse plot types.
• one or more geometric objects that serve as the visual representations of the
data, – for instance, points, lines, rectangles, contours,
• descriptions of how the variables in the data are mapped to visual properties
(aesthetics) of the geometric objects, and an associated scale (e.g., linear,
logarithmic, rank),
• optional parameters that affect the layout and rendering, such as text size, font
and alignment, legend positions.
We will extend our basic template throughout this lesson as we make a variety of scatter plots
and in lesson 4.
Scatter plots
Scatterplots are useful for visualizing treatment–response comparisons,
associations between variables, or paired data (e.g., a disease biomarker in several
patients before and after treatment). --- Holmes and Huber, 2021
(https://web.stanford.edu/class/bios221/book/03-chap.html)
Because scatter plots involve mapping each data point, the geom function used is
geom_point(). We saw a fairly basic implementation of this in Lesson 2.
Basic Scatter
Let's take another look at a simple scatter plot using the iris data. We can look at the
relationship between petal length and petal width (i.e., variable association) for the various Iris
species.
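A minimal sketch of such a plot is given here; mapping Species to color is an assumption made for illustration.

```r
library(ggplot2)

# Petal length vs. petal width, colored by species
iris_scatter <- ggplot(data = iris) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  coord_fixed() # one unit on x spans the same physical length as one unit on y
iris_scatter
```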
This code should look fairly familiar, with the exception of a new function, coord_fixed(). This
is a modification of the ggplot2 coordinate system.
Coordinate Systems
Coordinate systems are probably the most complicated part of ggplot2. The default
coordinate system is the Cartesian coordinate system where the x and y positions
act independently to determine the location of each point. --- R4DS
(https://r4ds.had.co.nz/data-visualisation.html#coordinate-systems)
coord_fixed() with the default argument ratio=1 ensures that the units are represented
equally in physical space on the plot. Because the x and y measurements were both taken in
centimeters, it is good practice to make sure that the "same mapping of data space to physical
space is used." --- Holmes and Huber, 2021
(https://web.stanford.edu/class/bios221/book/03-chap.html)
You will not need to worry about the coordinate system of your plot in most cases, but it is likely
you will need to modify the coordinate system at some point in the future. Another commonly
used coordinate function is coord_flip(), which allows you to flip the representation of the
plot, for example, by switching bars in a bar plot from vertical to horizontal. See
?coord_flip() for more information.
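As a small sketch of coord_flip(), the example below builds a bar plot from the iris data (one bar per species, chosen only for illustration) and then flips it.

```r
library(ggplot2)

# Vertical bars by default: one bar per species, bar height = number of rows
bars <- ggplot(data = iris) +
  geom_bar(aes(x = Species))

# coord_flip() swaps the x and y axes, so the bars become horizontal
bars_flipped <- bars + coord_flip()
bars_flipped
```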
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
) +
<FACET_FUNCTION> +
<COORDINATE SYSTEM>
What is PCA?
Principal component analysis (PCA) is a linear dimension reduction method applied to
high-dimensional data. The goal of PCA is to reduce the dimensionality of the data by transforming
the data in a way that maximizes the variance explained. Read more here
(https://towardsdatascience.com/principal-component-analysis-pca-79d228eb9d24) and here
(https://www.huber.embl.de/msmb/Chap-Multivariate.html).
Key points:
Note
PCAs are used frequently in -omics fields. However, more often than not, there will be package specific functions for PCA
and plotting PCA for different -omics analyses. Because of this, we will show the main features here using a simpler
data set, iris.
Perform PCA
We can use the function prcomp() to run PCA on the first four columns of the iris data. The
function takes numeric data.
colnames(iris)[1:4]
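The call that creates the pca object is sketched below. Setting scale. = TRUE is an assumption inferred from the scale element in the output that follows; it is consistent with the standard deviations reported there.

```r
# Run PCA on the four numeric columns of iris
# scale. = TRUE standardizes each variable to unit variance before the analysis
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Standard deviations of the four principal components
round(pca$sdev, 3)
```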
#get structure of the pca object
str(pca)
## List of 5
## $ sdev : num [1:4] 1.708 0.956 0.383 0.144
## $ rotation: num [1:4, 1:4] 0.521 -0.269 0.58 0.565 -0.377 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Widt
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length
## $ scale : Named num [1:4] 0.828 0.436 1.765 0.762
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length
## $ x : num [1:150, 1:4] -2.26 -2.07 -2.36 -2.29 -2.38 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
The object pca is a list of 5: the standard deviations of the principal components, a matrix of
variable loadings, the centering and scaling used, and the data projected on the principal components.
Plot PCA
To plot the first two axes of variation along with species information, we will need to make a data
frame with this information. The axes are in pca$x.
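A minimal sketch of that step, assuming the pca object comes from prcomp() with scaling (which the str() output above suggests); the name pcaData matches the plotting code that follows.

```r
# PCA on the four numeric columns (repeated here so the block runs on its own)
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Keep the first two principal components and attach the species labels
pcaData <- as.data.frame(pca$x[, 1:2])
pcaData$Species <- iris$Species
head(pcaData)
```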
#Plot
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) + # define plot area
geom_point(size = 2) + # adding data points
coord_fixed() # fixing coordinates
Info
This is a decent plot showing us how the species relate based on characteristics of their sepals
and petals. From this plot, we see that Iris virginica and Iris versicolor are more similar to each
other than either is to Iris setosa.
But, the axes are missing the % explained variance. Let's add custom axis labels. We can do this with
the xlab() and ylab() functions or the function labs(). But first we need to grab some
information from our PCA. Let's use summary(pca). This function provides a summary
of results for a variety of model fitting functions and methods.
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
PC1 and PC2 combined account for 96% of variance in the data. We can add this information
directly to our plot using custom axes labels.
#Plot
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) +
geom_point(size = 2) +
coord_fixed() +
xlab("PC1: 73%")+ #x axis label text
ylab("PC2: 23%") # y axis label text
If you want to automate the "Proportion of Variance", you should call it directly in the code. For example,
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) +
geom_point(size = 2) +
coord_fixed() +
labs(x=paste0("PC1: ",summary(pca)$importance[2,1]*100,"%"),
y=paste0("PC2: ",summary(pca)$importance[2,2]*100,"%"))
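The labels built this way carry the unrounded percentages. As a hedged variation, the proportions can also be computed from pca$sdev and rounded before they are pasted into the labels; the names prop_var, x_label, and y_label are hypothetical, and scaling in prcomp() is assumed.

```r
# PCA on the four numeric columns of iris (scaling assumed)
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Rounded percentage labels for the first two axes
x_label <- paste0("PC1: ", round(100 * prop_var[1]), "%")
y_label <- paste0("PC2: ", round(100 * prop_var[2]), "%")
c(x_label, y_label)
```

These strings can then be passed to labs(x = x_label, y = y_label).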
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) +
geom_point(size = 2) +
coord_fixed() +
xlab("PC1: 73%")+
ylab("PC2: 23%") +
stat_ellipse(geom="polygon", level=0.95, alpha=0.2) #adding a stat
Plot customization
Themes
The plot above is looking pretty good, but there are many more features that can be customized
to make this publishable or fit a desired style. Changing non-data elements (related to axes,
titles, subtitles, gridlines, legends, etc.) of our plot can be done with theme(). ggplot2 has a
distinctive default style that comes from one of its built-in themes, theme_gray().
theme_gray() is one of eight complete themes provided by ggplot2.
We can also specify and build a theme within our plot code or develop a custom theme to be
reused across multiple plots. The theme function is the bread and butter of plot customization.
Check out ?ggplot2::theme() for a list of available parameters. There are many.
Let's see how this works by changing the fonts and text sizes and dropping minor grid lines:
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) + # define plot area
geom_point(size = 2) + # adding data points
coord_fixed() +
xlab("PC1: 73%")+
ylab("PC2: 23%") +
stat_ellipse(geom="polygon", level=0.95, alpha=0.2) +
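theme_bw() + #start with a complete theme
theme(axis.text=element_text(size=12,family="Times New Roman"),
axis.title = element_text(size=12,family="Times New Roman"),
legend.text = element_text(size=12,family="Times New Roman"),
legend.title = element_text(size=12,family="Times New Roman"),
panel.grid.minor = element_blank())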
You may want to establish a custom theme for reuse with a number of plots. See this great
tutorial (https://rpubs.com/mclaire19/ggplot2-custom-themes) by Madeline Pickens for steps on
how to do that.
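As a rough sketch of the idea (not the approach from the linked tutorial), a set of theme settings can be stored in an object and added to any plot. The name my_theme and the specific settings are hypothetical.

```r
library(ggplot2)

# A reusable theme: start from a complete theme, then adjust a few elements
my_theme <- theme_bw() +
  theme(axis.title = element_text(size = 12),
        legend.position = "bottom",
        panel.grid.minor = element_blank())

# Add the stored theme to a plot like any other layer
themed_plot <- ggplot(data = iris) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  my_theme
themed_plot
```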
Bioinformatics Training and Education Program
#defining colors
iriscolors<-setNames(c("blue","black","orange"),levels(iris$Species))
#Now plot
ggplot(pcaData) +
aes(PC1, PC2, color = Species, shape = Species) +
geom_point(size = 2) +
scale_color_manual(values=iriscolors)+ #Adding the color argument
coord_fixed() +
xlab("PC1: 73%")+
ylab("PC2: 23%") +
stat_ellipse(geom="polygon", level=0.95, alpha=0.2,aes(fill=Species)) +
scale_fill_manual(values=iriscolors)+ #Fill ellipses
theme_bw() +
theme(axis.text=element_text(size=12,family="Times New Roman"),
axis.title = element_text(size=12,family="Times New Roman"),
legend.text = element_text(size=12,family="Times New Roman"),
legend.title = element_text(size=12,family="Times New Roman"),
panel.grid.minor = element_blank())
We can use this color palette for all plots of these three species to keep our figures consistent
throughout a presentation or publication.
ggfortify is an excellent package to consider for easily generating PCA plots. ggfortify provides "unified
plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering and survival analysis.
The package offers a single plotting interface for these analysis results and plots in a unified style using 'ggplot2'."
(https://cran.r-project.org/web/packages/ggfortify/)
library(ggfortify)
autoplot(pca, data = iris, colour = 'Species',size=2)
Since this is a ggplot2 object, this can easily be customized by adding ggplot2 customization layers.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>
) +
<FACET_FUNCTION> +
<COORDINATE SYSTEM> +
<THEME>
Introducing EnhancedVolcano
There is a dedicated package for creating volcano plots available on Bioconductor, EnhancedVolcano
(https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html). Plots created using this package can be
customized using ggplot2 functions and syntax.
Using EnhancedVolcano:
Let's take a quick look at the data we loaded at the beginning of the lesson:
head(dexp_sigtrnsc)
topgenes
Significant differential expression was assigned based on an absolute log fold change greater
than or equal to 2 and an FDR corrected p-value less than 0.05.
Let's start our plot with the <DATA>, <GEOM_FUNCTION>, and <MAPPING>. We do not need
to fix the coordinate system because the x and y axes show different kinds of values,
and we don't need any special coordinate system modifications. Let's plot logFC on the x axis
and our log10-transformed false discovery rate corrected p-values on the y-axis and set
the significant p-values off from the non-significant by size and color. We can also go ahead
and customize the size and color scales, since we have learned how to do that.
ggplot(data=dexp_sigtrnsc) +
geom_point(aes(x = logFC, y = log10(FDR), color = Significant,
size = Significant,
alpha = Significant)) +
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(1, 2))
Immediately, you should notice that the figure is upside down compared to what we would
expect from a volcano plot. There are two possible ways to fix this. We could transform the
log10(FDR) values by multiplying by -1, or we could work with our axis scales. Aside from text
modifications, we haven't yet changed the scaling of the axes. Let's see how we can modify the
scale of the y-axis.
ggplot(data=dexp_sigtrnsc) +
geom_point(aes(x = logFC, y = log10(FDR), color = Significant,
size = Significant,
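alpha = Significant)) +
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(1, 2)) +
scale_y_reverse() #reverse the direction of the y axis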
This looks pretty good, but we can tidy it up more by working with our legend guides and our
theme.
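For comparison, the other fix mentioned above, negating the transformed values, can be sketched in base R. The FDR values here are made up for illustration.

```r
# Hypothetical FDR-corrected p-values (made up for illustration)
fdr <- c(0.5, 0.05, 0.001, 1e-06)

# log10() of small p-values is negative, so the most significant points sit lowest
log10(fdr)

# Multiplying by -1 puts the most significant points at the top of the y axis
neg_log10_fdr <- -1 * log10(fdr)
neg_log10_fdr
```

In the plotting code, this would correspond to mapping y = -log10(FDR) and leaving out scale_y_reverse().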
Modifying legends
We can modify many aspects of the figure legend using the function guides(). Let's see how
that works and go ahead and customize some theme arguments. Notice that the legend
position is specified in theme().
ggplot(data=dexp_sigtrnsc) +
geom_point(aes(x = logFC, y = log10(FDR), color = Significant,
size = Significant,
alpha = Significant)) +
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(1, 2)) +
scale_y_reverse(limits=c(0,-7)) + #we can also set the limits
guides(size = "none", alpha= "none",
color= guide_legend(title="Significant DE")) +
theme_bw() +
theme(
panel.border = element_blank(),
axis.line = element_line(),
panel.grid.major = element_line(size = 0.2),
panel.grid.minor = element_line(size = 0.1),
text = element_text(size = 12),
legend.position = "bottom",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
)
Lastly, let's layer another geom function to label our top six differentially abundant genes based
on significance. We can use geom_text_repel() from library(ggrepel), which is a
variation on geom_text().
library(ggrepel) #provides geom_text_repel()
plot_de<-ggplot(data=dexp_sigtrnsc) +
geom_point(aes(x = logFC, y = log10(FDR), color = Significant,
size = Significant,
alpha = Significant)) +
geom_text_repel(data=dexp_sigtrnsc %>%
filter(transcript %in% topgenes),
aes(x = logFC, y = log10(FDR),label=transcript),
nudge_y=0.5,hjust=0.5,direction="y",
segment.color="gray")+
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(1, 2)) +
scale_y_reverse(limits=c(0,-7)) + #we can also set the limits
guides(size = "none", alpha= "none",
color= guide_legend(title="Significant DE")) +
theme_bw() +
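theme(
panel.border = element_blank(),
axis.line = element_line(),
panel.grid.major = element_line(size = 0.2),
panel.grid.minor = element_line(size = 0.1),
text = element_text(size = 12),
legend.position = "bottom",
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
)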
plot_de
saveRDS(plot_de, "volcanoplot.rds")
References
Code for PCA was adapted from Learning R through examples
(https://gexijin.github.io/learnR/index.html) by Xijin Ge, Jianli Qi, and Rong Fan, 2021. Other
sources for content included R4DS (https://r4ds.had.co.nz/) and Holmes and Huber, 2021
(https://web.stanford.edu/class/bios221/book/).
• one or more geometric objects that serve as the visual representations of the
data, – for instance, points, lines, rectangles, contours,
• descriptions of how the variables in the data are mapped to visual properties
(aesthetics) of the geometric objects, and an associated scale (e.g., linear,
logarithmic, rank),
• optional parameters that affect the layout and rendering, such as text size, font
and alignment, legend positions.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>
) +
<FACET_FUNCTION> +
<COORDINATE SYSTEM> +
<THEME>
While we added stat_ellipse() to our PCA plots, scatter plots do not have a built-in statistical
transformation. Today, we will talk more about statistical transformations and how these impact
our plot representations. For a list of available statistical transformations in ggplot2 see
https://ggplot2-book.org/layers.html?q=stat#stat.
Libraries
In this lesson, we will continue to use the ggplot2 package for plotting. To use ggplot2, we first
need to load it into our R work environment using the library command.
library(ggplot2)
The Data
In this lesson we will use data obtained from a study that examined the effect that dietary
supplements at various doses have on guinea pig tooth length. This data set is built into R, so if
you want to take a look for yourself you can type data("ToothGrowth") either in the console
or in a script. But in this exercise, we will import the data to our R work environment because it
is more likely that we will import our own data for analysis.
Here, we are going to import two data sets using read.delim. The file data1.txt contains raw
data from the tooth growth study. The file data2.txt is summary level data with mean tooth length
and standard deviation pre-computed. We will assign data1.txt to object a1 and data2.txt to
object a2. Within read.delim we use sep='\t' to indicate that the columns in the data are
tab separated. After importing, we use the head command to look at the first 6 rows of each
data set.
head(a1)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
head(a2)
The tooth growth data set measured tooth length for two supplement types (OJ - orange juice,
VC - vitamin C) at three different doses (0.5, 1, and 2). Each supplement and dose combination
has 10 measurements so we have a total of 60 measurements in this data set. In a1, we have
the raw data. On the other hand, in a2, we pre-computed the mean tooth length and standard
deviation for the 10 measurements taken at each supplement and dose combination.
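If the course files are not at hand, objects shaped like a1 and a2 can be approximated from the built-in ToothGrowth data. This is only a sketch: the column names treat, mean_len, and sd are inferred from how a2 is used later in this lesson, and the format of the treat labels (for example OJ_0.5) is an assumption.

```r
# Raw data: one row per measurement (columns len, supp, dose)
a1 <- ToothGrowth

# Summary data: mean and standard deviation of len for each
# supplement and dose combination
a2 <- aggregate(len ~ supp + dose, data = a1, FUN = mean)
names(a2)[names(a2) == "len"] <- "mean_len"
a2$sd <- aggregate(len ~ supp + dose, data = a1, FUN = sd)$len

# Hypothetical treatment label combining supplement and dose
a2$treat <- paste(a2$supp, a2$dose, sep = "_")

head(a1)
a2
```

Both aggregate calls use the same formula, so their rows come back in the same order and the sd column lines up with mean_len.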
The column headings (colnames(a1),colnames(a2)) in the raw data (a1) and summary
level data (a2) are as follows:
Variable Types
Before diving into the construction of bar plots, box-and-whisker plots, and histograms, we should
quickly review the types of variables that we commonly work with in data analysis.
• age
• height
• weight
• Independent variable is a variable whose variation does not depend on another
• Dependent variable is a variable whose variation depends on another
• Factors are independent variables on which an experimental outcome depends.
Levels are the variations within a factor.
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs,
like bar charts, calculate new values to plot:
• bar charts, histograms, and frequency polygons bin your data and then plot
bin counts, the number of points that fall in each bin.
• smoothers fit a model to your data and then plot predictions from the model.
Let's explore the tooth growth data using plots. The tooth growth data has two independent
variables, supplement and dose (variables supp and dose, respectively) for which the
dependent variable tooth length (len) is measured. We would like to use the scatter plot to learn
about tooth growth as a function of both dose and supplement (supp). Remember from Lesson
3 that a scatter plot can be generated using geom_point.
Within the aesthetic mapping in geom_point, we assign dose to x, len (tooth length) to y, and
the second independent variable, supp (supplement), to color. This gives us a scatter plot with
dose along the x axis and len along the y axis. The legend shows which color corresponds to
each supplement (supp). Setting position=position_dodge(width=0.25) shifts the points from the
two supplements slightly apart so that they do not overlap. In short, the color argument allows
us to visualize how the dependent variable changes as a function of two independent variables.
ggplot(data=a1)+
geom_point(mapping=aes(x=dose,y=len,color=supp),
position=position_dodge(width=0.25))
Bar Plot
A barplot is used to display the relationship between a numeric and a categorical
variable. --- R Graph Gallery (https://r-graph-gallery.com/barplot.html)
The tooth growth data can also be visualized as a bar plot using geom_bar. However, if we try to
plot tooth length (len) across each dose using the code below, we will get an error. This is
because geom_bar uses stat_count by default, which counts the number of cases at
each x position. Thus, by default, geom_bar does not accept a y aesthetic.
ggplot(data=a1)+
geom_bar(mapping=aes(x=dose,y=len))
## Error in `f()`:
## ! stat_count() can only have an x or y aesthetic.
stat = count
Let's take a look at a bar plot constructed using the default stat="count" transformation.
Below, we plot the number of tooth length measurements taken at each dose. Setting
color="black" allows us to include a black outline to the bars for better readability. In
accordance with the description of this data, we see from the plot that 20 measurements were
taken at each dose.
ggplot(data=a1)+
geom_bar(mapping=aes(x=dose), color="black")
Given the above plot, how many of the 20 measurements taken at each dose came from the OJ
or VC group? To find out, we can set fill=supp. Like color in the scatter plot, fill allows
us to include a second independent variable in our graph. The plot below tells us that 10
measurements were taken from the OJ group and 10 were taken from the VC group at each
dose.
ggplot(data=a1)+
geom_bar(mapping=aes(x=dose, fill=supp), color="black")
By default, geom_bar stacks bars from different groups. If we do not like the arrangement, we
can use position_dodge to arrange the bars from the OJ and VC groups side-by-side.
ggplot(data=a1)+
geom_bar(mapping=aes(x=dose,fill=supp), color="black",
position=position_dodge())
stat = identity
Above, we learned about the number of tooth length measurements taken at each dose and
supplement combination using the default stat="count" transformation of geom_bar. But
what if we want to specify a y axis and plot exactly the values of our dependent variable in the y
axis? This can be done in geom_bar by setting stat="identity".
Below, we are plotting the mean tooth length (mean_len) across each of the treatment groups
(treat) using the summary level data, a2. Using stat="identity", the exact y value or
mean_len is plotted.
ggplot(data=a2)+
geom_bar(mapping=aes(x=treat,y=mean_len),stat="identity")
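As a side note, ggplot2 also provides geom_col(), which plots the y values as given and is equivalent to geom_bar(stat="identity"). The sketch below rebuilds a stand-in for a2 from ToothGrowth, so the treat labels are an assumption rather than the contents of data2.txt.

```r
library(ggplot2)

# Stand-in for the summary data (a2), rebuilt from ToothGrowth
a2 <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
names(a2)[3] <- "mean_len"
a2$treat <- paste(a2$supp, a2$dose, sep = "_")  # hypothetical label format

# geom_col() uses the y values directly, like geom_bar(stat = "identity")
p_col <- ggplot(data = a2) +
  geom_col(mapping = aes(x = treat, y = mean_len))
p_col
```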
What if we wanted to look at the raw values across the treatment groups using a bar plot? We
still use geom_bar but the aesthetic mapping will be similar to the scatter plot except we are
filling the bars to provide color coding for the supplements using fill. Again, because we
want ggplot2 to plot exact value in the y axis, we specify stat="identity" inside geom_bar.
To avoid stacking the values, we can use position_dodge2 in geom_bar to visualize each of
the 10 measurements taken at each supplement and dose combination arranged side-by-side.
ggplot(data=a1)+
geom_bar(mapping=aes(x=dose,y=len,fill=supp),stat="identity",
position=position_dodge2(),color="black")
Note that because R interprets dose as a continuous numeric variable (class(a1$dose)), ggplot2
gives us an extra dose of 1.5 on the x axis. However, the study did not measure tooth length at
a dose of 1.5 for either of the supplements. Thus, we would like to remove this dose.
Using factors
class(a1$dose)
## [1] "numeric"
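Using the factor function, we see that there are three levels for dose (0.5, 1, and 2). Converting
dose to a factor makes ggplot2 treat it as categorical, so only these three doses appear on the axis.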
factor(a1$dose)
## [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 1 1 1 1
## [20] 1 2 2 2 2 2 2 2 2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [39] 0.5 0.5 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
## [58] 2 2 2
## Levels: 0.5 1 2
ggplot(data=a1)+
geom_bar(mapping=aes(x=factor(dose),y=len,fill=supp),stat="identity"
,position=position_dodge2(),color="black")
We can reorder factors using the function factor. For instance if we want to plot the doses
backwards from highest to lowest on the x axis (i.e., 2, 1, 0.5) we can set the x axis in aesthetic
mapping of geom_bar to factor(dose,levels=c(2,1,0.5)), where in the levels
parameter, we are reassigning the order of the levels.
ggplot(data=a1)+
geom_bar(mapping=aes(x=factor(dose,
levels=c(2,1,0.5)),y=len,fill=supp),stat="identity",
position=position_dodge2(),color="black")
First, we use the summary level data (a2) to plot the mean tooth length (mean_len) across the
treatment groups, remembering to set stat="identity" because we are providing a y axis
and position=position_dodge to arrange the bars side-by-side.
ggplot(data=a2, mapping=aes(x=factor(dose),y=mean_len,fill=supp))+
geom_bar(stat="identity", position=position_dodge())
Next, we add geom_errorbar to incorporate error bars that illustrate plus/minus 1 standard
deviation from the mean. To set the upper and lower bounds of the error bar, we simply set
ymax=mean_len+sd and ymin=mean_len-sd within the aesthetic mapping of
geom_errorbar. Setting position=position_dodge in geom_errorbar again allows us
to separate the error bars from each of the supplement groups. Within the position_dodge
argument we set width=0.9 to help center the error bars with their respective bars. The
width parameter within geom_errorbar allows us to adjust the error bar width (set to 0.1
here).
ggplot(data=a2, mapping=aes(x=factor(dose),y=mean_len,fill=supp))+
geom_bar(stat="identity", position=position_dodge())+
geom_errorbar(aes(ymax=mean_len+sd,ymin=mean_len-sd),
position=position_dodge(width=0.9), width=0.1)
Histogram
Understanding data distribution can help us decide appropriate downstream steps in analysis,
such as which statistical test to use. A histogram is a good way to visualize distribution. It
divides the data into bins (intervals) and shows the number of occurrences in each bin.
Accordingly, the default statistical transformation for geom_histogram is stat_bin, which divides
the data into a user-specified number of bins (default is 30) and then counts the occurrences in
each bin. In geom_histogram, we can control either the number of bins through the bins argument
or the width of each bin through the binwidth argument. Note that stat_bin only works with
continuous variables.
Below, we construct a basic histogram using the len column in a1 (the raw data for the Tooth
Growth study). Note that within geom_histogram we do not need to explicitly state
stat="bin" because it is the default. The histogram below is not very aesthetically pleasing:
there are gaps, and it is difficult to see where one bin ends and the next begins.
ggplot(data=a1, mapping=aes(x=len))+
geom_histogram()
First, we will use the color argument in geom_histogram to assign a border color to help
distinguish the bins. Then we use the fill argument in geom_histogram to change the bars
associated with the bins to a color other than gray. Below we have a histogram of tooth length
with a default bin of 30.
ggplot(data=a1, mapping=aes(x=len))+
geom_histogram(color="black", fill="cornflowerblue")
ggplot(data=a1, mapping=aes(x=len))+
geom_histogram(color="black", fill="cornflowerblue", bins=7)
From the above, we see that altering the number of bins alters the binwidth (i.e., the range in
which occurrences are counted). Thus, altering the bins can influence the distribution that we
see. The histograms above seem to be left skewed, and many of the tooth length values fall
between 22.5 and 27.5 when 7 bins were used.
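The binwidth argument mentioned earlier can be used in place of bins. A minimal sketch, assuming the raw data match the built-in ToothGrowth data and picking a bin width of 5 purely for illustration:

```r
library(ggplot2)

a1 <- ToothGrowth  # stand-in for the imported raw data

# Each bin now spans 5 units of tooth length
p_bw <- ggplot(data = a1, mapping = aes(x = len)) +
  geom_histogram(color = "black", fill = "cornflowerblue", binwidth = 5)
p_bw
```

Setting binwidth fixes the width of each bin and lets ggplot2 work out how many bins are needed, whereas bins does the reverse.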
Box plot
Box and whisker plots also show data distribution. Unlike with a histogram, we can readily see
summary statistics such as the median, 25th and 75th percentiles, and maximum and minimum.
The default statistical transformation of a box plot in ggplot2 is stat_boxplot, which calculates
the components of the boxplot. To construct a box plot in ggplot2, we use the geom_boxplot
function. Note that within geom_boxplot we do not need to explicitly state stat="boxplot"
because it is the default. Below, we have a default boxplot depicting tooth length across the
treatment groups. Potential outliers are presented as points.
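A default boxplot of this kind could be produced roughly as follows. This is a sketch, assuming a1 matches the built-in ToothGrowth data; the choice to map factor(dose) to x and supp to fill mirrors the bar plots above and is an assumption about the original figure.

```r
library(ggplot2)

a1 <- ToothGrowth  # stand-in for the imported raw data

# One box per supplement at each dose; stat = "boxplot" is the default
p_box <- ggplot(data = a1, mapping = aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot()
p_box
```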
Rather than showing the outliers, we could instead add on geom_point to overlay the data
points. Here, in geom_boxplot we set outlier.shape=NA to remove the outliers for the
purpose of avoiding duplicating a data point. Within geom_point we set the position of the
points to position_jitterdodge to avoid overlapping of points whose values are close
together (set by jitter.width) and overlapping of points from measurements derived from
different supplements (dodge.width).
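The overlay described above can be sketched as follows, again using ToothGrowth as a stand-in for a1. The jitter.width and dodge.width values are illustrative assumptions, not the values used for the original figure.

```r
library(ggplot2)

a1 <- ToothGrowth  # stand-in for the imported raw data

p_box_pts <- ggplot(data = a1, mapping = aes(x = factor(dose), y = len, fill = supp)) +
  # hide the outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # jitter points horizontally and dodge them by supplement
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.75))
p_box_pts
```

A dodge.width of 0.75 matches the default dodge of geom_boxplot, which keeps the points centered over their boxes.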
• Tooth length appears to be longer for the OJ treated group at doses of 0.5 and 1
• Tooth length appears to be equal for both the OJ and VC groups at a dose of 2
• At a dose of 0.5 and 1, the median (line inside the box) for the OJ group is larger than the
VC group
• At a dose of 0.5 and 1, the interquartile range (IQR) which is defined by the lower (25th
percentile) and upper (75th percentile) bounds of the box along the vertical axis is larger
for the OJ group as compared to VC - so there is more variability in the OJ group
measurements.
• At a dose of 2, the median for both the OJ and VC group are roughly equal
• At a dose of 2, the IQR for the VC group is larger than that for the OJ group
1. Introduce the heatmap and dendrogram as tools for visualizing clusters in data.
What is a heatmap?
A heatmap is a graphical representation of data where the individual values
contained in a matrix are represented as colors. --- R Graph Gallery
(https://r-graph-gallery.com/heatmap.html)
Heatmaps are appropriate when we have lots of data because color is easier to interpret and
distinguish than raw values. --- Dundas BI
(https://www.dundas.com/resources/blogs/best-practices/when-and-why-to-use-heat-maps)
What is a dendrogram?
A dendrogram (or tree diagram) is a network structure and can be used to visualize
hierarchy or clustering in data. --- R Graph Gallery
(https://r-graph-gallery.com/dendrogram.html)
Applications of dendrograms
Dendrograms are used in phylogenetics to help visualize relatedness of or dissimilarities
between species.
In RNA sequencing, a dendrogram can be combined with a heatmap to show clustering of samples
by gene expression or clustering of genes that are similarly expressed (Figure 1).
Figure 1: Heatmap and dendrogram showing clustering of samples with similar gene expression
and clustering of genes with similar expression patterns.
Furthermore, heatmaps and dendrograms can be used as diagnostic tools in high-throughput
sequencing experiments. As an example, we can look at the heatmap and dendrogram in
Figure 2. In Figure 2, the heatmap shows the correlation between RNA sequencing samples, with
the idea that biological replicates should be more highly correlated with each other than with
samples from other treatment groups. The dendrogram clusters similar samples together. Figure 2
also shows that heatmaps can be used to visualize correlation.
• It appears that this by default does not generate a legend showing the correlation
between values and color.
• It also appears that assigning distance calculation and clustering methods are not
intuitive for the users.
• It appears assigning distance calculation and clustering methods are not intuitive for the
users.
• To learn about heatmap.2, see
https://cran.r-project.org/web/packages/gplots/index.html
ComplexHeatmap
• There is no scaling option, so the user will have to scale the data separately using scale
(see https://support.bioconductor.org/p/68340/ and
https://github.com/jokergoo/ComplexHeatmap/issues/313 for a discussion on scaling in
ComplexHeatmap).
• To learn about ComplexHeatmap, see
https://www.bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html.
pheatmap
heatmaply
• This package generates interactive heatmaps that allow the user to mouse over a tile to
see information such as sample id, gene, and expression value.
• To learn about heatmaply, see
https://cran.r-project.org/web/packages/heatmaply/vignettes/heatmaply.html
While most of the tools listed above can be used to produce publication quality heatmaps, we
find that pheatmap is perhaps the most comprehensive. Therefore, in this class, we will show
how to construct heatmaps using pheatmap. Because heatmaps can be filled with a lot of data,
we will also demonstrate the use of heatmaply to construct interactive heatmaps that you could
use to explore your data more efficiently.
Import data
The data that we will be working with comes from the airway study that profiled the
transcriptome of several airway smooth muscle cell lines under either control or dexamethasone
treatment, Himes et al. 2014 (http://www.ncbi.nlm.nih.gov/pubmed/24926665). The dataset is
available from Bioconductor
(https://bioconductor.org/packages/release/data/experiment/html/airway.html).
Specifically, our dataset contains the normalized count values (log2 counts per million, or
log2 CPM) for the top 20 differentially expressed genes. This data is saved as the
comma-separated file RNAseq_mat_top20.csv, so we will use the read.csv command to
import it.
mat<-read.csv("./data/RNAseq_mat_top20.csv",header=TRUE,row.names=1,
check.names=FALSE)
We will now use head to look at the first 6 rows of mat. The column headings represent sample
names and the row names are the genes.
head(mat)
pheatmap(mat)
Distance calculation
The idea behind cluster analysis is to calculate some sort of distance between objects in order
to identify the ones that are closer together. When two objects have a small distance, we can
conclude they are closer and should cluster together. On the other hand, two objects that are
further apart will have a larger distance. There are various approaches to calculating distance in
cluster analysis so considerations should be taken for choosing the appropriate one. To learn
more about distance calculation methods as well as advantages and disadvantages of each
see Shirkhorshidi et al, PLOS ONE, 2015 (https://journals.plos.org/plosone/article?id=10.1371/
journal.pone.0144059). In pheatmap, we can specify the clustering distance using either the
clustering_distance_rows argument or clustering_distance_cols depending on
whether we would like to cluster by row or column variables.
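As an illustration of these two arguments, the sketch below uses z-scored mtcars columns as a stand-in for an expression matrix (an assumption made so the example runs without the course data). It computes a Euclidean distance directly with base R and then requests distance measures through pheatmap.

```r
library(pheatmap)

# Example matrix: z-scored mtcars columns (stand-in for an expression matrix)
m <- scale(as.matrix(mtcars))

# Euclidean distance between rows, computed directly with base R
d_rows <- dist(m, method = "euclidean")

# The same row distance requested through pheatmap; columns are clustered
# on a correlation-based distance instead
hm_dist <- pheatmap(m,
                    clustering_distance_rows = "euclidean",
                    clustering_distance_cols = "correlation")
```

The object returned by pheatmap keeps the row and column trees (tree_row and tree_col), which can be inspected to see how the choice of distance changes the clustering.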
Cluster generation
After the distance matrix has been calculated, it is time to perform the actual clustering and,
again, various approaches can be used to generate clusters. The following resources are good
for learning about the various hierarchical clustering methods. In pheatmap, the clustering
method is specified by the clustering_method argument.
• https://hlab.stanford.edu/brian/forming_clusters.htm
• https://dataaspirant.com/hierarchical-clustering-algorithm/
• https://www.learndatasci.com/glossary/hierarchical-clustering/
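To see how the choice of linkage method changes the clusters, the base R function hclust can be run on the same distance matrix with different methods. This is a sketch on z-scored mtcars data (an assumed stand-in for the course matrix); the method names are the same strings that pheatmap's clustering_method argument accepts.

```r
# Example matrix and distance (stand-in for the course data)
m <- scale(as.matrix(mtcars))
d <- dist(m)

# Two common linkage methods applied to the same distances
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Compare the resulting dendrograms
plot(hc_complete, cex = 0.6, main = "Complete linkage")
plot(hc_average, cex = 0.6, main = "Average linkage")

# Cut each tree into 3 clusters and cross-tabulate the assignments
table(cutree(hc_complete, k = 3), cutree(hc_average, k = 3))
```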
Scaling
Prior to sending our data into the heatmap-generating algorithm, it is a good idea to scale.
There are several reasons for doing this:
• Variables in the data might not have the same units; without scaling we would be, to
borrow a phrase, comparing apples to oranges.
• Scaling allows us to discern patterns in variables with low values when plotting on the
color scale. Without scaling, variables with large values will drown out the signal from
those with low values. We will see an example of this using the mtcars data.
• Scaling also prevents variables with large values from contributing too much weight to the
distance calculation (see
https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7).
Without scaling, it will be hard to discern whether variables with lower values contribute
to separation.
A common method for scaling is to use the z score, z = (x - mean) / sd, which tells us how
many standard deviations a given value in our data lies from the mean. This is the scaling
method used by pheatmap.
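The z score can be checked by hand against the base R scale function. A minimal sketch using the mpg column of mtcars:

```r
x <- mtcars$mpg

# z score computed by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and scales in one step; as.numeric() drops its attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
round(c(mean = mean(z_manual), sd = sd(z_manual)), 3)
```

After scaling, the values have a mean of 0 and a standard deviation of 1, which puts every variable on a comparable footing.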
Below, we will use the mtcars data to look at how scaling influences a heatmap. Because
mtcars is built into R, we can use the data command to load it and we will save this as an
object named cars. Next, we will use the head command to view the first few rows of the cars
data.
data(mtcars)
cars <- mtcars
head(cars)
Note that variables like disp and hp have larger magnitudes than one like mpg. Also,
the variables in this data do not have the same units. If we constructed a heatmap of the
mtcars data without scaling, we would not be able to discern patterns in variables like mpg among
the samples. This is because the values for mpg are small in comparison to those for disp and
hp, so they get squeezed towards the bottom of the color scale.
pheatmap(cars)
However, if we scale the data, it becomes easier to observe differences in values for each of the
variables. We are interested in the differences in each variable across the car types; thus, we
scale by column because the sample names (i.e., car brands) are listed down the rows. If the car
brands were listed across columns, we would have scaled by row.
pheatmap(cars, scale="column")
Getting back to RNA seq and something more biologically relevant, we can scale by row in mat.
pheatmap(mat, scale="row")
pheatmap(mat,scale="row",
color=colorRampPalette(c("navy", "white", "red"))(50))
Using %>%, we pass the dfh data frame to the column_to_rownames command to set the
rownames of the dfh data frame to the sample IDs. Finally, we will change the values in the dex
column to either untrt (untreated) or trt (treated) using ifelse to check if the row names of dfh
(rownames(dfh)) are samples 508, 512, 516, or 520. If yes, then the value for these samples
in the dex column becomes untrt. In other words, 508, 512, 516, and 520 are untreated
samples. Else, we will assign the samples to the trt group, which indicates they are treated.
"untrt","trt")
dfh
## dex
## 508 untrt
## 509 trt
## 512 untrt
## 513 trt
## 516 untrt
## 517 trt
## 520 untrt
## 521 trt
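An annotation data frame like the one printed above can also be built with base R alone. This sketch is not the course's original code (which uses %>% and column_to_rownames); the sample IDs are taken from the printed output.

```r
# Sample IDs as they appear in the printed dfh
samples <- c("508", "509", "512", "513", "516", "517", "520", "521")

# One row per sample, with the sample IDs as row names
dfh <- data.frame(dex = rep(NA_character_, length(samples)),
                  row.names = samples)

# Samples 508, 512, 516, and 520 are untreated; all others are treated
dfh$dex <- ifelse(rownames(dfh) %in% c("508", "512", "516", "520"),
                  "untrt", "trt")
dfh
```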
We can add the sample treatment annotation by setting the annotation_col argument to dfh
in pheatmap. We use annotation_col rather than annotation_row because the sample
IDs are listed along the horizontal axis and therefore correspond to the columns of the
heatmap. The result is that the samples are now color coded by the treatment group to which
they belong, and this color coding is provided in the legend.
Using the annotation_colors argument, we can reassign the colors of the sample to
treatment mapping legend.
If we include cutree_rows=2, then the rows of the heatmap will be split into two clusters. Note
that the split places genes that are down-regulated in the treated group and up-regulated in
the untreated group in the top cluster. The bottom cluster contains the genes that are
up-regulated by dexamethasone treatment but down-regulated when not treated.
A title for our heatmap can be included using the main argument.
The last few customizations we will make to the heatmap are to adjust the font size using the
fontsize argument and to adjust cellwidth to move the treatment legend into the plot canvas.
We also use cellheight to adjust the height of the heatmap to fill more of the plotting canvas.
Saving a non-ggplot
Recall that we can use the ggsave command to save a ggplot. However, heatmaps generated
using the pheatmap package are not ggplots, therefore we need to turn to either the image
export feature (Figure 7) in R studio or use one of the several image saving commands.
• jpeg
• bmp
• tiff
• png
• pdf
The bitmap commands (jpeg, bmp, tiff, and png) take the name of the file in which we would like
to save the image, the resolution (res), image width (width), image height (height), and the
units of the image dimensions (units) as arguments; pdf takes the file name along with a width
and height in inches. Below, we use png to save our heatmap as the file pheatmap_1.png at
300 dpi, as specified in res. The workflow is to first open the file using one of the image
saving commands, then generate the plot, and finally call dev.off() to close the current
graphics device. If we do not call dev.off(), the file is not finalized, and subsequent plots
will continue to be sent to that file instead of showing up in the plot pane.
dev.new()
png("./data/pheatmap_1.png", res=300, width=7, height=4.5, units="in")
pheatmap(mat,scale="row", annotation_col = dfh,
annotation_colors=list(dex=c(trt="orange",untrt="black")),
color=colorRampPalette(c("navy", "white", "red"))(50),
cutree_cols=2, cutree_rows=2,
main="Expression and clustering of top DE genes",
fontsize=11, cellwidth=35, cellheight=10.25)
dev.off()
## quartz_off_screen
## 3
hm_gg<-as.ggplot(pheatmap(mat,scale="row", annotation_col=
dfh,annotation_colors =list(dex=c(trt="orange",
untrt="black")),color=colorRampPalette(c("navy",
"white", "red"))(50)))
Save as an R object
Below, we assign the heatmap to the R object hm_ph and we can import this back to R in the
future.
saveRDS(hm_ph, file="./data/airways_pheatmap.rds")
We will wrap up lesson 5 with an introduction to the package heatmaply, which can be used to
generate interactive heatmaps. Below is a basic interactive heatmap generated using this
package for the airway top 20 differentially expressed genes (mat).
Similar to pheatmap, we will start by providing heatmaply the data that we would like to plot. We
can also scale by row like we did in pheatmap. Plot margins can also be set to ensure the entire
plot fits on the canvas. Like in pheatmap, we assign the plot title using the argument main.
Setting col_side_colors to the data frame dfh, which contains the sample to treatment
mapping, creates a legend that spans the columns of the heatmap, informing us of the
treatment group to which the samples belong.
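A call along these lines might look like the sketch below. Because the course matrix is not bundled here, it uses a simulated 20 x 8 matrix and a matching annotation data frame; the object names, the title, and the margin values are all assumptions for illustration.

```r
library(heatmaply)

# Simulated stand-in for the expression matrix: 20 genes x 8 samples
set.seed(1)
samples <- c("508", "509", "512", "513", "516", "517", "520", "521")
mat_demo <- matrix(rnorm(20 * 8), nrow = 20,
                   dimnames = list(paste0("gene", 1:20), samples))

# Sample-to-treatment mapping, one row per sample
dfh_demo <- data.frame(dex = rep(c("untrt", "trt"), 4), row.names = samples)

hm_int <- heatmaply(mat_demo,
                    scale = "row",                  # z score each gene
                    margins = c(50, 50, 50, 50),    # plot margins
                    main = "Demo interactive heatmap",
                    col_side_colors = dfh_demo)     # treatment annotation bar
hm_int
```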
heatmaply example
Objectives
1. Combine multiple plots into a single figure
The primary purpose of this lesson is to learn how to combine multiple figures into a single
multi-panel figure using patchwork and a few features of cowplot. While we will learn how to
customize and arrange plots in a multi-figure panel, this is not a comprehensive lesson on all
aspects of patchwork and cowplot. If you have something specific in mind for your own
data, I implore you to read the documentation for these packages to understand their full
potential for customization.
Science 47.73 6
Example multi-figure panel from Zhang et al. (2022). Longitudinal single-cell RNA-seq analysis
reveals stress-promoted chemoresistance in metastatic ovarian cancer. Science Advances, 8(8),
eabm1831.
To get started, load the libraries. All packages used today can be installed from CRAN using
install.packages().
#Get patchwork
#To install use install.packages('patchwork')
#load
library(patchwork)
#Get cowplot
#To install use install.packages('cowplot')
#load
library(cowplot)
The Data
This is the sixth lesson in our Data Visualization with R Series. At this point, we have created
quite a few plots. For this lesson, we will focus on the RNA-Seq plots that we created in previous
lessons. We will also include other related plots that were created using the same RNA-Seq
data, but were not created throughout this course series. All plots were saved as R objects
(.rds). To load the data into R, we will need to use the readRDS() function.
Let's load and view our plots. To view our plots, we can simply call the objects by name.
pca<-readRDS("./data/airwaypca.rds")
volcano<-readRDS("./data/volcanoplot.rds")
hmap<-readRDS("./data/airwayhm.rds")
sc<-readRDS("./data/stripchart.rds")
#view objects
pca
volcano
hmap
sc
What is cowplot?
The cowplot package provides various features that help with creating publication-
quality figures, such as a set of themes, functions to align plots and arrange them
into complex compound figures, and functions that make it easy to annotate plots
and or mix plots with images. The package was originally written for internal use in
the Wilke lab, hence the name (Claus O. Wilke’s plot package). --- cowplot 1.1.1
(https://wilkelab.org/cowplot/index.html)
The main function for combining figures in cowplot is plot_grid(). Let's check out the help
documentation using ?plot_grid. The most important inputs are the plots we want to combine,
which can be passed directly as arguments or as a list through plotlist.
Let's check out the basic use of this function by calling the plots we want to combine and by
providing labels using the labels argument.
plot_grid(pca,volcano,hmap,sc, labels="AUTO")
This figure isn't bad. Though, we would likely want to change the relative sizes of the plots and
work on the plot alignments. This can be done with rel_heights, rel_widths, align, and
axis. We are not going to work with these today because cowplot does not do well with plot
alignments when a given plot's aspect ratio has been fixed (e.g., coord_fixed(),
coord_equal()).
However, I do want to point out some of the options for label customization before moving on to
ggdraw().
There are quite a few parameters to adjust figure labels. To re-position labels, see label_x,
label_y, hjust, and vjust. These each take either a single value to move all labels or a
vector of values, one for each subplot. We can also change the size of the labels
(label_size), the font (label_fontface), the label color (label_colour), and the font
type (label_fontfamily). In general, there seem to be more options for plot customization
(without adding ggplot2 layers) with cowplot.
Here we changed the size of the labels to 14 pt and the color to blue. Also, the labels are now
bold and italicized, and the font is Times New Roman. We also re-positioned the labels.
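Label settings of this kind could be written as in the sketch below. It uses two placeholder plots built from mtcars rather than the course plots, and the generic "serif" family in place of Times New Roman, since font availability differs between systems; the label positions are illustrative.

```r
library(ggplot2)
library(cowplot)

# Placeholder plots (stand-ins for the course plots)
p_a <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_b <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

labelled <- plot_grid(p_a, p_b, labels = "AUTO",
                      label_size = 14,                  # 14 pt labels
                      label_colour = "blue",
                      label_fontface = "bold.italic",
                      label_fontfamily = "serif",       # assumed font family
                      label_x = 0.05, label_y = 0.98,   # position in each panel
                      hjust = 0, vjust = 1)
labelled
```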
You can nest figures by combining figures using plot_grid(), saving that to an object, and
then plotting those pre-combined figures with another figure using plot_grid() again. Shared
legends can be obtained using cowplot's get_legend(). cowplot also has its own function
to save plots (save_plot()), which is a bit more dynamic for multi-figure panels, but you may
also use ggsave().
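Nesting and saving can be sketched as follows, using placeholder plots from mtcars. The output file name and the size settings are assumptions, and the file is written to a temporary directory.

```r
library(ggplot2)
library(cowplot)

# Placeholder plots (stand-ins for the course plots)
p_a <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_b <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p_c <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Step 1: combine two plots into a row and save the result as an object
top_row <- plot_grid(p_a, p_b, labels = c("A", "B"))

# Step 2: combine that row with a third plot
nested <- plot_grid(top_row, p_c, ncol = 1,
                    labels = c("", "C"), rel_heights = c(1, 1.2))

# Save with cowplot's save_plot()
out_file <- file.path(tempdir(), "nested_demo.png")
save_plot(out_file, nested, base_height = 6, base_asp = 1)
```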
Note: Grobs are graphical objects that you can make and change with grid
graphics functions. The ggplot2 package is built on top of grid graphics, so the grid
graphics system “plays well” with ggplot2 objects. --- Peng, Kross, and Anderson,
2020 (https://bookdown.org/rdpeng/RProgDA/the-grid-package.html)
ggdraw() creates a new ggplot2 canvas without visible axes or background grid.
The draw_* functions are simply wrappers around regular geoms. --- cowplot
documentation (https://wilkelab.org/cowplot/articles/drawing_with_on_plots.html)
pca + draw_image(plos,x=-15,y=-2,width=30,height=20)
p1<-plot_grid(NULL,a,NULL,(pca+theme(legend.position="right")),labels=c("","","
bc<-plot_grid(volcano,sc,labels = c("B","C"),align="h",axis="b")
## Warning: Using alpha for a discrete variable is not advised.
p2<-plot_grid(p1,NULL,bc,ncol=1,align="v",axis="l",rel_heights = c(1,0.05,1.75)
p2b<-plot_grid(hmap,NULL,nrow=1,align="h",axis="b",rel_widths=c(0.65,0.25))
p3<-plot_grid(NULL,p2,NULL,p2b,ncol=1,labels=c("","","D"),align="v",axis="l",re
labp<-ggdraw(p3)+
draw_label("Airway Data", color = "Black", size = 14, x=0.1,y=0.97,hjust=0.6
labp
What is patchwork?
The goal of patchwork is to make it ridiculously simple to combine separate ggplots
into the same graphic. As such it tries to solve the same problem as
gridExtra::grid.arrange() and cowplot::plot_grid but using an API that incites
exploration and iteration, and scales to arbitrarily complex layouts. --- Thomas Lin
Pedersen, Patchwork documentation (https://patchwork.data-imaginist.com/index.html).
Patchwork allows users to combine plots using simple arithmetic operators such as + and /.
pca + volcano
The last plot included in patchwork statements is considered the active plot, to which we can
add additional ggplot2 layers. Notice the seamless alignment of these plots. Without any
additional parameters, coord_fixed() is maintained in the pca plot. patchwork does a lot
better with a fixed aspect plot.
pca + volcano +
theme(legend.position="right")
We can continue to add plots using the + symbol, and patchwork will try to form a grid,
proceeding from left to right row-wise. Let's see this in action.
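For instance, adding four plots with + produces a two-by-two grid. The sketch below uses placeholder plots built from mtcars rather than the course plots.

```r
library(ggplot2)
library(patchwork)

# Placeholder plots (stand-ins for pca, volcano, hmap, and sc)
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
p4 <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)

# Plots are placed left to right, row by row
grid4 <- p1 + p2 + p3 + p4
grid4
```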
The plot layout can be controlled further with additional operators. The | symbol is used to
place plots side by side, while the / symbol is used to stack plots vertically. Layouts
follow R's operator precedence (think back to PEMDAS): / is evaluated before +, which in
turn is evaluated before |.
#vertical stacking
pca / volcano
#horizontal
pca | volcano
pca | volcano / sc
For specific layouts, it is a good idea to use parentheses for correct evaluation.
For example, what if we want our pca and heatmap side by side stacked on top of our box plot?
pca | hmap / sc
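Because / binds more tightly than |, the expression without parentheses stacks the second and third plots first and then places that stack beside the first plot. Grouping with parentheses gives the side-by-side pair on top of the third plot. A sketch with placeholder plots built from mtcars:

```r
library(ggplot2)
library(patchwork)

# Placeholder plots (stand-ins for pca, hmap, and sc)
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Evaluated as p1 | (p2 / p3): p1 on the left, p2 over p3 on the right
layout_default <- p1 | p2 / p3

# Parentheses force the side-by-side pair first, then stack it on p3
layout_grouped <- (p1 | p2) / p3
layout_grouped
```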
Using plot_layout()
The function plot_layout() can be used to control the layout, combine legends, and
overwrite plot titles.
#combine legends
pca1 + sc1 + plot_layout(ncol =1, guides = 'collect')
Changing the relative sizes of the figures alters the alignment. Units may also be specified to
control the height or width.
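As a minimal sketch of both options, the example below sets relative widths and then mixes an absolute height with a relative one. The plots p1 and p2 are stand-ins built from mtcars, and the sizes are arbitrary values chosen for illustration.

```r
library(ggplot2)
library(patchwork)

# Stand-in plots built from mtcars (hypothetical; not the course objects)
p1 <- ggplot(mtcars) + geom_point(aes(x = wt, y = mpg))
p2 <- ggplot(mtcars) + geom_bar(aes(x = factor(cyl)))

# Relative sizes: the first column is twice as wide as the second
rel_sizes <- p1 + p2 + plot_layout(ncol = 2, widths = c(2, 1))

# Units: the top panel is fixed at 5 cm tall and the bottom panel
# takes up the remaining space ("null" is grid's relative unit)
fixed_top <- p1 + p2 +
  plot_layout(ncol = 1, heights = grid::unit(c(5, 1), c("cm", "null")))

rel_sizes
fixed_top
```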
It is possible to design unique (non-grid) layouts using patchwork, but it seems a bit more
difficult than cowplot in that regard.
Adding a spacer
We can add a spacer using plot_spacer() to add blank sections to our plot. These blank
sections are the size of our figure panels and may be different depending on how the plot is
arranged.
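A short sketch of plot_spacer() in use, again with stand-in plots built from mtcars rather than the course objects:

```r
library(ggplot2)
library(patchwork)

# Stand-in plots built from mtcars (hypothetical; not the course objects)
p1 <- ggplot(mtcars) + geom_point(aes(x = wt, y = mpg))
p2 <- ggplot(mtcars) + geom_bar(aes(x = factor(cyl)))

# A blank panel between the two plots
spaced <- p1 + plot_spacer() + p2

# The spacer is laid out like any other panel, so plot_layout() can resize it
narrow_gap <- p1 + plot_spacer() + p2 + plot_layout(widths = c(4, 1, 4))

spaced
narrow_gap
```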
Including a title
patchf
Notice that the assignment of "A" and "B" was automatic with the tag_levels parameter. The
parentheses here are required. Also, & was used to apply theme() to all subplots in the
patchwork.
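A figure of this kind might be assembled roughly as follows. This is only a sketch: p1 and p2 are stand-in plots built from mtcars, and the title text and theme settings are assumptions made for illustration, not the code that produced patchf.

```r
library(ggplot2)
library(patchwork)

# Stand-in plots built from mtcars (hypothetical; not the course objects)
p1 <- ggplot(mtcars) + geom_point(aes(x = wt, y = mpg))
p2 <- ggplot(mtcars) + geom_bar(aes(x = factor(cyl)))

# + binds more tightly than |, so the parentheses make sure the
# annotation is added to the whole patchwork rather than to p2 alone.
# & has lower precedence than +, so theme() is applied last, to every subplot.
patch_demo <- (p1 | p2) +
  plot_annotation(title = "mtcars overview", tag_levels = "A") &
  theme(plot.tag = element_text(face = "bold"))

patch_demo
```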
ggsave("patchf.png",height=5,width=4.25,dpi=300,units="in",scale=2)
Acknowledgements
Content in this tutorial was adapted from information in the cowplot documentation
(https://wilkelab.org/cowplot/articles/plot_grid.html) and patchwork documentation
(https://patchwork.data-imaginist.com/).
Course Data
Course data is available in the attached zipped archive: data.zip.
Column Description
Survived 0 = No, 1 = Yes
Load ggplot2.
library(ggplot2)
Exercise Questions
Question 1
{{Sdet}}
Possible Solution{{Esum}}
titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166
{{Edet}}
Question 2
Explore the data. What is the structure of the data? Try str(). What are the column names? Try
colnames(). How can you get help if you do not know how to use these functions?
{{Sdet}}
Possible Solution{{Esum}}
{{Edet}}
Question 3
Make a simple scatter plot. Is there a relationship between the age of the passenger and the
passenger fare?
{{Sdet}}
Possible Solution{{Esum}}
ggplot(titanic) +
geom_point(aes(x=Age, y=Fare))
{{Edet}}
Question 4
Color the points from question 3 by Pclass. Remember that Pclass is a proxy for socioeconomic
status. While the values are treated as numeric upon loading, they are really categorical and
should be treated as such. You will need to coerce Pclass into a categorical (factor) variable.
See factor() and as.factor().
{{Sdet}}
Possible Solution{{Esum}}
ggplot(titanic) +
geom_point(aes(x=Age, y=Fare, color=as.factor(Pclass)))
{{Edet}}
Question 5
Manually scale the colors in question 4. 1st class = yellow, 2nd class = purple, 3rd class =
seagreen. Also change the legend labels (1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class).
{{Sdet}}
Possible Solution{{Esum}}
ggplot(titanic) +
geom_point(aes(x=Age, y=Fare, color=as.factor(Pclass)))+
scale_color_manual(values=c("yellow","purple","seagreen"),
labels=c("1st Class","2nd Class","3rd Class"))
{{Edet}}
Question 6
Facet the plot made in 5 by the column 'Sex'.
{{Sdet}}
Possible Solution{{Esum}}
ggplot(titanic) +
geom_point(aes(x=Age, y=Fare, color=as.factor(Pclass))) +
scale_color_manual(values=c("yellow","purple","seagreen"),
labels=c("1st Class","2nd Class","3rd Class")) +
facet_wrap(~Sex)
{{Edet}}
Challenge question 1
Let's use some other geoms. Plot the number of passengers (a simple count) that survived by
ticket class and facet by sex.
{{Sdet}}
Possible Solution{{Esum}}
ggplot(titanic) +
geom_bar(aes(x=Pclass, fill=factor(Survived)),
position=position_dodge()) +
facet_wrap(~Sex)+
labs( y="Number of Passengers", x="Passenger Class",
title="Titanic Survival Rate by Passenger Class")
{{Edet}}
Challenge question 2
Add a variable to the data frame called age_cat (child = <12, adolescent = 12-17, adult = 18+).
Plot the number of passengers (a simple count) that survived by age_cat, fill by Sex, and facet
by class and survival.
{{Sdet}}
Possible Solution{{Esum}}
library(dplyr)
titanic %>%
mutate(age_cat= case_when(Age < 12 ~ "child",
Age >= 12 & Age < 18 ~ "adolescent",
Age >= 18 ~ "adult"
)) %>%
ggplot() +
geom_bar(aes(x=age_cat, fill=factor(Sex)),
position=position_dodge()) +
facet_grid(Pclass~Survived)+
labs( y="Number of Passengers", x="Age Category",
title="Titanic Survival")
{{Edet}}
Let's use the dataset mtcars. According to the help documentation (?mtcars), "the data was
extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10
aspects of automobile design and performance for 32 automobiles (1973–74 models)." Each
question below will depend on code from the previous question.
Question 1
Let's check out the structure of the data.
{{Sdet}}
Possible Solution{{Esum}}
str(mtcars)
{{Edet}}
Question 2
How might we plot automobile weight (wt) versus miles per gallon (mpg)?
{{Sdet}}
Possible Solution{{Esum}}
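One way to approach this is a basic scatter plot with weight on the x-axis. The object name p_q2 is only used here for illustration.

```r
library(ggplot2)

# Scatter plot of weight (wt) against miles per gallon (mpg)
p_q2 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

p_q2
```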
{{Edet}}
Question 3
What if we want to represent the number of cylinders (cyl) by color and shape?
{{Sdet}}
Possible Solution{{Esum}}
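A possible sketch: cyl is stored as a number, so it is wrapped in factor() to get discrete colors and shapes. The object name p_q3 is only used here for illustration.

```r
library(ggplot2)

# Map the number of cylinders to both color and shape;
# factor() makes ggplot2 treat cyl as categorical
p_q3 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg,
                 color = factor(cyl), shape = factor(cyl)))

p_q3
```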
{{Edet}}
Question 4
Make the size of the points change by the quarter mile time (qsec).
{{Sdet}}
Possible Solution{{Esum}}
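A possible sketch that builds on the previous question by mapping qsec to point size inside aes(). The object name p_q4 is only used here for illustration.

```r
library(ggplot2)

# Point size now scales with the quarter mile time (qsec)
p_q4 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg,
                 color = factor(cyl), shape = factor(cyl),
                 size = qsec))

p_q4
```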
{{Edet}}
Question 5
Create subplots by transmission (am).
{{Sdet}}
Possible Solution{{Esum}}
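A possible sketch using facet_wrap(), which is one of several ways to create subplots. The object name p_q5 is only used here for illustration.

```r
library(ggplot2)

# One panel per transmission type (am: 0 = automatic, 1 = manual)
p_q5 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg,
                 color = factor(cyl), shape = factor(cyl),
                 size = qsec)) +
  facet_wrap(~am)

p_q5
```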
{{Edet}}
Question 6
Model the trend using geom_smooth(). What is the default method used by geom_smooth()?
{{Sdet}}
Possible Solution{{Esum}}
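A possible sketch: the x and y mappings are moved into ggplot() so that both layers share them. When no method is given, geom_smooth() picks one based on the size of the data (loess for fewer than 1,000 observations, otherwise a GAM), and ggplot2 prints a message naming the method it used. The object name p_q6 is only used here for illustration.

```r
library(ggplot2)

# Shared x and y mappings go in ggplot(); the point-specific
# aesthetics stay in geom_point() so there is one trend line per panel
p_q6 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), shape = factor(cyl), size = qsec)) +
  geom_smooth() +
  facet_wrap(~am)

p_q6
```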
{{Edet}}
Your mission is to make a publishable figure using the iris data set.
Question 1
{{Sdet}}
Possible Solution{{Esum}}
library(ggplot2)
ggplot(iris)+
geom_point(aes(Petal.Length,Petal.Width,color=Species))
{{Edet}}
Question 2
Fix the axes so that the dimensions on the x-axis and the y-axis are equal. Both axes should
start at 0. Label the axis breaks every 0.5 units on the y-axis and every 1.0 units on the x-axis.
{{Sdet}}
Possible Solution{{Esum}}
ggplot(iris)+
geom_point(aes(Petal.Length,Petal.Width,color=Species))+
coord_fixed(ratio=1,ylim=c(0,2.75),xlim=c(0,7),expand=FALSE) +
scale_y_continuous(breaks=c(0,0.5,1,1.5,2,2.5)) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7))
{{Edet}}
Question 3
Change the color of the points by species to be color blind friendly, and change the legend title
to "Iris Species". Label the x and y axes to eliminate the variable names and add unit information.
{{Sdet}}
Possible Solution{{Esum}}
ggplot(iris)+
geom_point(aes(Petal.Length,Petal.Width,color=Species))+
coord_fixed(ratio=1,ylim=c(0,2.75),xlim=c(0,7),expand=FALSE) +
scale_y_continuous(breaks=c(0,0.5,1,1.5,2,2.5)) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7)) +
scale_color_brewer(palette = "Dark2",name="Iris Species") +
labs(x="Petal Length (cm)", y= "Petal Width (cm)")
{{Edet}}
Question 4
Play with the theme to make this a bit nicer. Change font style to "Times". Change all font sizes
to 12 pt font. Bold the legend title and the axes titles. Increase the size of the points on the plot
to 2. Bonus: fill the points with color and have a black outline around each point.
{{Sdet}}
Possible Solution{{Esum}}
ggplot(iris)+
geom_point(aes(Petal.Length,Petal.Width,fill=Species),size=2,shape=
coord_fixed(ratio=1,ylim=c(0,2.75),xlim=c(0,7),expand=FALSE) +
scale_y_continuous(breaks=c(0,0.5,1,1.5,2,2.5)) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7)) +
scale_fill_brewer(palette = "Dark2",name="Iris Species") +
labs(x="Petal Length (cm)", y= "Petal Width (cm)") +
theme_bw()+
theme(axis.text=element_text(family="Times",size=12),
axis.title=element_text(family="Times",face="bold",size=12),
legend.text=element_text(family="Times",size=12),
legend.title = (element_text(family="Times",face="bold",size=
)
{{Edet}}
Question 5
{{Sdet}}
Possible Solution{{Esum}}
{{Edet}}
Activate packages
library(ggplot2)
Load the mtcars dataset using the code below. This is a dataset that comes with R.
data(mtcars)
Question 1
How many cars in this dataset have 4, 6, or 8 cylinders (cyl)?
{{Sdet}}
Solution{{Esum}}
ggplot(mtcars,aes(x=factor(cyl)))+geom_bar(fill="ivory4")
{{Edet}}
Question 2
Does the number of cylinders (cyl) that a car has influence its quarter mile time (qsec)?
{{Sdet}}
Solution{{Esum}}
ggplot(mtcars,aes(x=factor(cyl),y=qsec))+stat_summary(fun=mean,position
{{Edet}}
Question 3
What is the distribution of fuel efficiency (mpg)? Use 7 bins for this exercise.
{{Sdet}}
Solution{{Esum}}
ggplot(mtcars,aes(x=mpg))+geom_histogram(fill="orange",bins=7)
{{Edet}}
Question 4
Can you create a box plot of horsepower (hp) as a function of the number of cylinders (cyl) a
car has?
{{Sdet}}
Solution{{Esum}}
ggplot(mtcars,aes(x=factor(cyl),y=hp))+geom_boxplot(colour="orangered"
{{Edet}}
library(pheatmap)
library(tidyverse)
Question 1
Could you import the hbr_uhr_normalized_counts.csv file into your workspace?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Question 2
Explore this gene expression dataset a bit. How many samples (columns) and genes (row
names) does this dataset have?
{{Sdet}}
Solution{{Esum}}
• HBR_1.bam
• HBR_2.bam
• HBR_3.bam
• UHR_1.bam
• UHR_2.bam
• UHR_3.bam
The samples with names starting with HBR are from the Human Brain Reference (HBR) and
those with names starting with UHR are from the Universal Human Reference (UHR). Remember
this for later questions.
hbr_uhr_normalized_counts
{{Edet}}
Question 3
Create a heatmap to visualize gene expression for this dataset.
{{Sdet}}
Solution{{Esum}}
pheatmap(hbr_uhr_normalized_counts,scale="row")
{{Edet}}
Question 4
Create a data frame called annotation_df that contains the sample and treatment group
information that we will add to the legend for this heatmap.
{{Sdet}}
Solution{{Esum}}
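One way to build this data frame is to type the sample names in directly, as sketched below. This is an assumption about the approach; the row names could equally be taken from the column names of the imported counts table.

```r
# Sample names become the row names; treatment is the annotation column
annotation_df <- data.frame(
  treatment = rep(c("HBR", "UHR"), each = 3),
  row.names = c("HBR_1.bam", "HBR_2.bam", "HBR_3.bam",
                "UHR_1.bam", "UHR_2.bam", "UHR_3.bam")
)

annotation_df
```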
## treatment
## HBR_1.bam HBR
## HBR_2.bam HBR
## HBR_3.bam HBR
## UHR_1.bam UHR
## UHR_2.bam UHR
## UHR_3.bam UHR
{{Edet}}
Question 5
Add the annotations for the legend and color the HBR samples orangered and the UHR
samples blue. Also, add a title to the heatmap.
{{Sdet}}
Solution{{Esum}}
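A rough sketch of the approach is shown below. So that it runs on its own, it uses a small matrix of simulated placeholder counts (demo_counts, a hypothetical stand-in) instead of hbr_uhr_normalized_counts; swap in the real data and your own title when working through the exercise.

```r
library(pheatmap)

sample_names <- c("HBR_1.bam", "HBR_2.bam", "HBR_3.bam",
                  "UHR_1.bam", "UHR_2.bam", "UHR_3.bam")

# Simulated placeholder counts (NOT the course data), 10 genes x 6 samples
set.seed(1)
demo_counts <- matrix(rpois(60, lambda = 50), nrow = 10,
                      dimnames = list(paste0("gene", 1:10), sample_names))

# Annotation data frame: row names must match the column names of the matrix
annotation_df <- data.frame(
  treatment = rep(c("HBR", "UHR"), each = 3),
  row.names = sample_names
)

# Named list: one named color vector per annotation column
annotation_colors <- list(treatment = c(HBR = "orangered", UHR = "blue"))

hm <- pheatmap(demo_counts, scale = "row",
               annotation_col = annotation_df,
               annotation_colors = annotation_colors,
               main = "HBR vs UHR (simulated placeholder counts)")
```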
{{Edet}}
Other Resources
1. Helpful search engine for R: rseek (https://rseek.org/)
Navigating DNAnexus
DNAnexus is a cloud-based platform for next-generation sequence analysis for which
CCR has a site license. For this class we are using the platform to provide a
uniform, stable, preinstalled interface for R training. This interface makes use of the
web version of RStudio and also integrates the course notes for the class in one
window.
The following instructions should be followed when using this resource during
formal class time. For using this resource outside class times see the document
entitled "DNAnexus Basics".
Once you select your name from the correct file, a window with the RStudio login page
will open.
Log in using the username "rstudio" and the password "rstudio". At this point you will be
presented with the RStudio main interface (shown below).
4. Splitting the window - If you wish to integrate the class notes into the same window as the
RStudio interface, click on the file "Hsplit.html" or "Vsplit.html" (found in the lower right
hand segment) and select the "View in Web Browser" option from the pop-up menu. This
will add the class notes to the top portion of the browser window. There is a horizontal or
vertical bar separating the class notes window from the RStudio interface, and this bar
can be dragged up and down or left and right, depending on which file you selected
(Hsplit.html vs Vsplit.html), to change the size of the window dedicated to each function.
Note
Your class project ID WILL NOT be the same as the project ID in the picture.
• Step 1: Under Analysis Settings, enter your name in the Execution Name field.
• Step 3: Select the green button in the upper right labeled "Start Analysis".
• Step 4: Navigate to the "Monitor" tab to check the status of the job.
The reason for waiting ~5 mins is to give the system time to get everything in place. If you click
too soon you will see an error message.
Don't PANIC, just wait a little longer and refresh the screen, until you are finally presented with
the RStudio login screen.