StatComp Script
Laura Vana-Gür
Contents
1 Part 1: Intro to R programming
  1.1 R & RStudio
  1.2 Help for using R
  1.3 Reproducibility
    1.3.1 Workflow of reproducible research
    1.3.2 Sweave and R Markdown
  1.4 Names and values
  1.5 Basic data types in R
  1.6 Basic data structures in R
    1.6.1 Vectors
    1.6.2 Lists
    1.6.3 Attributes
    1.6.4 S3 Atomic vectors
    1.6.5 Arrays and matrices
    1.6.6 Data frames and tibbles
    1.6.7 NULL
    1.6.8 Some more functions for dealing with different data structures
  1.7 Subsetting data structures in R
    1.7.1 Subsetting atomic vectors
    1.7.2 Indexing lists
    1.7.3 Indexing matrices and arrays
    1.7.4 Subsetting data frames and tibbles
    1.7.5 Subsetting arbitrary S3 objects
    1.7.6 More on standard subsetting operators
  1.8 Flow control
    1.8.1 Conditional control
    1.8.2 Repetitive control
  1.9 Functions
    1.9.1 Scope of variables
    1.9.2 Lazy evaluation
    1.9.3 Calling functions
    1.9.4 Function returns
  1.10 Efficient coding, debugging, benchmarking
    1.10.1 Readable code
    1.10.2 Efficient code
    1.10.3 Benchmarking and profiling R code
    1.10.4 Debugging
2 Part 2: Data exploration and manipulation using the tidyverse
  2.1 Data import
    2.1.1 read.table() and derivatives
    2.1.2 readr of the tidyverse
    2.1.3 data.table
    2.1.4 Read in excel files xlsx
    2.1.5 Other formats
    2.1.6 Import XML Data
    2.1.7 Importing from databases
  2.2 Introduction to tidy data with tidyr
    2.2.1 Tidy data
    2.2.2 Untidy data
    2.2.3 Tidying and reshaping data
  2.3 Basic data transformation with dplyr
    2.3.1 Data: nycflights13
    2.3.2 Filter rows with filter()
    2.3.3 Arrange rows with arrange()
    2.3.4 Select columns with select()
    2.3.5 Add new variables with mutate()
    2.3.6 Grouped summaries with summarise()
    2.3.7 Combining multiple operations with the pipe
    2.3.8 Useful summary functions
    2.3.9 Relational data: merging data sets
  2.4 More on data transformation
    2.4.1 case_when
    2.4.2 Factors and package forcats
    2.4.3 Dates and times with lubridate
  2.5 Data export
    4.4.4 Multi-page layouts
  4.5 Dynamic UI
  4.6 Validation
References
Disclaimer: This script is work in progress. It is compiled using materials developed over the years by the instructor as well as a generous collection of materials from colleagues Klaus Nordhausen, Alexandra Posekany, Matthias Templ, and Peter Filzmoser.
1 Part 1: Intro to R programming
1.1 R & RStudio
What is R
• R was developed by Ross Ihaka and Robert Gentleman (the “R & R’s” of the University of Auckland).
• Ihaka, R., Gentleman, R. (1996): R: A language for data analysis and graphics, Journal of Computational
and Graphical Statistics, 5, 299-314.
• R is an environment and language for data manipulation, calculation and graphical display.
• R is a GNU program. This means it is an open source program (like, e.g., Linux) and is distributed for free.
• R is used by more than 2 million users worldwide (according to R Consortium).
• R was originally used by the academic community but it is currently also used by companies like Google,
Pfizer, Microsoft, Bank of America . . .
R communities
• R has local communities worldwide for users to share ideas and learn.
• R events are organized all over the world bringing its users together:
– Conferences (e.g. useR!, WhyR?, eRum)
– R meetups: check out meetup.com
R and related languages
• R can be seen as an implementation or dialect of the S language, which was developed at the AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
• The commercial version of S is S-Plus.
• Most programs written in S run unaltered in R; however, there are differences.
• Code written in C, C++ or FORTRAN can be run by R too. This is especially useful for computationally intensive tasks.
How to get R
• R is available for most operating systems, e.g., Unix, Windows, Mac and Linux.
• R can be downloaded from the R homepage http://www.r-project.org
• The R homepage contains, besides the download links, also information about the R Project and the R Foundation, as well as a documentation section and links to projects related to R.
• R is available in 32-bit and 64-bit versions.
• R normally comes with 14 base packages and 15 recommended packages.
CRAN
• CRAN stands for Comprehensive R Archive Network
• CRAN is a server network that hosts the basic distribution and R add-on packages
• Central server: http://cran.r-project.org
• New R versions are usually released every few weeks.
R extension packages R can be easily extended with additional packages, most of which can be downloaded from CRAN too. Installing and updating those packages is also possible from within R itself.
In R, packages can be installed directly by typing install.packages() in the console. Once installed (locally), each time functions from a package should be employed, the package is loaded and attached to the workspace with library().
Currently more than 20000 packages are available on CRAN.
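For example (the package name here is just an arbitrary illustration):
install.packages("data.table") # download and install the package from CRAN (done once)
library("data.table")          # load and attach the package (done in every new session)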
• Packages for the analysis and comprehension of genomic data can be downloaded from the Bioconductor
pages (http://www.bioconductor.org).
• but R packages are available from many other sources like R-forge, Github, . . .
Other distributions of R
• As R is open source and published under a GNU license, one can also make one's own version of R and distribute it.
• For example, Microsoft has Microsoft R Open https://mran.microsoft.com/open
• But there are many others too. Here, however, we use the standard R version from CRAN.
• powerful tools for communicating results. R packages make it easy to produce html or pdf reports, or to create interactive websites.
• a well-developed, simple and effective programming language.
R is therefore not only a plain statistics software package, but it can be used as one. Most of the standard statistics and a lot of the latest methodology is available for R.
R screenshot [Figure: screenshot of the R console]
R console
• By default R has no graphical interface and the so-called Console has to be used instead.
• The Console or Command Line Window is the window of R in which one writes the commands and in which the (non-graphic) output will be shown.
• Commands can be entered after the prompt (>).
• In one row one normally types one command (Enter submits the command). If one wants to put more commands in one row, the commands have to be separated by a ";".
• When a command line starts with a "+" instead of ">", it means that the last submitted command was not completed and one should finish it now.
• All submitted commands of a session can be recalled with the up and down arrows ↑↓.
1.1.0.0.1 RStudio screenshot
The default view of RStudio is the following:
• The main window in RStudio contains five parts: one Menu and four Windows (“Panes”)
• From the drop-down menu RStudio and R can be controlled.
• Pane 1 (top left) - Files and Data: Editing R-Code and view of data sets
• Pane 2 (top right) - Workspace and History:
– Workspace lists all objects in the workspace
– History shows the complete code that was typed or executed in the console.
• Pane 3 (bottom right) - Files, Plots, Packages, Help:
– Files, to manage files
– Plots, to visualise and export graphics
– Packages, to manage extension packages
– Help, to access information and help pages for R functions and datasets
• Pane 4 (bottom left) - Console: Execution of R-Code
• This pane layout (and the pane contents) can be adapted using the options menu.
R: a short statistical example A more sophisticated example will demonstrate some features of R:
> options(digits = 4)
> # setting random seed to get a reproducible example
> set.seed(1)
> # creating data
> eps <- rnorm(100, 0, 0.5)
> eps[1:5]
[1] -0.31323 0.09182 -0.41781 0.79764 0.16475
> group <- factor(rep(1:3, c(30, 40, 30)),
+ labels = c("group 1", "group 2", "group 3"))
> x <- runif(100, 20, 30)
> y <- 3 * x + 4 * as.numeric(group) + eps
[Figure: plots of y against x and of y against group]
Build a linear model:
> # fitting a linear model and looking at it
> lm.fit <- lm(y ~ x + group)
> lm.fit
Call:
lm(formula = y ~ x + group)
Coefficients:
(Intercept) x groupgroup 2 groupgroup 3
3.77 3.01 4.07 7.98
> # more detailed output
> summary(lm.fit)
Call:
lm(formula = y ~ x + group)
Residuals:
Min 1Q Median 3Q Max
-1.1988 -0.2797 0.0198 0.2792 1.0893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7682 0.4288 8.79 6e-14 ***
x 3.0110 0.0169 178.19 <2e-16 ***
groupgroup 2 4.0666 0.1094 37.18 <2e-16 ***
groupgroup 3 7.9754 0.1201 66.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.1e+04 on 3 and 96 DF, p-value: <2e-16
[Figure: regression diagnostic plots for lm.fit (Residuals vs Fitted, Q-Q Residuals, standardized residuals, Cook's distance)]
1.2 Help for using R
• "Unofficial" tutorials and manuals, also in languages other than English, can be found on the R homepage under Documentation -> Other or on CRAN under Documentation -> Contributed. Very useful from there is the R reference card by Tom Short.
R tutorials for SAS, Stata or SPSS users Many new R users are familiar with SAS, Stata and/or SPSS. For them there are special charts giving an overview of how things they used to do in SAS, Stata or SPSS can be done in R, as well as extended manuals easing the move to R.
The following references might then be helpful:
• http://r4stats.com
• Muenchen, R.A. (2008): R for SAS and SPSS Users
• Muenchen, R.A. and Hilbe, J. (2010): R for Stata Users
Help within R
• There are three types of help available in R. They can be accessed via the menu or the command line. Here only the command line versions will be explained.
• Using an internet browser:
> help.start() opens a web browser with links to manuals and FAQs, and the help pages of all functions sorted by package, together with a search engine.
• The help command:
> help(command) shows the help page of command. A shorter version that does the same is > ?command. For a few special commands the help works only when the command is quoted, e.g. > help("if").
• The help.search command:
> help.search("keyword") searches all titles and aliases of the help files for keywords. A shorter version that does the same is > ??keyword. This is however not a full text search.
There are also three other functions useful to learn about functions.
• apropos: apropos("string") searches all functions that have the string in their function name
• demo: The demo function runs some available scripts to demonstrate their usage. To see which topics
have a demo script submit > demo()
• example: > example(topic) runs all example codes from the help files that belong to the topic topic
or use the function topic.
Also, in case you remember only the beginning of a function name or are just lazy, R has an auto-completion feature. If you start typing a command and hit Tab, R will complete the command if there are no alternatives, or will give you all the alternatives.
1.3 Reproducibility
Reproducibility is one of the key concepts of scientific research. It has several meanings:
• the same researcher or another researcher should be able to replicate an entire experiment or study and
yield results with a high agreement with the previous results.
• given the data from an experiment also the analysis should be reproducible and anyone analyzing it
should come to the same conclusion.
The reproducibility crisis refers to a methodological crisis in science: in recent years scientists have found that the results of many scientific experiments are difficult or impossible to replicate, even by the same researcher.
Sciences especially hit by this crisis are psychology and medicine.
For example, in the so-called Reproducibility Project, 270 researchers from around the world collaborated to replicate 100 empirical studies from three top psychology journals. Fewer than half of the attempted replications were successful.
In a Nature paper, Begley, C. G. & Ellis, L. M. (2012) showed that of 53 medical papers on cancer research, 47 were not reproducible.
As a reaction to the reproducibility crisis the concept of reproducible research was introduced. The aim
there is that when research is published it also releases:
• the original data
• detailed description of the computational environment
• the complete code of the analysis
Hence everyone can reanalyze the data step by step.
The goal is even that the data and the code are part of the published report. It is nowadays easy to include R code in documents so that, whenever the document is compiled, the R code is executed and the computed results are included in the report.
1.3.1 Workflow of reproducible research
3. Statistical analysis
4. Present the results
• In the form of books, articles, slides (the preferred format is hence pdf)
• On the web (the preferred format is hence html)
During a project and all its steps it is assumed that everything is well documented; many files belong to the project.
In general it is preferred that all files are text files and human readable.
• Data files should for example end in .txt or .csv and not, for example, in .xlsx or .sav.
• Analysis files should for example be script files written in an editor and, for R users, end in .R.
• Files producing the presentation should be .tex or .md (or .Rmd).
Using human-readable text formats makes it least likely that the files depend on the version of the software with which they were created, and any user, even one not knowing for example R or LaTeX, will be able to open the files and get an idea about the content.
Costs and benefits of reproducible research For a person producing reproducible research this comes
at a cost:
• The knowledge how to do reproducible research needs to be acquired.
• Additional software needs to be learned.
It is however assumed that in the long run there are many benefits:
• Better working habits. Individual steps are often better planned, and code is more thorough and better commented compared to the case when it is assumed that no one is looking.
• It helps for teamwork as it is easier for each collaborator to see what is going on.
• Changes are easier as your workflow is dynamic and changes will directly be considered in later parts.
For example when you get more data or so on.
• Higher research impact - reproducible research usually gets more attention as it is considered more trustworthy.
Commenting All files, maybe except the data files, should be well documented. Comments should be understandable for another user and not only for the author.
• In R everything that appears after a hash # is a comment.
• In LaTeX everything after the percent sign % is treated as a comment.
• In html/markdown comments are placed inside <!-- -->.
Data files could have their own files “commenting” them.
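For reference, the three comment styles side by side:
# an R comment
% a LaTeX comment
<!-- an html/markdown comment -->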
Literate programming A key concept of reproducible research is literate programming, which means that a source file containing both code and text can be "read" by a computer and woven (weave) or knitted (knit) together into a formatted presentation document.
In this course we will focus on how R code can be combined with other formatting languages to dynamically produce pdf or html output.
Sweave Sweave is part of every R installation and details can be obtained using
help("Sweave", package="utils")
To use Sweave, the user needs to know R and LaTeX. The basic idea is that, in a Sweave file (ending .Rnw), R code chunks are embedded in the tex document using the so-called noweb style. The file is then first run through R using
Sweave("file.Rnw")
which executes all the R code chunks, replaces the code with its output and creates the figures, yielding the corresponding file.tex which can then be compiled into a pdf.
An R code chunk in a Sweave file has the form
<<label=Name, options>>=
Some R code
@
Nothing else should be in the lines that start (<<...>>=) and end (@) the chunk.
To include short R output within LaTeX one can use \Sexpr{R_code}.
The most important code chunk options are:
• eval = TRUE: Should the code chunk be evaluated?
• echo = TRUE: Should the code be included in the output or only the results?
• results = hide: the chunk is evaluated but any output suppressed.
• fig = TRUE: If the chunk should make a figure and include it. By default it will make a eps and pdf
figure.
• eps = FALSE: If eps figure should be made.
• width = 6.5: width of the figure in inch.
• height = 8.5: height of the figure in inch.
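For illustration, a Sweave chunk producing a figure with some of these options might look like this (the label and the plotting code are arbitrary):
<<myfigure, echo=FALSE, fig=TRUE, eps=FALSE, width=6.5, height=8.5>>=
plot(rnorm(100))
@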
To extract all the code from code chunks (not from \Sexpr) one can use the R function
Stangle("file.Rnw")
This produces the file file.R, which can then, for example, be sourced.
R Markdown While Sweave is mainly for combining R and LaTeX (HTML is, for example, also possible), there are meanwhile more modern and flexible alternatives such as R Markdown.
R Markdown is a relatively easy to learn markup language; compiling it usually first calls a program called Pandoc, which converts the file into a weaveable or knittable file appropriate for the corresponding target output (for example pdf or html).
All R Markdown files (file extension .Rmd) start with a so-called YAML header which tells Pandoc some key information.
A header has usually the form:
---
title: "A Pdf report"
author: "Your name"
date: "06.09.2016"
output: pdf_document
toc: true
---
This header for example defines that the output will be a pdf and toc: true says the document should have
a table of contents.
The easiest way to get a basic header is using in RStudio File -> New File -> R Markdown . . . and
choosing there the desired type.
There are many different output types and the different types have different options.
The options are for example listed in the R markdown reference guide in the sections about Pandoc.
For this course we’ll only look at the basic options to create:
• html documents
• pdf documents
• beamer slides
All YAML headers should contain title, author, date and output specifications.
Subarguments are then usually indented a bit.
An example of a html YAML header:
---
title: "A webpage"
author: "It was me"
date: "01.01.2001"
output:
html_document:
theme: united
highlight: zenburn
---
Example YAML header for beamer slides (with bibtex style bibliography saved in a .bib file in the same
folder as the .Rmd)
---
title: "Statistical Programming"
subtitle: "Reproducible Research"
author: "Max Mustermann"
date: "Autumn 2016"
output:
beamer_presentation:
incremental: true
theme: "Berkeley"
colortheme: "crane"
fonttheme: "structurebold"
bibliography: bibliography.bib
---
Some hints for using RStudio and R Markdown When using RStudio to write R Markdown files, one can use Ctrl + Alt + I to insert an R code chunk.
When pointing the cursor after the r and pressing Tab, RStudio opens a dropdown menu with code chunk options.
Useful R code chunk options Some useful code chunk options are:
• eval: TRUE/FALSE. Should the code be executed or not?
• include: TRUE/FALSE. The code will be executed but should it be included in the final document?
• collapse: TRUE/FALSE. Should the source and output blocks be collapsed into a single block or not?
• echo: TRUE/FALSE. Should the code be displayed in the final document or not?
• results:
1. “markup”: the default
2. “hide”: the results will not be shown in the output
3. “asis”: the results of the R code will not be reformatted. Useful for example when the R function
returns html or LATEX.
• message/warning/error: TRUE/FALSE. Should messages/warnings/errors produced by the code be
included in the final document?
• comment: default is "##" but can be any character string. The string will be added in front of each
line of output in the final document.
• prompt: FALSE/TRUE. Should a > be added in front of each line of code in the final document?
• highlight: TRUE/FALSE. Should source code be highlighted in the final document?
• tidy: FALSE/TRUE. Should code chunks be tidied for display using the tidy_source function from
the formatR package?
• dev: the R function name of device to be used to record plots made in the chunk. Default is “png”.
• dev.args: list of arguments passed on to the device in R.
• fig.align: How to align figures in the final graph. Options are "left", "center" and "right"
• fig.cap: character string to be used as the Figure caption in R.
• fig.width/fig.height: width and height in inches for the plots created in R.
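For example, a chunk combining several of these options could look as follows (the chunk label and the plotting code are arbitrary placeholders):
```{r myplot, echo=FALSE, results='hide', fig.align='center', fig.width=6, fig.height=4}
plot(rnorm(100))
```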
It is possible to change R code chunk options globally, but that can be rather advanced. For example, I do not like the default for comment, so I usually add the following chunk at the beginning of the document:
```{r include=FALSE}
knitr::opts_chunk$set(comment = NA)
```
While here we discuss only R as the computing language, markdown can be used with other programming languages as well.
The code chunks are then quite similar: after the { the r is replaced with the name of the other language, and some options are different. (Alternatively, one can use the option engine to specify which programming language should execute the code.)
I personally often use the following trick to get a verbatim kind of environment:
```{bash eval=FALSE}
Block in verbatim
```
Tables in R Markdown Making basic tables in R Markdown is quite simple - a vertical line specifies a new column and the column names are "underlined" using dashes. Each table row is then simply a new line. Note that the columns of the different rows do not have to be aligned in the code, as long as each row contains the same number of vertical lines.
So the code
Header 1 | Header 2 | Header 3
-------- | -------- | --------
row 1 | row 1 | row 1
row 2 | row 2 | row 2
yields the corresponding table with three columns and two rows.
The alignment of a column can be set with colons in the dashed line (":---" for left, ":--:" for centered, "---:" for right). So
Header 1 | Header 2 | Header 3
:------- | :------: | -------:
left | centered | right
left | centered | right
yields the same kind of table with left-aligned, centered and right-aligned columns.
Naturally, in most cases the content for tables is created in R and the R results should be put straight into
the table.
Several functions and packages are useful here, to name a few:
• kable: is the original function from knitr. It can make LaTeX or html tables based on R data.frames and matrices.
• xtable (Dahl et al. 2019): a package popular for making LaTeX or html tables. The package provides extreme flexibility over the tables.
• texreg: a package that automatically provides high quality tables for LaTeX or html output, however more or less only for R objects of specific classes.
In this context the code chunk option results = "asis" is important.
For tables using the kable function we basically just have to call kable, in an R code chunk, on the output we want the table for. To fine-tune the table, further arguments can be passed to the function. See ?kable for all options.
For example:
```{r echo=FALSE}
library("MASS")  # here is the data
library("knitr") # for the kable function
kable(head(crabs))
```
yields the following table:
sp sex index FL RW CL CW BD
B M 1 8.1 6.7 16.1 19.0 7.0
B M 2 8.8 7.7 18.1 20.8 7.4
B M 3 9.2 7.8 19.0 22.4 7.7
B M 4 9.6 7.9 20.1 23.1 8.2
B M 5 9.8 8.0 20.3 23.0 8.2
B M 6 10.8 9.0 23.0 26.5 9.8
More sophisticated tables can be achieved with the xtable package. It works especially well with objects
from classes it “knows” about.
For a detailed list run methods(xtable) in R.
The tables are then made with the commands
library("xtable")
MyTable <- xtable(object, options)
print(MyTable, more options)
By default xtable assumes that the target output is created using LaTeX, so in print a frequent argument is type="html".
Consider the code:
```{r results="asis", echo=FALSE}
fit1 <- lm(FL ~ sp + sex + RW, data=crabs)
library(xtable)
fit1table <- xtable(summary(fit1))
print(fit1table, comment=FALSE, floating = FALSE)
```
gives the output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5964 0.3426 -7.58 0.0000
spO 0.9439 0.1265 7.46 0.0000
sexM 2.2507 0.1255 17.93 0.0000
RW 1.3017 0.0258 50.52 0.0000
Tables using texreg Making tables with xtable is easiest when there is only one model object and when xtable knows its class.
The texreg package (Leifeld 2013) knows more classes and also allows tables combining multiple objects.
The package has mainly two functions for making tables:
• texreg for making tables for documents using LaTeX.
• htmlreg for documents needing tables in html format.
The arguments of both functions are similar but of course differ a bit.
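The table shown below for fit1 can be produced with a chunk along the following lines (a sketch; the exact call used for the original table is not shown in this script):
```{r results="asis", echo=FALSE}
library("texreg")
texreg(fit1) # use htmlreg(fit1) when the target output is html
```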
             Model 1
(Intercept)  -2.60***
              (0.34)
spO           0.94***
              (0.13)
sexM          2.25***
              (0.13)
RW            1.30***
              (0.03)
R2            0.94
Adj. R2       0.94
Num. obs.   200
*** p < 0.001; ** p < 0.01; * p < 0.05
Tables of non-standard objects All three functions presented here for making tables directly from R work on specific objects.
However, not all object classes are known, and sometimes one wants to get a table based on some other object.
Often xtable is used for that. The idea is that, when xtable encounters a matrix or data.frame, the rows become the rows of the table and the columns become the columns of the table.
Hence a strategy to make tables for unsupported class objects (see the sketch after this list) is to
1. find and extract the relevant information from that class object,
2. convert this information into a matrix or data.frame,
3. use xtable on this matrix or data.frame.
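As a minimal sketch of this strategy (the class htest is just an example of a class unknown to xtable; the object and column names are made up):
```{r results="asis", echo=FALSE}
tt <- t.test(FL ~ sex, data = crabs)          # 1. an object of class "htest"
info <- data.frame(statistic = tt$statistic,  # 2. extract the relevant pieces
                   df = tt$parameter,
                   p.value = tt$p.value)
print(xtable(info), comment = FALSE)          # 3. turn the data.frame into a table
```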
Including figures in R Markdown Similar to tables, figures also often need to be included in documents, and again we will first consider the case of figures which are entered "manually" rather than dynamically created in R code chunks. Assume we have a file Rlogo.png in the same folder as our .Rmd file and would like to add it.
like to add that.
The code is then simply

Figure 1: R
This is really simple - however, so simple that there is no control over the size or positioning of the figure. The general syntax is ![TEXT](PATH TO FILE): TEXT is optional, and PATH TO FILE specifies where to find the file and its name. Most common file types like pdf, jpeg and png are acceptable.
Hence for more sophisticated figures to be added one has to use direct html or latex markup depending on
the target document format.
Including figures created in R Markdown This is actually already covered - everything is specified in the code chunk options. It does not matter here whether we use basic plotting functions, lattice figures or ggplot figures.
For example
```{r scatterplot, results='hide', echo=FALSE}
with(crabs, plot(FL, RW, col=as.numeric(sp)+1))
legend('topright', levels(crabs$sp), pch=1, col=1:2)
```
[Figure: scatter plot of RW against FL for the crabs data, with points colored by species (B, O)]
To see the regression diagnostic plots of the previous regression model fit1 we have:
```{r fit1plot, results='hide', echo=FALSE, fig.width=7, fig.height=5}
par(mfrow=c(2,2))
plot(fit1)
```
[Figure: regression diagnostic plots for fit1 (Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage)]
Equations in R Markdown As one of the last building blocks of statistical documents we also have a look at equations. In R Markdown they are basically the same as in LaTeX. Inline equations are specified within $equation$ and displayed equations as
$$
equation
$$
Note that for inline equations no spaces may appear after the opening $ or before the closing $.
The aim of the course is not to teach LaTeX, so in the following we just give a few examples to demonstrate how it works.
(Each bullet shows the LaTeX source; in the compiled document the rendered formula appears next to it.)
• $\sum_{i=1}^n x_i^2$
• $\prod_{i=1}^n \log(x_i)$
• $\int_a^b \alpha y dy$
• $\frac{\sin(x)}{3\beta}$
• $\Sigma^\top$
• $\Omega^{-1/2}$
• $p \times q$
• $j=1,\ldots,n$
• $f(\cdot)$
• $n \ll p$
• $\tilde{x}$
• $\mathcal{A,B,C}$
• $\mathbb{R,N,Z}$
$$
\left(\begin{array}{cc}
0.8944272 & 0.4472136\\
-0.4472136 & -0.8944272
\end{array}\right)^\top
\left(\begin{array}{cc}
10 & 0\\
0 & 5
\end{array}\right)
$$
$$
\alpha = \sqrt{\frac{1}{\pi-3}} \quad \mbox{and} \quad
\sigma^2 = \frac{1}{n-1}\left(\sum_{i=1}^n (x_i - \bar x)^2\right)
$$
Writing the document Now that we have all the bricks to weave the different parts into our document, we still need to learn how to actually "write" in R Markdown. This is pretty simple!
Normal text is just written as normal text. To force a line break, the line has to end with two spaces; a new paragraph is started by a blank line.
And otherwise we have the following syntax.
And otherwise we have the following syntax.
Syntax Outcome
plain text plain text
*italics* italics
**bold** bold
~~strikethrough~~ strikethrough
[linkToTU](www.tuwien.ac.at) linkToTU
# Header 1 # Header 1
## Header 2 ## Header 2
### Header 3 ### Header 3
#### Header 4 #### Header 4
##### Header 5 ##### Header 5
*** horizontal rule or slide break
> block quote > block quote
Verbatim text is embedded in single backticks when inline and in three backticks for bigger blocks.
To make an unordered list use:
* item 1
* item 2
+ subitem 1
+ subitem 2
• item 1
• item 2
– subitem 1
– subitem 2
For subitems, the sublist marker should be preceded by four spaces.
To make an ordered list use:
1. item 1
2. item 2
a. subitem 1
b. subitem 2
1. item 1
2. item 2
a. subitem 1
b. subitem 2
Note on session info In statistical analyses, results might depend on the software versions used. Therefore, from a reproducibility point of view, it is important to give all the relevant details.
For example, R and all packages should be properly cited and their version numbers reported.
To find out what you used for an analysis in R you can use the sessionInfo() function. Here is the session info for the session used to compile this script:
sessionInfo()
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
[1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.47
[5] htmltools_0.5.8.1 rmarkdown_2.27 tinytex_0.51 cli_3.6.3
[9] compiler_4.4.0 httr_1.4.7 highr_0.11 rstudioapi_0.16.0
[13] tools_4.4.0 evaluate_1.0.0 yaml_2.3.10 rlang_1.1.5
Further reading To learn more about these topics there is a lot of free material available on the web.
• For LaTeX, for example, "The Not So Short Introduction to LaTeX" is quite popular.
• For Sweave a good starting place is The Sweave Homepage.
• Christopher Gandrud: Reproducible Research with R and R Studio, Chapman and Hall/CRC.
• R Markdown Cheatsheet in TUWEL.
and x now appears in the Global Environment pane in RStudio. It can then be used in subsequent operations, e.g.,
x + 1
[1] 2
It's easy to read the assignment command as: "create an object named x, containing the value 1". More proficient R users should note that this is a simplification and is not representative of what R is actually doing behind the scenes. It's more accurate to say that this code is doing two things:
• It’s creating an object with value 1.
• And it’s binding that object to a name, x.
In other words, the object, or value, doesn’t have a name; it’s actually the name that has a value. This subtle
distinction is important to understand how objects are stored in R. We will return to this later in the course.
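For illustration (this assumes the lobstr package is installed; the printed addresses are arbitrary and change between sessions):
x <- 1
y <- x
lobstr::obj_addr(x) # address of the object that x is bound to
lobstr::obj_addr(y) # the same address: both names are bound to one object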
R has strict rules about what constitutes a valid name. A syntactic name must consist of letters, digits, .
and _ but can’t begin with _ or a digit. Additionally, you can’t use any of the reserved words like TRUE,
NULL, if, and function (see the complete list in ?Reserved). A name that doesn’t follow these rules is a
non-syntactic name; if you try to use them, you’ll get an error:
_abc <- 1
Error: unexpected input in "_"
if <- 10
Error: unexpected assignment in "if <-"
You can override these rules and use any name by surrounding it with backticks:
`_abc` <- 1
`_abc`
[1] 1
`if` <- 10
`if`
[1] 10
Such weird names typically appear when you load data that has been created outside of R. For some data
structures (see e.g., data frames later) the non-syntactic names are converted to syntactic names by the
function make.names():
make.names(c("_abc", "", "a-b", ".123e1"))
[1] "X_abc"   "X"       "a.b"     "X.123e1"
> k <- 2 + 4i
> typeof(k)
[1] "complex"
Missing values and other special values in R Missing values in R are specified as NA which is
internally a logical vector of length 1.
typeof(NA)
[1] "logical"
To create NAs of a specific type one can use NA_real_, NA_integer_ or NA_character_.
typeof(NA_character_)
[1] "character"
Inf is infinity. You can have either positive or negative infinity.
> 1 / 0
[1] Inf
> -1 / 0
[1] -Inf
> is.infinite(-Inf)
[1] TRUE
> is.finite(-Inf)
[1] FALSE
     Homogeneous   Heterogeneous
1D   vector        list
2D   matrix        data frame
nD   array
Scalars, as mentioned earlier, are treated as vectors of length 1. Almost all other types of objects in R are built upon these five structures.
To understand the structure of an object in R the best is to use
str(object)
1.6.1 Vectors
The most basic structure is a vector. They come as two different flavors:
• atomic vector
• list
Every vector has three properties:
• of what type it is (typeof)
• how long it is (length)
• which attributes it has (attributes)
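For example:
x <- 1:4
typeof(x)
[1] "integer"
length(x)
[1] 4
attributes(x)
NULL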
The function c() The most direct way to create a vector is the c function, in which all values can be entered. The values are then concatenated.
object.x <- c(value1, value2, ...)
A single value is also treated like a vector and can simply be assigned to an object directly:
object.x <- value
LogVector <- c(TRUE, FALSE, FALSE, TRUE)
LogVector
[1] TRUE FALSE FALSE TRUE
IntVector <- 1:4
IntVector
[1] 1 2 3 4
DouVector <- c(1, 2, 3, 4)
DouVector
[1] 1 2 3 4
ChaVector <- c("a", "b", "c", "d")
ChaVector
[1] "a" "b" "c" "d"
Checking for the types of a vector Aside from typeof, to check for a specific type the “is”-functions
can be used:
• is.character
• is.double
• is.integer
• is.logical
• is.atomic
typeof(IntVector)
[1] "integer"
typeof(DouVector)
[1] "double"
is.atomic(IntVector)
[1] TRUE
is.character(IntVector)
[1] FALSE
is.double(IntVector)
[1] FALSE
is.integer(IntVector)
[1] TRUE
is.logical(IntVector)
[1] FALSE
is.numeric(ChaVector)
[1] FALSE
Sequences and replications To create a vector that has a certain start and ending point and is filled
with points that have equal steps between them, the function seq can be used.
x <- seq(from = 0, to = 1, by = 0.2)
x
[1] 0.0 0.2 0.4 0.6 0.8 1.0
y <- seq(length = 6, from = 0, to = 1)
y
[1] 0.0 0.2 0.4 0.6 0.8 1.0
z <- 1:5
z
[1] 1 2 3 4 5
The function rep can be used to replicate objects in several ways. For details see the help of the function.
Here are some examples
x <- rep(c(2, 1), 3)
x
[1] 2 1 2 1 2 1
y <- rep(c(2, 1), each = 3)
y
[1] 2 2 2 1 1 1
z <- rep(c(2, 1), c(3, 5))
z
[1] 2 2 2 1 1 1 1 1
The sample function allows us to obtain a random sample of a specified size from elements given in a vector. The following code corresponds to eight rolls of a 6-sided die:
sample(1:6, size = 8, replace = TRUE)
[1] 1 1 3 1 1 6 6 6
Logical operators in R Logical vectors are usually created by using logical expressions. The logical vector is of the same length as the original vector and gives elementwise the result of evaluating the expression.
The comparison operators in R are:
Operator   Meaning
==         equal to
!=         not equal to
<          less than
>          greater than
>=         greater than or equal to
<=         less than or equal to
Two logical expressions L1 and L2 can be combined using:
L1 & L2 for L1 and L2
L1 | L2 for L1 or L2
!L1 for the negation of L1
xor(L1, L2) for the elementwise exclusive OR (TRUE if exactly one of L1 and L2 is TRUE)
Logical vectors are typically created in the following way:
age <- c(42, 45, 67, 55, 37, 73, 77)
older50 <- age > 50
older50
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE
When one wants to enter a logical vector TRUE can be abbreviated with T and FALSE with F, this is however
not recommended.
Other functions useful for working with logical vectors are all() (are all values in the vector true?) and
any() (is at least one of the values true?).
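For example, using the age vector from above:
age > 50 & age < 70
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
all(older50)
[1] FALSE
any(older50)
[1] TRUE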
Vector arithmetic With numeric vectors one normally wants to perform calculations.
When using arithmetic expressions they are usually applied to each element of the vector.
If an expression involves two or more vectors, these vectors do not have to be of the same length; the shorter vectors will be recycled until they have as many elements as the longest vector.
Important expressions here are:
+, -, *, /, ^, log, sin, cos, tan, sqrt,
min, max, range, mean, var
Here is a short example of vector arithmetic and the recycling of vectors:
x <- 1:4
x
[1] 1 2 3 4
y <- rep(c(1,2), c(2,4))
y
[1] 1 1 2 2 2 2
x ^ 2
[1] 1 4 9 16
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 3 5 6 3 4
Basic operations on character vectors Taking substrings using substr (alternatively substring can be used, but it has slightly different arguments):
cols <- c("red", "blue", "magenta", "yellow")
substr(cols, start = 1, stop = 3)
[1] "red" "blu" "mag" "yel"
Combining character vectors elementwise with paste:
paste(cols, "flowers", sep = "_")
[1] "red_flowers"     "blue_flowers"    "magenta_flowers" "yellow_flowers"
paste(cols, "flowers", collapse = ", ")
[1] "red flowers, blue flowers, magenta flowers, yellow flowers"
Coercion
• As all elements in an atomic vector must be of the same type it is of course of interest what happens if
they aren’t.
• In that case the different elements will be coerced to the most flexible type.
• The most flexible type is usually character. But for example a logical vector can be coerced to an
integer or double vector where TRUE becomes 1 and FALSE a 0.
• Coercion order: logical -> integer -> double -> (complex) -> character
v1 <- c(1, 2L)
typeof(v1)
[1] "double"
v2 <- c(v1, "a")
typeof(v2)
[1] "character"
v3 <- c(2L, TRUE, TRUE, FALSE)
typeof(v3)
[1] "integer"
• Coercion often happens automatically. Most mathematical functions try to coerce vectors to numeric
vectors. And on the other hand, logical operators try to coerce to a logical vector.
• In most cases if coercion does not work, a warning or error message is returned.
• In programming to avoid coercion to a possibly wrong type the coercion is forced using the “as”-
functions like as.character, as.double, as.numeric,. . .
as.integer(LogVector)
[1] 1 0 0 1
ChaVector2 <- c("0", "1", "7")
as.integer(ChaVector2)
[1] 0 1 7
ChaVector3 <- c("0", "1", "7", "b")
as.integer(ChaVector3)
Warning: NAs introduced by coercion
[1]  0  1  7 NA
1.6.2 Lists
Lists are different from atomic vectors as their elements do not have to be of the same type.
To construct a list one usually uses list.
List1 <- list(INT = 1L:3L,
LOG = c(FALSE, TRUE),
DOU = DouVector,
CHA = "z")
str(List1)
List of 4
$ INT: int [1:3] 1 2 3
$ LOG: logi [1:2] FALSE TRUE
$ DOU: num [1:4] 1 2 3 4
$ CHA: chr "z"
The number of components of a list can be obtained using length.
length(List1)
[1] 4
An empty list of a given length can be created with vector("list", length); its components are then NULL:
List2 <- vector("list", 2)
List2
[[1]]
NULL
[[2]]
NULL
Several lists can be combined into one list using c. If a combination of lists and atomic vectors is given to c, the function will first coerce each atomic vector to a list before combining them.
List3 <- c(List1, list(new = 7:10, new2 = c("G", "H")))
str(List3)
List of 6
$ INT : int [1:3] 1 2 3
$ LOG : logi [1:2] FALSE TRUE
$ DOU : num [1:4] 1 2 3 4
$ CHA : chr "z"
$ new : int [1:4] 7 8 9 10
$ new2: chr [1:2] "G" "H"
List4 <- list(a = 1, b = 2)
Vec1 <- 3:4
Vec2 <- c(5.0, 6.0)
List5 <- c(List4, Vec1, Vec2)
List6 <- list(List4, Vec1, Vec2)
str(List5)
List of 6
$ a: num 1
$ b: num 2
$ : int 3
$ : int 4
$ : num 5
$ : num 6
str(List6)
List of 3
$ :List of 2
..$ a: num 1
..$ b: num 2
$ : int [1:2] 3 4
$ : num [1:2] 5 6
More on lists
• The typeof() of a list is "list".
• is.list can be used to check if an object is a list.
• as.list can be used to coerce to a list.
• To convert a list to an atomic vector unlist can be used. It uses the same coercion rules as c.
• Many statistical functions that return more complicated data structures actually return lists.
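For example, applying unlist to List4 from above yields a named numeric vector:
unlist(List4)
a b
1 2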
1.6.3 Attributes
• All objects in R can have additional attributes to store metadata about the object. The number of
attributes is basically not limited. And it can be thought of as a named list with unique component
names.
• Individual attributes can be accessed using the function attr or all at once using the function
attributes.
Examples:
VecX <- 1:5
attr(VecX, "attribute1") <- "I'm a vector"
attr(VecX, "attribute2") <- mean(VecX)
attr(VecX, "attribute1")
[1] "I'm a vector"
attr(VecX, "attribute2")
[1] 3
attributes(VecX)
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
typeof(attributes(VecX))
[1] "list"
In R, three attributes play a special role; we will come back to them later in more detail and just mention them briefly now:
• names: the names attribute is a character vector giving each element a name. This will be discussed soon.
• dimension: the dim (for dimension) attribute turns vectors into matrices and arrays.
• class: the class attribute is very important in the context of S3 classes discussed later.
Depending on the function used, attributes might or might not get lost. The three special attributes mentioned above are usually not lost; many other attributes, however, often are.
attributes(5 * VecX - 7)
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
attributes(sum(VecX))
NULL
attributes(mean(VecX))
NULL
The names attribute There are three different ways to name a vector:
1. Directly at creation:
Nvec1 <- c(a = 1, b = 2, c = 3)
Nvec1
a b c
1 2 3
Properties of names:
• Names do not have to be unique
• Not all elements need names. If no element has a name the names attribute value is NULL. If some
elements have names but others not, then missing elements get an empty string as name.
• Names are usually the most useful if all elements have a name and if the names are all unique.
• Name attributes can be removed by assigning names(object) <- NULL.
names(c(a = 1, 2, 3))
[1] "a" "" ""
names(1:3)
NULL
1.6.4.1 Factors Categorical data is an important data type in statistics - in R it is usually represented by factors.
A factor in R is basically an integer vector with two attributes:
1. The class attribute, which has the value factor and which makes it behave differently compared to a standard integer vector.
2. The levels attribute, which specifies the set of admissible values.
A factor is usually created with the function factor.
Fac1 <- factor(c("green", "green", "blue"))
Fac1
[1] green green blue
Levels: blue green
class(Fac1)
[1] "factor"
levels(Fac1)
[1] "blue"  "green"
One advantage of factors over character vectors is that tables also show admissible values that do not occur in the data:
buyCha <- rep("bought", 3)
buyFac <- factor(buyCha, levels = c("bought", "didn't buy"))
table(buyCha)
buyCha
bought
3
table(buyFac)
buyFac
bought didn't buy
3 0
1.6.4.2 Dates and times in R Dates and times are among the most complicated types to work with on computers.
• The standard calendar is complicated (leap years, months of different lengths, historically different calendars - Julian vs. Gregorian).
• Times depend on an often unstated time zone (add daylight saving time :-() and some years have leap seconds to keep the clocks consistent with the rotation of the earth!
R can flexibly handle dates and times and has different classes for them with different complexity levels.
Most classes offer then also arithmetic functions and other tools to work with date and time objects. A good
overview over the different classes is given in the Helpdesk section of the R News 4(1).
• The built-in as.Date() function handles dates (without times).
• The contributed library chron handles dates and times, but does not control for time zones.
• The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
• The various as. functions can be used for converting strings or among the different date types when
necessary.
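A brief illustration of two of these classes (the date and time values are arbitrary):
as.Date("2016-09-06")
[1] "2016-09-06"
as.POSIXct("2016-09-06 14:30:00", tz = "UTC")
[1] "2016-09-06 14:30:00 UTC"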
The Date class Objects of class Date are obtained by the as.Date() function; they can represent dates from 1 AD onwards and are normally created by transforming a character vector which contains information about the day, month and year.
Note that the behavior of the function when some of the information is missing depends on your system! Also, whether some formats can be read depends on your locale.
To get the current date:
Sys.Date()
[1] "2025-03-27"
as.Date() function The as.Date() function allows for a variety of formats through the format= argument.
Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
%C Century
[1] "1987-06-02"
as.Date(x[2], format = "%m/%d/%y")
[1] "1998-06-21"
as.Date(x[3], format = "%d%b%Y")
[1] "1960-05-31"
lct <- Sys.getlocale("LC_TIME")
lct # my locale is in English
[1] "en_US.UTF-8"
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # different for different OSs
[1] "de_DE.UTF-8"
as.Date(x[3], format = "%d%b%Y") # the English month abbreviation doesn't work in a German locale
[1] NA
Sys.setlocale("LC_TIME", lct) # reset to English
[1] "en_US.UTF-8"
Internal storage of dates and times Except for the POSIXlt class, dates are stored internally as the
number of days or seconds from some reference date. Thus dates in R will generally have a numeric mode,
and the class function can be used to find the way they are actually being stored. The POSIXlt class stores
date/time values as a list of components (hour, min, sec, mon, etc.) making it easy to extract these parts.
E.g., internally, Date objects are stored as the number of days since January 1, 1970 (the Unix epoch), using
negative numbers for earlier dates.
y <- as.Date(x[3], format = "%d%b%Y")
y
[1] "1960-05-31"
as.numeric(y)
[1] -3502
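For comparison, a POSIXlt object stores its components in a list which can be accessed directly (note the conventions: year counts years since 1900 and mon is 0-based):
lt <- as.POSIXlt("1960-05-31", tz = "UTC")
lt$year
[1] 60
lt$mon
[1] 4
lt$mday
[1] 31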
Example: Finnish social security number Every Finn and foreigner working in Finland gets a Finnish
social security number. The number is used as a personal identifier and therefore unique for each individual.
The structure of this number is: DDMMYY C ZZZQ where
• DDMMYY gives the date of birth.
• C specifies the century of birth. + = 19th Cent., - = 20th Cent. and A = 21st Cent.
• ZZZ is the personal identification number. It is even for females and odd for males
• Q is a control number or letter to see if the total number is correct.
We know now that the Finnish social security number contains a lot of useful information. We will discuss
now, how to extract the birthday and sex of an individual from their id using R.
The following functions will be needed for this task:
• substr extracts a substring from a character vector.
• paste can be used to collapse elements of different vectors to one.
• ifelse does a vectorized evaluation of an expression: ifelse(expression, A, B) returns A where expression is TRUE and B where it is FALSE.
• %in% is a logical operator which returns TRUE for each element on the left side that has a match among the elements on the right side, and FALSE otherwise.
We will use three Finnish fake ids
x <- c("010199-123N", "301001A1234","130620-4567")
xDates <- substr(x, 1, 6)
xSex <- substr(x, 10, 10)
centuries <- ifelse(substr(x, 7, 7) == "+", 19,
ifelse(substr(x, 7, 7) == "-", 20, 21))
x2Dates <- paste(xDates, centuries - 1, sep = "")
birthDates1 <- as.Date(xDates, format = "%d%m%y") # wrong
birthDates2 <- as.Date(x2Dates, format = "%d%m%y%C")
sex <- ifelse(xSex %in% c(0, 2, 4, 6, 8),"Female","Male")
cbind.data.frame(sex, birthDates1, birthDates2)
1.6.5 Arrays and matrices
M1 <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), c("A", "B", "C")))
M1
  A B C
a 1 3 5
b 2 4 6
rownames(M1)
[1] "a" "b"
length(M1) ## number of elements in the matrix!
[1] 6
c(M1) ## columns are appended into an atomic vector
[1] 1 2 3 4 5 6
Length and names for arrays The counterpart of length for an array is dim, and the counterpart of names is dimnames, which is a list of character vectors of appropriate lengths.
A1 <- array(1:24, dim = c(3, 4, 2))
dim(A1)
[1] 3 4 2
dimnames(A1) <- list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4"),
c("a1", "a2"))
A1
, , a1
c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
r3 3 6 9 12
, , a2
c1 c2 c3 c4
r1 13 16 19 22
r2 14 17 20 23
r3 15 18 21 24
1.6.6 Data frames and tibbles
DF1 <- data.frame(V1 = 1:5,
V2 = c("a", "a", "b", "a", "d"))
str(DF1)
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: chr "a" "a" "b" "a" ...
Most functions which read external data into R also return a data frame.
Note that since R 4.0.0 character vectors are no longer automatically converted to factors; the argument stringsAsFactors = TRUE provides the old behavior of automatic conversion.
DF2 <- data.frame(V1 = 1:5,
V2 = c("a", "a", "b", "a", "d"),
stringsAsFactors = TRUE)
str(DF2)
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: Factor w/ 3 levels "a","b","d": 1 1 2 1 3
More on data frames Basically a data frame is a list with (an S3) class attribute:
typeof(DF1)
[1] "list"
class(DF1)
[1] "data.frame"
is.data.frame(DF1)
[1] TRUE
Coercion to data frames Lists, vectors and matrices can be coerced to data frames where appropriate. For lists this means that all components must have the same length.
V1 <- 1:5
L1 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1)])
L2 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1, 3)])
str(as.data.frame(V1))
'data.frame': 5 obs. of 1 variable:
$ V1: int 1 2 3 4 5
str(as.data.frame(M1))
'data.frame': 2 obs. of 3 variables:
$ A: int 1 2
$ B: int 3 4
$ C: int 5 6
str(as.data.frame(L1))
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: chr "a" "b" "c" "b" ...
str(as.data.frame(L2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
• The basic functions to combine two data frames (works similar with matrices) are cbind and rbind.
• When combining column-wise, then the numbers of rows must match and row names are ignored (hence
observations need to be in the same order).
• When combining row-wise the number of columns and their names must match.
• For more advanced combining see the function merge (a small example follows after the cbind/rbind examples below).
cbind(DF1, data.frame(new = 6:10))
V1 V2 new
1 1 a 6
2 2 a 7
3 3 b 8
4 4 a 9
5 5 d 10
rbind(DF1, data.frame(V1 = 1, V2 = "c"))
V1 V2
1 1 a
2 2 a
3 3 b
4 4 a
5 5 d
6 1 c
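As announced above, a small illustration of merge (the data and the key column are made up); by default it joins two data frames on their common columns:
df_a <- data.frame(id = 1:3, x = c("a", "b", "c"))
df_b <- data.frame(id = 2:4, y = c(10, 20, 30))
merge(df_a, df_b, by = "id")
  id x  y
1  2 b 10
2  3 c 20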
More about cbind Note that cbind (and rbind) try to make matrices when possible. Only if at least one of the elements to be combined is a data frame will the result also be a data.frame.
Hence vectors usually can't be combined into a data frame using cbind.
V1 <- 1:3
V2 <- c("a", "b", "a")
str(cbind(V1, V2))
chr [1:3, 1:2] "1" "2" "3" "a" "b" "a"
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "V1" "V2"
43
Data frames can also contain list columns; with data.frame() the list must however be protected with I():
# does not work
DF5 <- data.frame(a = 1:3, b = list(1:2,1:3,1:4))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
# does work
DF6 <- data.frame(a = 1:3, b = I(list(1:2,1:3,1:4)))
DF6
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
Tibbles Due to frustration over some old-school design choices in data.frames, the tibble package (Müller and Wickham 2021) introduced a modified type of data frame, namely tibbles. They share the same structure as data frames but the class vector is longer:
attributes(DF1)
$names
[1] "V1" "V2"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5
library("tibble")
DF1t <- tibble(V1 = 1:5,
V2 = c("a", "a", "b", "a", "d"))
attributes(DF1t)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "V1" "V2"
Non-syntactic names While data frames automatically transform non-syntactic names (unless check.names = FALSE), tibbles do not (although they do print non-syntactic names surrounded by backticks).
names(data.frame(`1` = 1))
[1] "X1"
names(tibble(`1` = 1))
[1] "1"
Recycling While every element of a data frame (or tibble) must have the same length, both data.frame()
and tibble() will recycle shorter inputs. However, while data frames automatically recycle columns that are
an integer multiple of the longest column, tibbles will only recycle vectors of length one.
data.frame(x = 1:4, y = 1:2)
x y
1 1 1
2 2 2
3 3 1
4 4 2
data.frame(x = 1:4, y = 1:3)
Error in data.frame(x = 1:4, y = 1:3) : arguments imply differing number of rows: 4, 3
tibble(x = 1:4, y = 1)
# A tibble: 4 x 2
x y
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
tibble(x = 1:4, y = 1:2)
Error in `tibble()`:
! Tibble columns must have compatible sizes.
* Size 4: Existing data.
* Size 2: Column `y`.
i Only values of size one are recycled.
Variable reference There is one final difference: tibble() allows you to refer to variables created during
construction:
tibble(
x = 1:3,
y = x * 2
)
# A tibble: 3 x 2
x y
<int> <dbl>
1 1 2
2 2 4
3 3 6
1.6.7 NULL
NULL is special data structure because it has a unique type, is always length zero, and can’t have any
attributes:
typeof(NULL)
[1] "NULL"
length(NULL)
[1] 0
x <- NULL
attr(x, "y") <- 1
Error in attr(x, "y") <- 1 : attempt to set an attribute on NULL
c(1, NULL)
[1] 1
NULL is also used to remove components of a list (by assigning NULL to them) or to represent an absent vector (e.g., NULL is often used as a default function argument, when the argument is optional but the default value requires some computation).
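A minimal sketch of this default-argument use (the function and argument names are made up):
center <- function(x, ref = NULL) {
  # if no reference value is supplied, compute a default from x
  if (is.null(ref)) ref <- mean(x)
  x - ref
}
center(c(1, 2, 3))
[1] -1  0  1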
1.6.8 Some more functions for dealing with different data structures
The function lapply() can be useful when working with lists. It applies a function FUN to each element of a list:
L1 <- list(1:5, "a", c(TRUE, FALSE))
lapply(L1, length)
[[1]]
[1] 5
[[2]]
[1] 1
[[3]]
[1] 2
The function apply() is important for working with data frames or with matrices. It applies a function FUN to each row (if MARGIN = 1) or each column (if MARGIN = 2).
df <- data.frame(Income = c(1000, 1100, 1200, 1300, 1400),
Age = c(20, 19, 54, 45, 24))
## Apply to columns
apply(df, 2, mean)
Income Age
1200.0 32.4
apply(df, 2, function(x) length(x))
Income Age
5 5
We add a new column to df:
df$Group <- factor(c("A", "B", "A", "A", "B"))
The following command returns a warning, as we cannot compute the mean of a categorical variable (the
data frame is coerced to a character matrix, so all means are NA).
apply(df, 2, mean)
Income    Age  Group
    NA     NA     NA
A solution is to first identify the numeric columns and apply the function only to those:
id_num <- sapply(df, is.numeric)
apply(df[, id_num], 2, mean)
Income Age
1200.0 32.4
sapply(df[, id_num], mean)
Income Age
1200.0 32.4
Note that sapply() can also be used for lists. If the results of the function applied to each element are of the
same length, it will return a vector or a matrix (pay attention to coercion!). Otherwise it returns a list. For
example,
sapply(L1, length) # returns a vector
[1] 5 1 2
sapply(L1, summary) # returns a list
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3 4 5
[[2]]
Length Class Mode
1 character character
[[3]]
Mode FALSE TRUE
logical 1 1
The function tapply() is also very useful when working with data frames. It applies a function to subsets
of a vector, where the subsets are defined by the levels of a factor:
tapply(df$Income, df$Group, mean)
A B
1167 1250
For computing summary statistics for each column of the data frame the function summary() can be used:
summary(df)
Atomic vectors can be subset with [ in six different ways:
1. positive indexing using vectors of positive integers.
2. negative indexing using vectors of negative integers.
3. logical indexing using logical vectors.
4. named indexing using character vectors.
5. blank indexing.
6. zero indexing.
1.7.1.1 Positive indexing of atomic vectors One can index an atomic vector by specifying in square
brackets the positions of the elements which should be selected.
V1 <- c(1, 3, 2.5, 7.2, -3.2)
# basic version
V1[c(1, 3)]
[1] 1.0 2.5
# same elements can be selected multiple times
V1[c(1, 3, 1, 3, 1, 3, 1)]
[1] 1.0 2.5 1.0 2.5 1.0 2.5 1.0
# double valued indices are truncated to integers
V1[c(1.1, 3.9)]
[1] 1.0 2.5
1.7.1.2 Negative indexing of atomic vectors One can index an atomic vector by specifying in square
brackets the positions of the elements which should not be selected.
V1[-c(2, 4, 5)]
[1] 1.0 2.5
1.7.1.3 Logical indexing of atomic vectors Giving in square brackets a logical vector of the same
length means that the elements with value TRUE will be selected.
# basic version
V1[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] 1.0 2.5
# if the logical vector is too short,
# it will be recycled.
V1[c(TRUE, FALSE, TRUE)]
[1] 1.0 2.5 7.2
# most common is to use expression
# which return a logical vector
V1[V1 < 3]
[1] 1.0 2.5 -3.2
1.7.1.4 Named indexing of atomic vectors One can give in square brackets a character vector of the
names which should be selected.
names(V1) <- letters[1:5]
# basic version
V1[c("a", "c")]
a c
1.0 2.5
# same elements can be selected multiple times
V1[c("a", "c", "a", "c", "a", "c", "a")]
a c a c a c a
1.0 2.5 1.0 2.5 1.0 2.5 1.0
# names are matched exactly
V1[c("a", "c", "ab", "z")]
a c <NA> <NA>
1.0 2.5 NA NA
1.7.1.5 Blank and zero indexing of atomic vectors Blank indexing is not useful for atomic vectors
but will be relevant for higher dimensional objects. It returns in this case the original atomic vector.
V1[]
a b c d e
1.0 3.0 2.5 7.2 -3.2
Zero indexing returns in this case a zero-length vector. It is often used when generating test data.
V1[0]
named numeric(0)
1.7.3.1 Subsetting matrices with two vectors Using one vector for the rows and one vector for the
columns:
M1 <- matrix(1:6, ncol = 3)
rownames(M1) <- LETTERS[1:2]
colnames(M1) <- letters[1:3]
M1[c(TRUE, FALSE), c("b", "c")]
b c
3 5
M1[-2, ]
a b c
1 3 5
1.7.3.2 Subsetting matrices with one vector As matrices (arrays) are essentially vectors with a
dimension attribute, a single vector can also be used to extract elements. For this it is important to remember
that matrices (arrays) are filled in column-major order.
M2 <- outer(1:5, 1:5, paste, sep = ",")
M2
[,1] [,2] [,3] [,4] [,5]
[1,] "1,1" "1,2" "1,3" "1,4" "1,5"
[2,] "2,1" "2,2" "2,3" "2,4" "2,5"
[3,] "3,1" "3,2" "3,3" "3,4" "3,5"
[4,] "4,1" "4,2" "4,3" "4,4" "4,5"
[5,] "5,1" "5,2" "5,3" "5,4" "5,5"
M2[c(3, 17)]
[1] "3,1" "2,4"
1.7.3.3 Subsetting matrices with a matrix This is rarely done but possible. To select elements from
an n-dimensional object, a matrix with n columns can be used. Each row of the matrix specifies one element.
The result will always be a vector. The matrix can consist of integers or of characters (if the array is named).
M3 <- matrix(ncol = 2, byrow = TRUE,
data = c(1, 4,
3, 3,
5, 1))
M2[M3]
[1] "1,4" "3,3" "5,1"
Data frames possess the characteristics of both lists and matrices: they can be subset like a matrix (with
two indices) or like a list (with a single index). Consider:
DF1 <- data.frame(a = 4:6, b = 7:5, c = c("o", "p", "q"))
To select rows:
DF1[c(1, 3), ]
  a b c
1 4 7 o
3 6 5 q
To select columns:
# like a matrix
DF1[, c("a","c")]
a c
1 4 o
2 5 p
3 6 q
# like a list
DF1[c("a","c")]
a c
1 4 o
2 5 p
3 6 q
Assume we have fitted a simple linear regression fit1 <- lm(y ~ x) to simulated data and want to extract
individually the three parameters of the model (intercept, slope and residual standard deviation).
str(fit1)
List of 12
$ coefficients : Named num [1:2] 2.982 0.531
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
$ residuals : Named num [1:100] 0.0495 -0.0549 0.0342 -0.1234 0.1549 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ effects : Named num [1:100] -32.572 1.414 0.028 -0.137 0.157 ...
..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:100] 3.12 3.18 3.29 3.46 3.09 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "x"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.1 1.05
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 98
$ xlevels : Named list()
$ call : language lm(formula = y ~ x)
$ terms :Classes 'terms', 'formula' language y ~ x
.. ..- attr(*, "variables")= language list(y, x)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. ..$ : chr "x"
.. ..- attr(*, "term.labels")= chr "x"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(y, x)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "y" "x"
$ model :'data.frame': 100 obs. of 2 variables:
..$ y: num [1:100] 3.17 3.12 3.32 3.34 3.24 ...
..$ x: num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
.. .. ..- attr(*, "variables")= language list(y, x)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. .. ..$ : chr "x"
.. .. ..- attr(*, "term.labels")= chr "x"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(y, x)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
- attr(*, "class")= chr "lm"
# the intercept
fit1$coefficients[1]
(Intercept)
2.982
# the slope
fit1$coefficients[2]
x
0.5312
# sigma needs to be computed
sqrt(sum((fit1$residuals - mean(fit1$residuals))^2)
     /fit1$df.residual)
[1] 0.09411
These values can be compared with the output of summary():
summary(fit1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.18498 -0.05622 -0.00871 0.05243 0.25166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9821 0.0206 145 <2e-16 ***
x 0.5312 0.0353 15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lists can be subset with [ , which always returns a list, and with [[ and $, which extract single elements.
Consider:
L1 <- list(a = 1:2, b = c("a", "b", "c"), c = c(TRUE, FALSE))
str(L1)
List of 3
$ a: int [1:2] 1 2
$ b: chr [1:3] "a" "b" "c"
$ c: logi [1:2] TRUE FALSE
L1[[1]]
[1] 1 2
L1[1]
$a
[1] 1 2
L1$a
[1] 1 2
str(L1[[1]])
int [1:2] 1 2
str(L1[1])
List of 1
$ a: int [1:2] 1 2
str(L1$a)
int [1:2] 1 2
If [[ is used with a vector of integers or characters, then it assumes a nested list structure and subsets
recursively.
L2 <- list(a = list(A = list(aA = 1:3, bB = 4:6),
B = "this"),
b = "that")
str(L2)
List of 2
$ a:List of 2
..$ A:List of 2
.. ..$ aA: int [1:3] 1 2 3
.. ..$ bB: int [1:3] 4 5 6
..$ B: chr "this"
$ b: chr "that"
L2[[c("a","A","aA")]]
[1] 1 2 3
Simplification vs preservation As the different subsetting operators have different properties, the
distinction between simplifying and preserving subsetting needs to be remembered at all times, as it can have
a huge impact in programming.
When in doubt, it is usually better not to simplify: it is safer if an object always keeps the type it had
originally.
To prevent or force simplification, the argument drop can be specified in [.
             Simplification         Preservation
vector       x[[1]]                 x[1]
list         x[[1]]                 x[1]
factor       x[ind, drop = TRUE]    x[1]
matrix       x[1, ] or x[, 1]       x[ind, , drop = FALSE] or x[, ind, drop = FALSE]
data frame   x[, 1] or x[[1]]       x[, 1, drop = FALSE] or x[1]
here ind is an indexing vector of positive integers and naturally arrays behave the “same” as matrices.
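For example, for a matrix:
M <- matrix(1:6, nrow = 2)
M[1, ]
[1] 1 3 5
M[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5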
What does simplification mean for atomic vectors? Simplification for atomic vectors concerns the loss of
names.
V1 <- c(a = 1, b = 2, c = 3)
V1[1]
a
1
V1[[1]]
[1] 1
For arrays and matrices, simplification means dropping dimensions of extent one. Consider:
A1 <- array(1:6, dim = c(2, 3, 1))
dim(A1)
[1] 2 3 1
A1[ , , 1]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
dim(A1[ , , 1])
[1] 2 3
Missing and out-of-bounds indices It is useful to understand what happens with [[ when one uses an
“invalid” index.
An invalid index can be:
• missing (NA)
• zero-length
• out of bounds (integer)
• out of bounds (character)
Some inconsistencies happen among the different data types:
x <- list(
a = list(1, 2, 3),
b = list(3, 4, 5)
)
y <- c(1, 2, 3)
## missing index
x[[NA]]
NULL
y[[NA]]
Error in y[[NA]]: subscript out of bounds
y[["a"]]
Error in y[["a"]]: subscript out of bounds
Subsetting and assignment All subsetting operators can be combined with assigning values to the
selected parts.
x <- 1:6
x
[1] 1 2 3 4 5 6
x[1] <- 20
x
[1] 20 2 3 4 5 6
x[-1] <- 51:55
x
[1] 20 51 52 53 54 55
x[c(1,1)] <- c(-10,-20)
x
[1] -20 51 52 53 54 55
## Logical & NA indexing can be combined!
## The logical vector will be recycled:
x[c(TRUE, FALSE, NA)] <- 1
x
[1] 1 51 52 1 54 55
The condition should evaluate to a single TRUE or FALSE. Most other inputs will generate an error:
if ("x") 1
Error in if ("x") 1: argument is not interpretable as logical
if (logical()) 1
Error in if (logical()) 1: argument is of length zero
However, since numeric values are coerced to logicals,
if (0) 1 else 2
[1] 2
works.
Exercise: Check how the coercion to logicals works. What does as.logical(0), as.logical(0.1) and
as.logical(10) return? Why does the following work:
x <- 1:10
if (length(x)) "not empty" else "empty"
[1] "not empty"
A logical vector of length greater than 1 generates an error since R 4.2.0 (see
https://cran.rstudio.com/doc/manuals/r-patched/NEWS.pdf). Before this version it would only generate a warning.
if (c(TRUE, FALSE)) 1
Error in if (c(TRUE, FALSE)) 1: the condition has length > 1
1.8.1.2 Vectorized if Statement if only works with a single TRUE or FALSE value. If one has a vector
of logical values, then the ifelse() function can be used.
The basic syntax is:
ifelse(EXPR, yes, no)
The result is a vector of the same length as EXPR which contains, elementwise, the value of yes where EXPR
is TRUE and the value of no where EXPR is FALSE.
x <- 1:4
ifelse(x < 2.5, "yes", "no")
[1] "yes" "yes" "no"  "no"
Also the attributes from EXPR will be kept and only the entries replaced.
x <- 1:4
names(x) <- c("x1", "x2", "x3", "x4")
ifelse(x < 2.5, "yes", "no") # keeps the names
x1 x2 x3 x4
"yes" "yes" "no" "no"
Missing values in EXPR remain missing values.
x <- 1:4
x[3] <- NA
ifelse(x < 2.5, "yes", "no")
[1] "yes" "yes" NA    "no"
1.8.1.3 switch statement Another possibility for conditional execution is the function switch. It is
especially useful when there are more than two possibilities or if the options are named. The basic syntax
is:
switch(EXPR, options)
where EXPR can be an integer value indicating which option should be chosen, or alternatively a
character string if the options are named. The recommendation is to typically use switch() with character
inputs.
For example, it can be used to replace the following code containing nested if statements
x <- "a"
if (x == "a") {
"option 1"
} else if (x == "b") {
"option 2"
} else if (x == "c") {
"option 3"
} else {
stop("Invalid `x` value")
}
with the more compact
switch(x,
  a = "option 1",
  b = "option 2",
  c = "option 3",
  stop("Invalid `x` value")
)
[1] "option 1"
1.8.2.1 for loop The for loop repeats statements for each element of a vector. The basic syntax is:
for (value in vector) {
statements
}
For example:
for (i in 1:3) print(i)
[1] 1
[1] 2
[1] 3
Common pitfalls
If each iteration produces a result which should be stored, then it is recommended to first create an object of
the appropriate length in which the results are stored. The numeric() or vector() functions are helpful
here.
out <- numeric(3)
for (i in 1:3) {
out[i] <- i * 10 + 1
}
out
[1] 11 21 31
Pay attention to iterating over S3 vectors, as loops typically strip the attributes:
xs <- as.Date(c("2020-01-01", "2010-01-01"))
for (x in xs) {
print(x)
}
[1] 18262
[1] 14610
Pay attention to iterating over 1:length(x) when the length of x is zero (the sequence 1:0 runs backwards):
out <- c()
for (i in 1:length(out)) {
print(i)
}
[1] 1
[1] 0
In such cases, it is better to use seq_along(x) instead. It always returns a sequence of the same length as x
(and hence an empty one when x has length zero):
out <- c()
for (i in seq_along(out)) {
print(i)
}
To compute in R the first 10 Fibonacci numbers we can use a for loop in the following way:
Fib <- numeric(10) ## create a vector which will store numeric elements
Fib[1] <- 1
Fib[2] <- 1
for (i in 3:10) {
Fib[i] <- Fib[i-1] + Fib[i-2]
}
Fib
[1] 1 1 2 3 5 8 13 21 34 55
1.8.2.2 while loop The while loop can be used when statements have to be repeated but it is not known
in advance exactly how often. The computations should be continued as long as a condition is fulfilled.
The syntax looks like:
while (condition) {
statements
}
Hence condition is evaluated first, and if it is FALSE nothing will be done. If the condition is however TRUE,
then the statements are executed. After the statements are executed, the condition is evaluated again.
To compute for example all Fibonacci numbers smaller than 100 we could use
Fib1 <- 1
Fib2 <- 1
Fibs <- c(Fib1)
while (Fib2 < 100) {
Fibs <- c(Fibs, Fib2)
oldFib2 <- Fib2
Fib2 <- Fib1 + Fib2
Fib1 <- oldFib2
}
Fibs
[1] 1 1 2 3 5 8 13 21 34 55 89
Note: increasing the length of a vector can be costly for R! Avoid if possible.
1.8.2.3 repeat loop If a loop is needed which does not go through a prespecified number of iterations, or
which should not have a condition check at the top, the repeat loop can be used. The syntax looks like:
repeat {
statements
}
This causes the statement to be repeated endlessly. Therefore a terminator called break needs to be included.
It is usually included as:
if (condition) break
In general the break command can be used in any loop and it causes the loop to terminate immediately.
Similarly, the command next can also be used in any loop; it causes the computations of the current
iteration to be terminated immediately, with the next iteration started from the top (see the sketch below).
The repeat loop and the functions break and next are rarely used, since it is much easier to read and
understand programs using the other looping methods.
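A small illustration of next and break:
for (i in 1:5) {
  if (i == 3) next   # skip the rest of iteration 3
  if (i == 5) break  # terminate the loop before iteration 5 is executed
  print(i)
}
[1] 1
[1] 2
[1] 4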
To compute for example all Fibonacci numbers smaller than 100 we could use also
Fib1r <- 1
Fib2r <- 1
Fibsr <- c(Fib1r)
repeat {
Fibsr <- c(Fibsr, Fib2r)
oldFib2r <- Fib2r
Fib2r <- Fib1r + Fib2r
Fib1r <- oldFib2r
if (Fib2r > 100) break
}
Fibsr
[1] 1 1 2 3 5 8 13 21 34 55 89
1.9 Functions
R is a functional programming language. As John Chambers (creator of S) put it:
Everything that exists is an object. Everything that happens is a function call.
Functions are fundamental building blocks in R and are self-contained units of code with a well-defined
purpose. To create a function, function() is used. The parentheses enclose the argument list. Then a single
statement or multiple statements enclosed by {} are specified.
When R executes a function definition it produces an object with three parts:
1. body: the code inside the function.
2. formals: the list of arguments which specify how to call the function.
3. environment: the data structure that determines how the function finds the values associated with the
names.
When printing the function it will display these parts (if the environment is not shown it is the global
environment).
To reduce the burden for the user, one can give default values to some arguments:
f <- function(x, y = 1) {
z <- x + y
2 * z
}
f
function (x, y = 1)
{
z <- x + y
2 * z
}
formals(f)
$x
$y
[1] 1
body(f)
{
z <- x + y
2 * z
}
environment(f)
<environment: R_GlobalEnv>
There is one exception: a group of functions which do not have the three parts just described. These are
called primitive functions.
All primitive functions are located in the base package. They call C code directly and do not contain any R
code.
sum
function (..., na.rm = FALSE) .Primitive("sum")
formals(sum)
NULL
body(sum)
NULL
environment(sum)
NULL
As mentioned, every operation in R is a function call. Really, every operation in R is a function call.
So also +, -, *, [, $, {, for, ... are functions. To demonstrate how some operators are actually functions,
check the following code:
x <- 10
y <- 20
x + y
[1] 30
'+'(x, y)
[1] 30
Variables created inside a function are local to that function. When R encounters a name inside a function,
it first looks whether the name is defined inside the function itself; names defined inside a function mask
names defined outside of it:
x <- 10
y <- 20
g01 <- function() {
x <- 1
y <- 2
c(x, y)
}
g01()
[1] 1 2
x <- 10
y <- 20
g02 <- function() {
y <- 2
c(x, y)
}
g02()
[1] 10 2
The same rules apply if a function is defined inside another function. First, R looks inside the current function.
Then, it looks where that function was defined (and so on, all the way up to the global environment). Finally,
it looks in other loaded packages.
x <- 1
g03 <- function() {
y <- 2
i <- function() {
z <- 3
c(x, y, z)
}
i()
}
g03()
[1] 1 2 3
In R, functions are ordinary objects. This means the scoping rules described above also apply to functions:
g07 <- function(x) x + 1
g08 <- function() {
g07 <- function(x) x + 100
g07(10)
}
g08()
[1] 110
However, when a function and a non-function share the same name (they must, of course, reside in different
environments), applying these rules gets a little more complicated. When you use a name in a function call,
R ignores non-function objects when looking for that value.
c <- 1
c(a = 1, c = c)
a c
1 1
Exercise: Check the following code and explain why f1(3) and f2(3) return different values.
y <- 10
f1 <- function(x) {
g <- function(x) {
x * y
}
y <- 2
y ^ 2 + g(x)
}
f1(3)
f2 <- function(x) {
y <- 2
y ^ 2 + g(x)
}
g <- function(x) {
x * y
}
f2(3)
When calling a function, arguments can be matched by name or by position. Consider the function:
f <- function(Position1, Pos2, Pos3) {
  list(pos1 = Position1, pos2 = Pos2, pos3 = Pos3)
}
str(f(Position1 = 1, Pos2 = 2, Pos3 = 3))
List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3
str(f(1, 2, 3))
List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3
# good calls
mean(1:10)
[1] 5.5
mean(1:10, trim=0.2)
[1] 5.5
# confusing calls
mean(1:10, n=T)
[1] 5.5
mean(1:10, , FALSE)
[1] 5.5
mean(1:10, 0.2)
[1] 5.5
mean(, , TRUE, x=1:10)
[1] 5.5
In programming you will often encounter the case that the arguments for a function call are stored in a list.
How can this list then be used for the actual function call?
The solution is the function do.call.
L1 <- list(1:10, na.rm = TRUE)
do.call(mean, L1)
[1] 5.5
Missing arguments Sometimes reasonable default values are complicated, lengthy expressions which are
difficult to include in the function definition.
In that case the default is computed inside the function, but only if the argument was not specified by the
user. To check whether an argument is missing, the function missing is used.
In R it is however more common to set such arguments to NULL by default and then check inside the
function whether the argument is NULL using is.null.
f1 <- function(a, b) {
if (missing(a)) a <- "default"
if (missing(b)) b <- "default"
list(a = a, b = b)
}
str(f1())
List of 2
$ a: chr "default"
$ b: chr "default"
str(f1(1, 2))
List of 2
$ a: num 1
$ b: num 2
f2 <- function(a = NULL, b = NULL) {
if (is.null(a)) a <- "default"
if (is.null(b)) b <- "default"
list(a = a, b = b)
}
str(f2())
List of 2
$ a: chr "default"
$ b: chr "default"
str(f2(1, 2))
List of 2
$ a: num 1
$ b: num 2
A function returns the value of the last evaluated expression; alternatively, a value can be returned explicitly
(and the function exited early) using return():
f1 <- function(x) {
if (x < 0) return("not positive")
if (x < 5) {
"between 0 and 5"
} else {
"larger than 5"
}
}
f1(-1)
[1] "not positive"
f1(1)
[1] "between 0 and 5"
f1(10)
[1] "larger than 5"
1.9.4.1 Invisible return It is possible to return objects from a function call which are not printed by
default, using the invisible function.
Invisible output can be assigned to an object and/or forced to be printed by wrapping the function call in
round parentheses.
f1 <- function() 1
f2 <- function() invisible(1)
f1()
[1] 1
f2()
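Wrapping the call in parentheses forces the invisible value to be printed:
(f2())
[1] 1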
1.9.4.2 The pipe operator The magrittr package defines the pipe operator %>% and many other
packages also make use of it.
Rather than typing f(x, y) we type x %>% f(y) (start with x, then use f(y) to modify it).
Since version 4.1.0, base R contains a native pipe |> with similar syntax:
x <- 1:4; y <- 4
sum(x, y)
[1] 14
x |> sum(y)
[1] 14
x |> mean()
[1] 2.5
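The magrittr pipe works analogously (a small illustration, assuming the magrittr package is installed):
library("magrittr")
x %>% sum(y)
[1] 14
x %>% mean()
[1] 2.5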
1.10 Efficient coding, debugging, benchmarking
This section has been adapted from lecture notes by Naim Rashid and Michael Love. For more on this topic
you can have a look at Efficient R Programming.
• Caching variables and memoization: A straightforward method for speeding up code is to calculate
objects once and reuse the value when necessary. This could be as simple as replacing sd(x) in multiple
function calls with the object sd_x <- sd(x) that is defined once and reused. If you happen to be
writing a function that will take input which has repeated elements, and it is very important for this
function to be very fast, memoization can be a useful technique. Memoization entails storing the values
of expensive function calls so that they don't have to be recomputed. There is a small overhead in saving
and looking up the precomputed values, but if the degree of repeated input is high, the savings of
memoization can be large. The memoise package helps with this in R (see the first sketch after this list).
• Parallelization: One can think of the standard for loop as a serial operation: the (i + 1)th iteration is
always run after the (i)th iteration has completed. On machines with more than one CPU available, say
with P CPUs, parallelization of iterations may allow for the execution of up to P iterations of the for
loop at the same time. This is particularly helpful when there is a large number of loop iterations which
are non-recursive, meaning the next iteration of the loop does not depend on the prior one (simulations
for example); see the second sketch after this list. One thing to note is that each parallel instance
invoked in R requires additional memory, and therefore memory-intensive operations may quickly
exhaust the available memory on your machine if you are not careful.
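Two minimal sketches of these ideas (slow_sd and the cluster size are illustrative; memoise needs to be
installed, while parallel ships with R):
library("memoise")
slow_sd <- function(x) { Sys.sleep(1); sd(x) } # stands in for an expensive computation
fast_sd <- memoise(slow_sd)
fast_sd(1:1000)  # first call: slow, the result is stored
fast_sd(1:1000)  # repeated call: returned instantly from the cache

library("parallel")
cl <- makeCluster(2)  # use 2 CPUs
## 100 non-recursive iterations, up to 2 of them running at the same time
res <- parLapply(cl, 1:100, function(i) mean(rnorm(1000)))
stopCluster(cl)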
Two notes on avoiding making copies, from Section 2.5 in Advanced R:
For loops have a reputation for being slow in R, but often that slowness is caused by every
iteration of the loop creating a copy. Consider the following code. It subtracts the median from
each column of a large data frame:
x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- sapply(x, median)
for (i in seq_along(medians)) {
x[[i]] <- x[[i]] - medians[[i]]
}
We can use the tracemem function to understand whether and how many copies are being made in R for this
for loop:
cat(tracemem(x), "\n")
<0x12ee0b3a8>
for (i in seq_along(medians)) {
x[[i]] <- x[[i]] - medians[[i]]
}
tracemem[0x12ee0b3a8 -> 0x1183a32c8]: eval eval withVisible withCallingHandlers eval eval with_handlers
tracemem[0x1183a32c8 -> 0x1183a33a8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eva
tracemem[0x1183a33a8 -> 0x1183a3488]: eval eval withVisible withCallingHandlers eval eval with_handlers
tracemem[0x1183a3488 -> 0x1183a3568]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eva
tracemem[0x1183a3568 -> 0x118277038]: eval eval withVisible withCallingHandlers eval eval with_handlers
tracemem[0x118277038 -> 0x1182773b8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eva
tracemem[0x1182773b8 -> 0x1182770a8]: eval eval withVisible withCallingHandlers eval eval with_handlers
tracemem[0x1182770a8 -> 0x1184538a8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eva
tracemem[0x1184538a8 -> 0x118432678]: eval eval withVisible withCallingHandlers eval eval with_handlers
tracemem[0x118432678 -> 0x1184330f8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eva
untracemem(x)
In fact, each iteration copies the data frame not once, but twice! We can reduce the number of copies by
using a list instead of a data frame (modifying a list uses internal C code and deals better with counting
bindings; see Section 2.5 in Advanced R for more details).
y <- as.list(x)
cat(tracemem(y), "\n")
<0x10ed508a8>
for (i in 1:5) {
y[[i]] <- y[[i]] - medians[[i]]
}
tracemem[0x10ed508a8 -> 0x10f683008]: eval eval withVisible withCallingHandlers eval eval with_handlers
untracemem(y)
Consider for example computing the square root of each element of a vector, once with the vectorized sqrt()
and once with a naive loop implementation slow.sqrt() which grows the result vector element by element.
The function mark() from the bench package can be used to compare the two:
library("bench")
slow.sqrt <- function(x) {
ans <- c()
for (v in x) {
ans <- c(ans, sqrt(v))
}
ans
}
mark(sqrt(1:1000), slow.sqrt(1:1000))
# A tibble: 2 x 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 sqrt(1:1000) 1.39us 2.54us 387694. 15.72KB 116.
2 slow.sqrt(1:1000) 762.6us 1.05ms 959. 3.88MB 84.4
This benchmark indicates that the vectorized version of the square root is orders of magnitude faster
than the naive implementation. Let's compare slow.sqrt with a version where we preallocate the vector that
stores the eventual output of the function:
pre.sqrt <- function(x) {
ans <- numeric(length(x))
for (i in seq_along(x)) {
ans[i] <- sqrt(x[i])
}
ans
}
mark(pre.sqrt(1:1000), slow.sqrt(1:1000))
# A tibble: 2 x 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 pre.sqrt(1:1000) 40.4us 41.29us 23967. 30.05KB 2.40
2 slow.sqrt(1:1000) 754.6us 1.02ms 984. 3.86MB 88.2
So simply pre-allocating saves us about an order of magnitude in speed.
1.10.3.2 Profiling If you know that your function is correct but think it is slow, you can do profiling,
which helps to identify the parts of the function which are bottlenecks; then you can consider whether these
parts can be improved.
The idea in profiling is that the software checks in very short intervals which function is currently running. The
main functions in R to do profiling are Rprof and summaryRprof. But there are also many other specialized
packages for this purpose. We will look only at the package profvis (Chang, Luraschi, and Mastny 2020).
A function to profile
Stest <- function(n = 1000000, seed = 1) {
set.seed(seed)
normals <- rnorm(n*10)
X <- matrix(normals, nrow=10)
Y <- matrix(normals, ncol=10)
XXt <- X %*% t(X)
XXcp <- tcrossprod(X)
return(n)
}
system.time(Stest())
user system elapsed
0.644 0.035 0.687
Note that the user CPU time gives the CPU time spent by the current process (i.e., the current R session)
and the system CPU time gives the CPU time spent by the kernel (the operating system) on behalf of the
current process. The operating system is used for things like opening files, doing input or output, starting
other processes, and looking at the system clock: operations that involve resources that many processes must
share (adapted from an old answer by William Dunlap on a question in the R-help mailing list).
Rprof(interval = 0.01)
Stest()
[1] 1e+06
Rprof(NULL)
summaryRprof()$by.self
self.time self.pct total.time total.pct
"rnorm" 0.27 57.45 0.27 57.45
"%*%" 0.07 14.89 0.07 14.89
"tcrossprod" 0.06 12.77 0.06 12.77
"matrix" 0.04 8.51 0.04 8.51
"t.default" 0.02 4.26 0.02 4.26
"readBin" 0.01 2.13 0.01 2.13
Run it on your own computer and look at the full output of summaryRprof().
The profvis package yields interactive results displayed in HTML; therefore this again should be done in R
(best in RStudio) and cannot be visualized in the lecture notes.
if (!require("profvis")) install.packages("profvis")
library("profvis")
profvis({
Stest <- function(n=1000000, seed=1){
set.seed(seed)
normals <- rnorm(n*10)
X <- matrix(normals, nrow=10)
Y <- matrix(normals, ncol=10)
XXt <- X %*% t(X)
XXcp <- tcrossprod(X)
return(n)
}
Stest()
})
1.10.4 Debugging
There are two commonly cited claims:
1. Programmers spend more time debugging their own code than actually programming it.
2. Every 20 lines of code contain at least one bug.
Hence debugging is an essential part of programming, and R provides strategies and tools to do this well. In
the following we introduce several strategies.
Top-down programming The general agreement is that good code is written in a modular manner.
This means when you have a procedure to implement, you decompose it into small parts, where each part
becomes its own function. The main function is then “short” and consists mainly of calls to these
subfunctions. Naturally, within these functions the same approach is to be taken (see the Section on Readable
code).
The same approach is then followed in debugging. First the top-level function is debugged, with all subfunctions
assumed correct. If this does not yield a solution, then the next level is debugged, and so on.
Small start strategy The small start strategy in debugging suggests starting with small test cases.
Once these work fine, consider larger test cases. At that stage also extreme cases should be tested.
Antibugging Some antibugging strategies are also useful in this context. Assume that at line n in your
code you know that a variable or vector x must have some specific property, like being positive or summing
up to 1. Then, for debugging purposes, you can add in that line for example
stopifnot(x > 0)
or
stopifnot(sum(x) == 1)
R functions for debugging R provides many functions to help in the debugging process. To name some:
• browser
• debug and undebug
• debugger
• dump.frames
• recover
• trace and untrace
For details about these functions see their help pages. In the following we will look only at debug and
traceback.
Note that RStudio also offers special debugging tools, see
https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio for details.
traceback Often when an error occurs while using functions, it is not really clear where the error actually
occurs, i.e., which (sub)function caused it. One strategy is to use the traceback function which, when called
directly after the erroneous call, returns the sequence of function calls which led to the error.
f1 <- function(x) f2(x)^2
f2 <- function(x) log(x) + "x"
mainf <- function(x) {
x <- f1(x)
y <- mean(x)
y
}
mainf(1:3)
traceback()
debug Assume you have a function foo which you suspect to be faulty. Using then
debug(foo)
will open the “browser” whenever the function is called, until either the function is redefined or the debugging
mode is terminated using
undebug(foo)
In the “browser” the function will be executed line by line, where the next line to be executed is always
shown.
In the browsing mode the following commands have a special meaning:
• n (or just hitting enter) will execute the line shown and then present the next line to be executed.
• c this is almost like n just that it might execute several lines of code at once. For example if you are in
a loop then c will jump to the next iteration of the loop.
• where this prints a stack trace, the sequence of function calls which led the execution to the current
location
• Q this quits the browser.
And in the browser mode any other R command can be used. However, to see for example the value of a
variable named n, the variable needs to be printed explicitly using print(n), as n on its own is interpreted
as the browser command. In a demo we will go through the following function in debugging mode
following function in debugging mode
SimuMeans <- function(m, n = 100, seed = 1) {
set.seed(seed)
RES <- matrix(0, nrow = m, ncol = 3)
for (i in 1:m){
X <- cbind(rnorm(n), rt(n,2), rexp(n))
for (j in 1:3){
RES[i,j] <- mean(X[,j])
}
print(paste(i,Sys.time()))
}
return(RES)
}
debug(SimuMeans)
SimuMeans(5)
Capturing errors Especially in simulations it is often desired that, when an error occurs, not the whole
process is terminated; instead the error should be caught and an appropriate record made, but otherwise the
simulations should continue. For this purpose R has the functions try and tryCatch, of which we will
consider only tryCatch.
The idea of tryCatch is to run the “risky” part, where errors might occur, within the tryCatch call and tell
tryCatch what to return in the case of an error.
Consider a modified version of our previous simulation function:
my.mean <- function(x){
na.fail(x)
mean(x)
}
SimuMeans2 <- function(m, n=100, seed=1) {
set.seed(seed)
RES <- matrix(0, nrow=m, ncol=3)
for (i in 1:m){
X <- cbind(rnorm(n), rt(n,2), rexp(n))
if (i==3) X[1,1] <- NA
for (j in 1:3){
RES[i,j] <- my.mean(X[,j])
}
}
return(RES)
}
SimuMeans2(5)
Error in na.fail.default(x): missing values in object
Using tryCatch:
SimuMeans3 <- function(m, n=100, seed=1) {
set.seed(seed)
RES <- matrix(0, nrow=m, ncol=3)
for (i in 1:m){
X <- cbind(rnorm(n), rt(n,2), rexp(n))
if (i==3) X[1,1] <- NA
for (j in 1:3){
RES[i,j] <- tryCatch(my.mean(X[,j]), error = function(e) NA)
}
}
return(RES)
}
SimuMeans3(5)
[,1] [,2] [,3]
[1,] 0.10889 -0.29099 1.1103
[2,] -0.04921 -0.17200 0.8624
[3,] NA -0.02305 1.0302
[4,] -0.09209 -0.27303 1.0814
[5,] -0.05374 0.13526 1.0200
This function searches along the search path for a dataset with the corresponding name. A list of all
datasets currently available can be retrieved by submitting simply
data()
The current working directory can be queried with getwd():
getwd()
[1] "/Users/lauravanagur/GitHub/105731_aksta_statistical_computing/Script"
When opening a file with RStudio, it automatically sets the working directory to the location of the file. In R
Markdown it is automatically the location of the .Rmd.
Note that paths can be specified in an absolute way or in a relative way (i.e., relative to the working directory).
Assume I want to change my working directory to /Users/lauravanagur/Documents/Teaching/Statistical
Computing/ (one level above the current working directory):
## absolute
setwd("/Users/lauravanagur/Documents/Teaching/Statistical Computing/")
## relative
setwd("../")
where ../ means the folder one level up. Note that in an .Rmd it is not advisable to use such commands to
change the working directory.
Relative paths are useful for reproducibility. They ensure that if another user has the same folder structure,
they can use the same paths. We will see some more examples later.
In the simplest case the data is rectangular: each row stands for
an observation and each column for a variable. This kind of data is often referred to as flat data and can
come in different file formats, e.g., .csv, .txt, .xlsx. R can also import data from other statistical software
platforms such as STATA, SPSS, SAS. Other forms include XML or HTML data, which is highly structured
but not flat data.
An important criterion for the choice of the data import function is the size of the data. If the data set
exceeds the size of your memory you cannot load the entire data set into your workspace and will have to
work with databases. This section is supposed to give a basic overview of the different kinds of functions in
R used to import data into the workspace.
The functionality of the functions read_csv and read_tsv corresponds more or less to the functions read.csv
and read.delim of the utils package; however, they
• are faster,
• do not convert strings to factors (in utils, stringsAsFactors = FALSE is also the default since R 4.0.0),
• return a tibble.
read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
escape_double = TRUE, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), progress = show_progress())
iris2 <- read_csv("data/iris.csv")
New names:
* `` -> `...1`
Rows: 150 Columns: 6
-- Column specification --------------------------------------------------
Delimiter: ","
chr (1): Species
dbl (5): ...1, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(iris2)
$ ...1 : num [1:150] 1 2 3 4 5 6 7 8 9 10 ...
$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
- attr(*, "spec")=
.. cols(
.. ...1 = col_double(),
.. Sepal.Length = col_double(),
.. Sepal.Width = col_double(),
.. Petal.Length = col_double(),
.. Petal.Width = col_double(),
.. Species = col_character()
.. )
- attr(*, "problems")=<externalptr>
The functions in readr print more output to the console, which helps the user verify that the data has been
correctly imported. Notice for example that the Species column was not imported as a factor but remained
a character string.
2.1.3 data.table
The data.table package (Dowle and Srinivasan 2021) has one objective: to increase the speed of importing
data in R. It is
• very fast and
• produces objects of class data.table, a special kind of data.frame.
library("data.table")
Excel files can be imported with the readxl package. The function excel_sheets() lists the sheets contained
in the file at a given path:
library("readxl")
excel_sheets(path)
Importing the example file with read_excel() yields a tibble:
str(read_excel("data/iris.xlsx"))
New names:
* `` -> `...1`
tibble [150 x 6] (S3: tbl_df/tbl/data.frame)
$ ...1 : num [1:150] 1 2 3 4 5 6 7 8 9 10 ...
$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
The example file iris.xlsx has only one sheet:
excel_sheets("data/iris.xlsx")
[1] "iris"
The easiest way to import all spreadsheets from an Excel file and save them into a list is by using lapply():
my_workbook <- lapply(excel_sheets("data/iris.xlsx"), read_excel,
path = "data/iris.xlsx")
Currently, there is no possibility to write .xlsx files from the R console with this package. Other packages,
however, provide this functionality.
Beyond the raw values, also the structure of the content can make a file self-describing, so that one does not need to provide these
details to the software reading the data.
The eXtensible Markup Language, more commonly known simply as XML, can be used to provide such
structure, not only for standard datasets but also for more complex data structures. XML is becoming
extremely popular and is emerging as a standard for general data markup and exchange. It is being used by
different communities to describe geographical data such as maps, graphical displays, mathematics, etc. The
functions necessary to parse an XML document, xmlInternalTreeParse() or xmlTreeParse(), are in the
XML package (Temple Lang 2021), which can be installed via install.packages("XML").
library("XML")
A new development is the ability to interact with databases using exactly the same syntax used to interact
with R objects stored in RAM. This innovation was made possible by dplyr (Wickham et al. 2022), an R
library for data processing that aims to provide a unified “front end” to perform a wide range of analysis
tasks on datasets using a variety of “back ends” which do the number crunching. This is one of the main
advantages of dplyr. It translates the data wrangling function from the dplyr package to SQL queries. Thus,
your R code is translated into SQL and executed in the database, not in R. When working with databases,
dplyr tries to be as lazy as possible:
• It never pulls data into R unless you explicitly ask for it.
• It delays doing any work until the last possible moment: it collects together everything you want to do
and then sends it to the database in one step.
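A minimal sketch of this workflow, assuming the DBI, RSQLite and dbplyr packages are installed and using
an in-memory SQLite database with a built-in toy dataset:
library("dplyr")
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars_db", mtcars)  # copy the data into the database
query <- tbl(con, "mtcars_db") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))  # nothing is computed yet
show_query(query)  # inspect the SQL which dplyr generated
collect(query)     # only now the result is pulled into R
DBI::dbDisconnect(con)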
2.2 Introduction to tidy data with tidyr
“Happy families are all alike; every unhappy family is unhappy in its own way.”
—- Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
—- Hadley Wickham
(excerpt from R for Data Science, Chapter 12)
The content of this chapter is based on the 2014 paper Tidy Data by Hadley Wickham published in the
Journal of Statistical Software.
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and
Johnson 2003). This process is often referred to as data wrangling. It includes many steps, such as outlier
checking and data imputation, reformatting variables – especially dates, etc. The tidyr package (Wickham
and Girlich 2022) provides functions for one of these tasks: tidying i.e., structuring datasets to facilitate
analysis.
The principles of tidy data provide a standard way to organize data values within a dataset. A standard
makes initial data cleaning easier because you do not need to start from scratch and reinvent the wheel every
time. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and
to simplify the development of data analysis tools that work well together.
The paper provides a set of principles of how to structure data, that is the fundamental approach for all
packages in the tidyverse. We will also use the dplyr and the ggplot2 (Wickham 2016) packages which
share these common ideas.
To load all packages in the tidyverse you can use:
library("tidyverse")
However, you can also load the individual packages where and when you need them (this instructor's preferred
approach... loading unnecessary packages can overwrite functions with the same name and can blind-side
you).
NOTE Before we get started, remember the pipe operator introduced in 1.9.4.2. The tidyverse packages
make use of the magrittr pipe %>%, so make sure you familiarize yourself with it!
              treatmenta  treatmentb
John Smith    -           2
Jane Doe      16          11
Mary Johnson  3           1
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy
depending on how rows, columns and tables are matched up with observations, variables and types. In tidy
data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Table 13 is the tidy version of Table 11. Each row represents an observation, the result of one treatment
on one person, and each column is a variable. Note that in this experiment, the missing value represents
an observation that should have been made, but was not, so it is important to keep it. Structural missing
values, which represent measurements that cannot be made (e.g., the count of pregnant males) can be safely
removed.
For a given dataset, it is usually easy to figure out what are observations and what are variables, but it is
surprisingly difficult to precisely define variables and observations in general. For example, if the columns in
the Table 11 were height and weight we would have been happy to call them variables. If the columns were
height and width, it would be less clear cut, as we might think of height and width as values of a dimension
variable.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a
standard way of structuring a dataset. Compare Table 13 to Table 11: in Table 11 you need to use different
strategies to extract different variables. This slows down the analysis and invites errors. If you consider how
many data analysis operations involve all of the values in a variable (every aggregation function), you can see
how important it is to extract these values in a simple, standard way. Tidy data is particularly well suited
for vectorized programming languages like R, because the layout ensures that values of different variables
from the same observation are always paired.
While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan
the raw values. One way of organizing variables is by their role in the analysis:
• are values fixed by the design of the data collection? Fixed variables describe the experimental design
and are known in advance (such as person and treatment in Table 3). Computer scientists often call
fixed variables dimensions, and statisticians usually denote them with subscripts on random variables.
• are they measured during the course of the experiment? Measured variables are what we actually
measure in the study (such as result in Table 3).
Fixed variables should come first, followed by measured variables, each ordered so that related variables are
contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent
(fixed) variables.
table3
#> # A tibble: 6 × 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
Or spread across two tibbles:
table4a
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
2.2.3.1 pivot_longer() The function pivot_longer() pivots or “lengthens” data by taking multiple
columns and collapsing them into key-value pairs, duplicating all other columns as needed. You use
pivot_longer() when you notice that you have columns that are not variables. This operation is often
described as making wide datasets long or tall.
pivot_longer(data, cols, names_to = "name", values_to = "value", ...)
In order to understand this function one must understand the following arguments:
• data A data frame to pivot.
• cols: Columns to pivot into longer format.
• names_to: A character vector specifying the new column or columns to create from the information
stored in the column names of data specified by cols.
• values_to: A string specifying the name of the column to create from the data stored in cell values.
Consider the following example data set. The column names actually correspond to the values of a
variable year:
table4a
# A tibble: 3 x 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
We can use pivot_longer() to make the data tidy
table4a %>%
pivot_longer(col = c(`1999`, `2000`),
names_to = "year", values_to = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Exercise
The layout of the following data frame called stocks may be accessible to the human eye, however it is not
tidy according to the rules laid out earlier.
# From http://stackoverflow.com/questions/1181060
stocks <- tibble(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
In this example X, Y, and Z are not different variables but values of a variable called stock. How can we
tidy the data?
stocks %>%
pivot_longer(cols = c("X", "Y", "Z"),
names_to = "stock", values_to = "price")
# A tibble: 30 x 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X 2.02
2 2009-01-01 Y 2.38
3 2009-01-01 Z 1.03
4 2009-01-02 X 0.738
5 2009-01-02 Y -1.79
6 2009-01-02 Z 6.53
7 2009-01-03 X -1.69
8 2009-01-03 Y -1.53
9 2009-01-03 Z 3.48
10 2009-01-04 X 0.478
# i 20 more rows
2.2.3.2 pivot_wider() Function pivot_wider() “widens” or spreads data, increasing the number of
columns and decreasing the number of rows. It is the inverse transformation of pivot_longer(). You use
it when an observation is scattered across multiple rows. This operation is often described as making long
datasets wide.
pivot_wider(data, id_cols = NULL, id_expand = FALSE,
names_from = "name", values_from = "value", ...)
The most important arguments are
• data: A data frame to pivot
• names_from, values_from: A pair of arguments describing which column (or columns) to get the
name of the output column (names_from), and which column (or columns) to get the cell values from
(values_from).
Consider the data set:
table2
# A tibble: 12 x 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
An observation is a country in a year, but each observation is spread across two rows.
table2 %>%
pivot_wider(names_from = "type", values_from = "count")
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Exercise
Why does this code fail?
table4a %>%
pivot_longer(c(1999, 2000),
names_to = "year", values_to = "cases")
Exercise
Consider the stocks data in long format from last exercise.
stocksm <- stocks %>%
pivot_longer(cols = c("X", "Y", "Z"),
names_to = "stock", values_to = "price")
In our analysis we want to consider a stock an observation and the price at a given point in time is considered
a variable. How can we use the pivot_wider() function to create the proper formatting?
stocksm %>%
pivot_wider(names_from = "time", values_from = "price")
# A tibble: 3 x 11
stock `2009-01-01` `2009-01-02` `2009-01-03` `2009-01-04` `2009-01-05`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X 2.02 0.738 -1.69 0.478 0.0747
2 Y 2.38 -1.79 -1.53 1.97 1.83
3 Z 1.03 6.53 3.48 4.87 6.88
# i 5 more variables: `2009-01-06` <dbl>, `2009-01-07` <dbl>,
# `2009-01-08` <dbl>, `2009-01-09` <dbl>, `2009-01-10` <dbl>
2.2.3.3 separate() This function pulls apart one column into multiple columns by splitting wherever a
separator character appears.
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,
convert = FALSE, extra = "warn", fill = "warn", ...)
The most important arguments are
• data the data frame
• col the column name or position
• into the names of the new variables to create as character vector
• sep the separator between columns.
Consider table3 above:
table3
# A tibble: 6 x 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
The rate contains both cases and population variables. We, thus, need to split it into two variables.
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
2.2.3.4 unite() This is the inverse of separate(). It combines multiple columns into a single column.
unite(data, col, ..., sep = "_", remove = TRUE)
Its most important arguments are
• data the data frame
• col the name of the new column, as a string or symbol.
• ... A selection of columns. If empty, all variables are selected. You can supply bare variable names,
select all variables between x and z with x:z, exclude y with -y.
• sep Separator to use between values.
Let’s assume we want to change table1 into table3:
table1
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
The cases and population columns can be united with the unite() function.
table1 %>%
unite(rate, cases, population, sep="/")
# A tibble: 6 x 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
2.2.3.5 Missing values Changing the representation of a dataset (from wide to long and vice versa)
brings up the issue of how missing values should be dealt with. As mentioned before, a value can be missing
• Explicitly, i.e. flagged with NA.
• Implicitly, i.e. simply not present in the data.
Let’s again look at one example:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
We have an explicit missing value for year 2015 in quarter 4. But on closer inspection we actually see that
we have one more implicit missing value: 2016 quarter 1.
A function for making missing values explicit in tidy data is complete().
stocks %>%
complete(year, qtr)
# A tibble: 8 x 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 1 NA
6 2016 2 0.92
7 2016 3 0.17
8 2016 4 2.66
The way that a dataset is represented can make implicit values explicit. For example, we can make the
implicit missing value explicit by putting years in the columns:
stocks_wide <- pivot_wider(stocks, names_from = "qtr", values_from = "return")
stocks_wide
# A tibble: 2 x 5
year `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2015 1.88 0.59 0.35 NA
2 2016 NA 0.92 0.17 2.66
If we are not interested in keeping any missing values in the long data format, we can set the argument
values_drop_na = TRUE in pivot_longer():
pivot_longer(stocks_wide, cols = c(`1`, `2`, `3`, `4`),
names_to = "qtr", values_to = "return",
values_drop_na = TRUE)
# A tibble: 6 x 3
year qtr return
<dbl> <chr> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2016 2 0.92
5 2016 3 0.17
6 2016 4 2.66
The function filter() allows to subset rows based on the values of their variables. The first argument is the
data frame, the following arguments are the filtering conditions. For example, to select all flights on January
1st:
filter(flights, month == 1, day == 1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
or with the pipe operator:
flights %>%
filter(month == 1, day == 1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
In base R we do this using subsetting:
flights[(flights$month == 1 & flights$day == 1), ]
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
or using the function subset:
subset(flights, month == 1 & day == 1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercises
1. Find all flights that:
a. Had an arrival delay of two or more hours
b. Flew to Houston (IAH or HOU)
c. Were operated by United (UA), American (AA), or Delta (DL)
d. Departed in summer (July, August, and September)
e. Arrived more than two hours late, but didn’t leave late
f. Were delayed by at least an hour, but made up over 30 minutes in flight
g. Departed between midnight and 6am (included)
2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the
code needed to answer the previous challenges?
3. How many flights have a missing dep_time? What other variables are missing? What might these rows
represent?
2.3.3 Arrange rows with arrange()
Function arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide
more than one column name, each additional column will be used to break ties in the values of preceding
columns:
arrange(flights, year, month, day)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use desc() to re-order by a column in descending order:
arrange(flights, desc(arr_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 7 22 2257 759 898 121 1026
9 2013 12 5 756 1700 896 1058 2020
10 2013 5 3 1133 2055 878 1250 2215
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
# A tibble: 3 x 1
x
<dbl>
1 2
2 5
3 NA
arrange(df, desc(x))
# A tibble: 3 x 1
x
<dbl>
1 5
2 2
3 NA
In base R the same can be achieved with order():
df[order(df$x), ]
# A tibble: 3 x 1
x
<dbl>
1 2
2 5
3 NA
df[order(df$x, decreasing = TRUE), ]
# A tibble: 3 x 1
x
<dbl>
1 5
2 2
3 NA
Exercises
1. How could you use arrange() to sort all missing values to the top? (Hint: use is.na()).
2. Sort flights to find the most delayed flights. Find the flights that left earliest.
3. Sort flights to find the fastest flights.
4. Which flights travelled the longest? Which travelled the shortest?
2.3.4 Select columns with select()
select() allows you to zoom in on a useful subset of the columns. You can select columns by name:
select(flights, year, month, day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
You can also select all columns between two columns (inclusive) with the : operator:
select(flights, year:day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
select(flights, -(year:day))
# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# i 336,766 more rows
# i 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
There are a number of helper functions you can use within select():
• starts_with("abc"): matches names that begin with “abc”.
• ends_with("xyz"): matches names that end with “xyz”.
• contains("ijk"): matches names that contain “ijk”.
• matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables
that contain repeated characters.
• num_range("x", 1:3) matches x1, x2 and x3.
See ?select for more details.
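For instance, a small sketch of two of these helpers (illustrative, not from the original script):
select(flights, starts_with("sched")) # selects sched_dep_time and sched_arr_time
select(flights, contains("delay"))    # selects dep_delay and arr_delay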
select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not
explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that
aren’t explicitly mentioned:
rename(flights, tail_num = tailnum)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a
handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
# A tibble: 336,776 x 19
time_hour air_time year month day dep_time sched_dep_time
<dttm> <dbl> <int> <int> <int> <int> <int>
1 2013-01-01 05:00:00 227 2013 1 1 517 515
2 2013-01-01 05:00:00 227 2013 1 1 533 529
3 2013-01-01 05:00:00 160 2013 1 1 542 540
4 2013-01-01 05:00:00 183 2013 1 1 544 545
5 2013-01-01 06:00:00 116 2013 1 1 554 600
6 2013-01-01 05:00:00 150 2013 1 1 554 558
7 2013-01-01 06:00:00 158 2013 1 1 555 600
8 2013-01-01 06:00:00 53 2013 1 1 557 600
9 2013-01-01 06:00:00 140 2013 1 1 557 600
10 2013-01-01 06:00:00 138 2013 1 1 558 600
# i 336,766 more rows
# i 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
Exercises
1. Use different options to select dep_time, dep_delay, arr_time, and arr_delay from flights.
2. What happens if you include the name of a variable multiple times in a select() call? What if you
select it twice using named indexing, i.e., flights[, c("dep_time", "dep_time")]?
3. Does the result of running the following code surprise you?
select(flights, contains("TIME"))
# A tibble: 336,776 x 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
7 555 600 913 854 158 2013-01-01 06:00:00
8 557 600 709 723 53 2013-01-01 06:00:00
9 557 600 838 846 140 2013-01-01 06:00:00
10 558 600 753 745 138 2013-01-01 06:00:00
# i 336,766 more rows
How do the select helpers deal with upper case by default? How can you change that default?
select(flights, contains("TIME", ignore.case = FALSE))
# A tibble: 336,776 x 0
2.3.5 Add new variables with mutate()
mutate() adds new columns that are functions of existing columns. We first create a narrower data set flights_sml and then add the new variables gain and speed:
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
# A tibble: 336,776 x 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 370.
2 2013 1 1 4 20 1416 227 16 374.
3 2013 1 1 2 33 1089 160 31 408.
4 2013 1 1 -1 -18 1576 183 -17 517.
5 2013 1 1 -6 -25 762 116 -19 394.
6 2013 1 1 -4 12 719 150 16 288.
7 2013 1 1 -5 19 1065 158 24 404.
8 2013 1 1 -3 -14 229 53 -11 259.
9 2013 1 1 -3 -8 944 140 -5 405.
10 2013 1 1 -2 8 733 138 10 319.
# i 336,766 more rows
Note that within mutate() you can refer to columns that you have just created:
flights_sml %>%
mutate(
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# A tibble: 336,776 x 10
year month day dep_delay arr_delay distance air_time gain hours
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 3.78
2 2013 1 1 4 20 1416 227 16 3.78
3 2013 1 1 2 33 1089 160 31 2.67
4 2013 1 1 -1 -18 1576 183 -17 3.05
5 2013 1 1 -6 -25 762 116 -19 1.93
6 2013 1 1 -4 12 719 150 16 2.5
7 2013 1 1 -5 19 1065 158 24 2.63
8 2013 1 1 -3 -14 229 53 -11 0.883
9 2013 1 1 -3 -8 944 140 -5 2.33
10 2013 1 1 -2 8 733 138 10 2.3
# i 336,766 more rows
# i 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variables, use transmute():
flights %>%
transmute(
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# A tibble: 336,776 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 9 3.78 2.38
2 16 3.78 4.23
3 31 2.67 11.6
4 -17 3.05 -5.57
5 -19 1.93 -9.83
6 16 2.5 6.4
7 24 2.63 9.11
8 -11 0.883 -12.5
9 -5 2.33 -2.14
10 10 2.3 4.35
# i 336,766 more rows
There are many functions for creating new variables that you can use with mutate(). The key property is
that the function must be vectorised: it must take a vector of values as input and return a vector with the
same number of values as output.
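As an illustrative sketch (not from the original script), here are a few common vectorised helpers from dplyr:
flights %>%
  transmute(
    dep_delay,
    prev_delay = lag(dep_delay),                # offset: previous row's value
    delay_rank = min_rank(desc(dep_delay)),     # ranking
    delay_cum  = cumsum(coalesce(dep_delay, 0)) # cumulative aggregate (NAs treated as 0)
  )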
In base R you can add a new column, e.g., using the $ selection operator (see 1.6.6):
flights_sml$gain <- flights_sml$arr_delay - flights_sml$dep_delay
Exercises
1. Currently, dep_time and sched_dep_time are convenient to look at, but hard to compute with because
they’re not really continuous variables. Convert them to a more convenient representation of number of
minutes since midnight.
2. Compare air_time with arr_time - dep_time. What do you see?
3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to
be related?
2.3.6 Grouped summaries with summarise()
The last key verb is summarise(), which collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
delay
<dbl>
1 12.6
or with the pipe operator:
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
delay
<dbl>
1 12.6
summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from
the complete dataset to individual groups: when you use the dplyr verbs on a grouped data frame, they are
automatically applied group-wise. For example, if we apply exactly the same code to a data frame grouped
by date, we get the average delay per date:
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# i 355 more rows
Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working
with dplyr: grouped summaries.
2.3.7 Combining multiple operations with the pipe
The elegance of these transformation functions becomes apparent when they are used in combination with the pipe operator:
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
# A tibble: 96 x 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ATL 17215 757. 11.3
5 AUS 2439 1514. 6.02
6 AVL 275 584. 8.00
7 BDL 443 116 7.05
8 BGR 375 378 8.03
9 BHM 297 866. 16.9
10 BNA 6333 758. 11.8
# i 86 more rows
There are a number of useful summary functions:
• Measures of location: mean(x), median(x). It can be useful to combine aggregation with logical subsetting, e.g., to compute the average delay and the average positive delay per day:
flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 5
# Groups: year, month [12]
year month day avg_delay1 avg_delay2
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 12.7 32.5
2 2013 1 2 12.7 32.0
3 2013 1 3 5.73 27.7
4 2013 1 4 -1.93 28.3
5 2013 1 5 -1.53 22.6
6 2013 1 6 4.24 24.4
7 2013 1 7 -4.95 27.8
8 2013 1 8 -3.23 20.8
9 2013 1 9 -0.264 25.6
10 2013 1 10 -5.90 27.3
# i 355 more rows
• Measures of spread: sd(x), IQR(x), mad(x). The standard deviation, sd(x), is the standard measure
of spread. The interquartile range IQR(x) and the median absolute deviation mad(x) are robust
equivalents that may be more useful if you have outliers.
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance),
distance_iqr = IQR(distance),
distance_mad = mad(distance)) %>%
arrange(desc(distance_sd))
# A tibble: 104 x 4
dest distance_sd distance_iqr distance_mad
<chr> <dbl> <dbl> <dbl>
1 EGE 10.5 21 1.48
2 SAN 10.4 21 0
3 SFO 10.2 21 0
4 HNL 10.0 20 0
5 SEA 9.98 20 0
6 LAS 9.91 21 0
7 PDX 9.87 20 0
8 PHX 9.86 20 0
9 LAX 9.66 21 0
10 IND 9.46 20 0
# i 94 more rows
• Measures of position: first(x), nth(x, 2), last(x). For example, we can find the first and last
departure for each day:
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 5
# Groups: year, month [12]
year month day first_dep last_dep
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# i 355 more rows
• Counts: n() takes no arguments, and returns the size of the current group. To count the number
of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use
n_distinct(x).
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
# A tibble: 104 x 2
dest carriers
<chr> <int>
1 ATL 7
2 BOS 7
3 CLT 7
4 ORD 7
5 TPA 7
6 AUS 6
7 DCA 6
8 DTW 6
9 IAD 6
10 MSP 6
# i 94 more rows
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy
to progressively roll up a dataset:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year <- summarise(per_month, flights = sum(flights)))
Exercise:
1. Which carrier has the biggest delays?
2. Which flight is always at least 10 minutes late?
2.3.9 Relational data
The package nycflights13 contains multiple tables which are related to each other:
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
airlines
# A tibble: 16 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airports
# A tibble: 1,458 x 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/~
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/~
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/~
4 06N Randall Airport 41.4 -74.4 523 -5 A America/~
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/~
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/~
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/~
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/~
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/~
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/~
# i 1,448 more rows
planes
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
2 N102UW 1998 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
3 N103US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
4 N104UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
5 N10575 2002 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
6 N105UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
7 N107US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
8 N108UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
9 N109UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
10 N110UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
# i 3,312 more rows
weather
# A tibble: 26,115 x 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# i 26,105 more rows
# i 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
Exercise
Inspect how the 5 tables are connected.
2.3.9.1 Keys The variables used to connect each pair of tables are called keys. A key is a variable (or
set of variables) that uniquely identifies an observation. For example:
• each plane in planes is uniquely identified by its tailnum.
• to identify an observation in weather you need five variables: year, month, day, hour, and origin, so the key consists of 5 variables.
To check whether the identified primary keys are indeed correct, verify that there is only one observation for
each key in the table:
planes %>%
count(tailnum) %>% # how many rows per key
filter(n > 1) # are there any with more rows per key?
# A tibble: 0 x 2
# i 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>% # how many rows per key
filter(n > 1) # are there any with more rows per key?
# A tibble: 3 x 6
year month day hour origin n
<int> <int> <int> <int> <chr> <int>
1 2013 11 3 1 EWR 2
2 2013 11 3 1 JFK 2
3 2013 11 3 1 LGA 2
For weather, one would need to dig further into these 3 cases, where year, month, day, hour, and origin
do not uniquely identify an observation.
There are two types of keys:
• A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a
primary key because it uniquely identifies each plane in the planes table.
• A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a
foreign key because it appears in the flights table where it matches each flight to a unique plane.
2.3.9.2 Mutating joins A mutating join allows you to combine variables from two tables. It first matches
observations by their keys, then copies across variables from one table to the other.
Types of joins:
• inner join: keeps observations that appear in both tables
• outer join: keeps observations that appear in at least one of the tables. There are three types
of outer joins:
– left join keeps all observations in x.
– right join keeps all observations in y.
– full join keeps all observations in x and y.
As an example consider two tibbles which contain grades for students in two different courses. The key for
both tables is the student name:
x <- tibble(student = c("Jim","Sarah","Mike"),
grade_x = c(1, 4, 2))
y <- tibble(student = c("Jim","Sarah","Julia"),
grade_y = c(1, 5, 1),
majorS = c("Epi","PH","Epi"))
Inner join will contain students in both tables only (i.e., Jim and Sarah):
x %>%
inner_join(y, by = "student")
# A tibble: 2 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Sarah 4 5 PH
Left join will keep all students in x:
left_join(x, y, by = "student")
# A tibble: 3 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Sarah 4 5 PH
3 Mike 2 NA <NA>
Right join will keep all students in y:
right_join(x, y, by = "student")
# A tibble: 3 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Sarah 4 5 PH
3 Julia NA 1 Epi
Full join will keep all students in x and all students in y:
full_join(x, y, by = "student")
# A tibble: 4 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Sarah 4 5 PH
3 Mike 2 NA <NA>
4 Julia NA 1 Epi
Assume the key column in table y is named student_name.
y <- tibble(student_name = c("Jim","Sarah","Julia"),
grade_y = c(1, 5, 1),
majorS = c("Epi","PH","Epi"))
Then the join call would be
x %>%
inner_join(y, by = c("student" = "student_name"))
# A tibble: 2 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Sarah 4 5 PH
Duplicate keys So far we assumed that the keys are unique. But that’s not always the case. There are
two possibilities:
1. One table has duplicate keys. This is useful when you want to add additional information, as there
is typically a one-to-many relationship. Assume that table x contains grades in subject x for both the
midterm and the final exam.
x <- tibble(student = c("Jim","Jim","Sarah","Sarah", "Mike"),
grade_x = c(1, 2, 4, 3, 2))
y <- tibble(student = c("Jim","Sarah","Julia"),
grade_y = c(1, 5, 1),
majorS = c("Epi","PH","Epi"))
left_join(x, y, by = "student")
# A tibble: 5 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Jim 2 1 Epi
3 Sarah 4 5 PH
4 Sarah 3 5 PH
5 Mike 2 NA <NA>
2. Both tables have duplicate keys. This is usually an error because in neither table do the keys uniquely
identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian
product:
x <- tibble(student = c("Jim","Jim","Sarah","Sarah", "Mike"),
grade_x = c(1, 2, 4, 3, 2))
y <- tibble(student = c("Jim","Sarah","Sarah","Julia"),
grade_y = c(1, 5, 4, 1),
majorS = c("Epi","PH", "PH", "Epi"))
left_join(x, y, by = "student")
# A tibble: 7 x 4
student grade_x grade_y majorS
<chr> <dbl> <dbl> <chr>
1 Jim 1 1 Epi
2 Jim 2 1 Epi
3 Sarah 4 5 PH
4 Sarah 4 4 PH
5 Sarah 3 5 PH
6 Sarah 3 4 PH
7 Mike 2 NA <NA>
2.3.9.3 Filtering joins Filtering joins match observations in the same way as mutating joins, but affect
the observations, not the variables. There are two types:
• semi_join(x, y) keeps all observations in x that have a match in y.
• anti_join(x, y) drops all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine
you have found the top ten most popular destinations in flights:
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# A tibble: 10 x 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
If you want to find each flight that went to one of those destinations, you can use semi_join():
flights %>%
semi_join(top_dest)
Note that filtering joins never duplicate rows like mutating joins do. This is the main difference between
semi_join and inner_join.
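To see the difference, a minimal sketch using the duplicate-key tables x and y defined above (illustrative, not from the original script):
nrow(inner_join(x, y, by = "student")) # 6: Sarah's rows are combined with both of her matches in y
nrow(semi_join(x, y, by = "student"))  # 4: each row of x is kept at most once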
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match:
flights %>%
anti_join(top_dest) # contains all flights to dest other than top 10
2.4.1 case_when
The case_when() function allows you to vectorise multiple if and else if statements. It is the tidyverse
analogue of switch(), but with more "robust" behavior (remember, using switch() is recommended with
character inputs).
As an example, let’s make a new variable in data set flights which categorizes departures into morning
(05:01-11:00), noon (11:01-13:00), afternoon (13:01-18:00), evening (18:01-23:00), night (23:01-05:00).
flights_new <- flights %>%
mutate(new_var = case_when(
dep_time > 500 & dep_time <= 1100 ~ "morning",
dep_time > 1100 & dep_time <= 1300 ~ "noon",
dep_time > 1300 & dep_time <= 1800 ~ "afternoon",
dep_time > 1800 & dep_time <= 2300 ~ "evening",
dep_time > 2300 | dep_time <= 500 ~ "night"
))
flights_new <- flights_new %>%
mutate(new_var = as.factor(new_var))
levels(flights_new$new_var)
[1] "afternoon" "evening" "morning" "night" "noon"
The levels are ordered alphabetically. We can compute the average delays per time of day:
flights_new %>%
  group_by(new_var) %>%
  summarise(
    delay_departure = mean(dep_delay, na.rm = TRUE),
    delay_arrival = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
# A tibble: 6 x 4
new_var delay_departure delay_arrival n
<fct> <dbl> <dbl> <int>
1 afternoon 11.9 6.58 104679
2 evening 28.8 22.5 73776
3 morning 1.24 -4.58 113075
4 night 105. 97.8 4068
5 noon 6.58 1.15 32923
6 <NA> NaN NaN 8255
Ideally, we would like the table to be ordered by time of day, i.e., chronologically rather than alphabetically.
To achieve this, the levels of the factor variable need to be reordered; for this fct_relevel() can be used.
library("forcats")
flights_summary <- flights_new %>%
mutate(new_var = fct_relevel(new_var, "morning", "noon")) %>%
group_by(new_var) %>%
summarise(
delay_departure = mean(dep_delay, na.rm = TRUE),
delay_arrival = mean(arr_delay, na.rm = TRUE),
n = n()
)
flights_summary
# A tibble: 6 x 4
new_var delay_departure delay_arrival n
<fct> <dbl> <dbl> <int>
1 morning 1.24 -4.58 113075
2 noon 6.58 1.15 32923
3 afternoon 11.9 6.58 104679
4 evening 28.8 22.5 73776
5 night 105. 97.8 4068
6 <NA> NaN NaN 8255
One can also order the levels by their frequency, which is useful in some cases for plotting; for this
fct_infreq() can be used.
flights_new %>%
mutate(new_var = fct_infreq(new_var)) %>%
count(new_var)
# A tibble: 6 x 2
new_var n
<fct> <int>
1 morning 113075
2 afternoon 104679
3 evening 73776
4 noon 32923
5 night 4068
6 <NA> 8255
One can also order the levels by a summary statistic, which is useful in some cases for plotting. For example,
we may want the time of day to be ordered by the average or the maximum distance of the flights departing in
that time frame. Function fct_reorder(f, x) orders the levels of f based on the median value of x (by
default), and fct_reorder2(f, x, y) reorders the factor by the values of y associated with the largest values of x:
flights_new %>%
mutate(new_var = fct_reorder(new_var, distance)) %>%
group_by(new_var) %>%
summarise(avg_dist = mean(distance, na.rm = TRUE),
med_dist = median(distance, na.rm = TRUE),
n = n())
# A tibble: 6 x 4
new_var avg_dist med_dist n
<fct> <dbl> <dbl> <int>
1 noon 952. 762 32923
2 night 942. 764 4068
3 evening 1052. 866 73776
4 afternoon 1040. 872 104679
5 morning 1086. 944 113075
6 <NA> 695. 583 8255
Note that the NA level is always last.
More powerful than changing the orders of the levels is changing their values. The most useful function for
this purpose is fct_recode().
flights_new %>%
mutate(new_var = fct_recode(new_var,
"Noon (starting with 11:01)" = "noon",
"Afternoon (starting with 13:01)" = "afternoon",
"Morning (starting with 05:01)" = "morning",
"Evening (starting with 18:01)" = "evening",
"Night (starting with 23:01)" = "night"
)) %>%
count(new_var)
# A tibble: 6 x 2
new_var n
<fct> <int>
1 Afternoon (starting with 13:01) 104679
2 Evening (starting with 18:01) 73776
3 Morning (starting with 05:01) 113075
4 Night (starting with 23:01) 4068
5 Noon (starting with 11:01) 32923
6 <NA> 8255
For recoding one can of course also use base R functions. Here is an example, though it is more
cumbersome:
table(flights_new$new_var)

afternoon   evening   morning     night      noon
   104679     73776    113075      4068     32923
# a function to capitalize the word ...
# One can use regular expressions, if familiar with them.
capitalize <- function(s) {
paste0(toupper(substring(s, 1, 1)),
substring(s, 2))
}
flights_new$new_var1 <- flights_new$new_var
levels(flights_new$new_var1) <-
paste0(capitalize(levels(flights_new$new_var1)),
" (starting with ", c("13", "18", "5", "23", "11"), ":01)")
To combine groups, you can assign multiple old levels to the same new level:
flights_new %>%
mutate(new_var = fct_recode(new_var,
"Noon/Afternoon (starting with 11:01)" = "noon",
"Noon/Afternoon (starting with 11:01)" = "afternoon",
"Morning (starting with 05:01)" = "morning",
"Evening (starting with 18:01)" = "evening",
"Night (starting with 23:01)" = "night"
)) %>%
count(new_var)
# A tibble: 5 x 2
new_var n
<fct> <int>
1 Noon/Afternoon (starting with 11:01) 137602
2 Evening (starting with 18:01) 73776
3 Morning (starting with 05:01) 113075
4 Night (starting with 23:01) 4068
5 <NA> 8255
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode().
flights_new %>%
mutate(new_var = fct_collapse(new_var,
"Noon/Afternoon (starting with 11:01)" = c("noon", "afternoon"),
"Night/Morning (starting with 23:01)" = c("morning", "night"),
"Evening (starting with 18:01)" = "evening"
)) %>%
count(new_var)
# A tibble: 4 x 2
new_var n
<fct> <int>
1 Noon/Afternoon (starting with 11:01) 137602
2 Evening (starting with 18:01) 73776
3 Night/Morning (starting with 23:01) 117143
4 <NA> 8255
Finally, if you just want to lump together all the small groups to make a plot or table simpler, you can use
fct_lump():
flights_new %>%
mutate(new_var = fct_lump(new_var)) %>%
count(new_var)
# A tibble: 5 x 2
new_var n
<fct> <int>
1 afternoon 104679
2 evening 73776
3 morning 113075
4 Other 36991
5 <NA> 8255
The default behavior is to progressively lump together the smallest groups, ensuring that the aggregate is
still the smallest group. One can choose how many of the most frequent levels to keep by setting the n
argument to an integer (here n exceeds the number of levels, so nothing is lumped):
flights_new %>%
mutate(new_var = fct_lump(new_var, n = 5000)) %>%
count(new_var)
# A tibble: 6 x 2
new_var n
<fct> <int>
1 afternoon 104679
2 evening 73776
3 morning 113075
4 night 4068
5 noon 32923
6 <NA> 8255
Exercise
What are the NAs in new_var? Can you assign them a meaning?
[1] "2025-03-27"
now()
To create a date from a string, identify the order in which year, month, and day appear in your dates, then
arrange "y", "m", and "d" in the same order. That gives you the
name of the lubridate function that will parse your date. For example:
ymd("2021-02-02")
[1] "2021-02-02"
mdy("May 10, 2022")
[1] "2022-05-10"
dmy("31-Jan-2017")
[1] "2017-01-31"
To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing
function:
dmy_hm("31-Jan-2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
Sometimes the individual components of a date-time are spread across several columns, as in the flights data. To combine them into a date-time, one can use make_datetime():
flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
# A tibble: 336,776 x 6
year month day hour minute departure
<int> <int> <int> <dbl> <dbl> <dttm>
1 2013 1 1 5 15 2013-01-01 05:15:00
2 2013 1 1 5 29 2013-01-01 05:29:00
3 2013 1 1 5 40 2013-01-01 05:40:00
4 2013 1 1 5 45 2013-01-01 05:45:00
5 2013 1 1 6 0 2013-01-01 06:00:00
6 2013 1 1 5 58 2013-01-01 05:58:00
7 2013 1 1 6 0 2013-01-01 06:00:00
8 2013 1 1 6 0 2013-01-01 06:00:00
9 2013 1 1 6 0 2013-01-01 06:00:00
10 2013 1 1 6 0 2013-01-01 06:00:00
# i 336,766 more rows
The departure and arrival times are in a rather unusual format. To create proper date-times we would need
something like:
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime(year, month, day, dep_time %/% 100, dep_time %% 100),
arr_time = make_datetime(year, month, day, arr_time %/% 100, arr_time %% 100)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
# A tibble: 328,063 x 9
origin dest dep_delay arr_delay dep_time sched_dep_time
<chr> <chr> <dbl> <dbl> <dttm> <int>
1 EWR IAH 2 11 2013-01-01 05:17:00 515
2 LGA IAH 4 20 2013-01-01 05:33:00 529
3 JFK MIA 2 33 2013-01-01 05:42:00 540
4 JFK BQN -1 -18 2013-01-01 05:44:00 545
5 LGA ATL -6 -25 2013-01-01 05:54:00 600
6 EWR ORD -4 12 2013-01-01 05:54:00 558
7 EWR FLL -5 19 2013-01-01 05:55:00 600
8 LGA IAD -3 -14 2013-01-01 05:57:00 600
9 JFK MCO -3 -8 2013-01-01 05:57:00 600
10 LGA ORD -2 8 2013-01-01 05:58:00 600
# i 328,053 more rows
# i 3 more variables: arr_time <dttm>, sched_arr_time <int>, air_time <dbl>
You may want to switch between a date-time and a date. That is the job of as_datetime() and as_date():
as_datetime(today())
[1] "2025-03-27"
Sometimes you will get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in
seconds, use as_datetime(); if it is in days, use as_date().
as_datetime(60 * 60 * 10)
[1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
[1] "1980-01-01"
One can also use accessor functions to extract the individual components of a date-time:
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
[1] 2016
month(datetime)
[1] 7
month(datetime, label = TRUE)
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
month(datetime, label = TRUE, abbr = FALSE)
[1] July
12 Levels: January < February < March < April < May < June < ... < December
mday(datetime)
[1] 8
yday(datetime)
[1] 190
wday(datetime)
[1] 6
wday(datetime, label = TRUE)
[1] Fri
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(datetime, label = TRUE, abbr = FALSE)
[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hour(datetime)
[1] 12
minute(datetime)
[1] 34
second(datetime)
[1] 56
One can also perform operations on dates. For example, if we want to add one year to the scheduled departure
day in the flights data we can simply do:
flights_dt %>%
select(dep_time) %>%
mutate(dep_time_1_year = dep_time + years(1))
# A tibble: 328,063 x 2
dep_time dep_time_1_year
<dttm> <dttm>
1 2013-01-01 05:17:00 2014-01-01 05:17:00
2 2013-01-01 05:33:00 2014-01-01 05:33:00
3 2013-01-01 05:42:00 2014-01-01 05:42:00
4 2013-01-01 05:44:00 2014-01-01 05:44:00
5 2013-01-01 05:54:00 2014-01-01 05:54:00
6 2013-01-01 05:54:00 2014-01-01 05:54:00
7 2013-01-01 05:55:00 2014-01-01 05:55:00
8 2013-01-01 05:57:00 2014-01-01 05:57:00
9 2013-01-01 05:57:00 2014-01-01 05:57:00
10 2013-01-01 05:58:00 2014-01-01 05:58:00
# i 328,053 more rows
Functions seconds(), minutes(), hours(), days(), weeks(), and months() can be used similarly.
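For instance (a small illustrative sketch, not from the original script):
ymd("2021-01-01") + months(3) + days(10)
[1] "2021-04-11"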
For more information consult Chapter 16, R for Data Science.
Functions like write.table() give control over the exported file: one can, e.g., omit row or column names, specify what to use for missing values, add or remove quotations around character strings, etc.
write.table(iris, file = "data/iris_copy.csv")
write.table(iris, file = "data/iris_copy.csv", sep = "\t",
row.names = FALSE, na = "MISSING!")
• If you have a very large data frame you want to export, then the function write.matrix of the package
MASS might be more suitable since it requires much less memory.
• readr write functions are about twice as fast, and they do not write row names:
# as csv
write_csv(iris, path = "data/iris_copy.csv")
# as excel
write_excel_csv(df, path = "df", col_names = FALSE)
• haven can be used to save other types of files such as .dta, .sav etc.
However, it is often of interest to export R object files (data or other types of objects such as lists, model
outputs etc.). There are three primary ways that people tend to save R data/objects: as .RData, .rda, or as
.rds files. .rda is just short for .RData, therefore, these file extensions represent the same underlying object
type. You use the .rda or .RData file types when you want to save several, or all, objects and functions that
exist in your global environment. On the other hand, if you only want to save a single R object such as a
data frame, function, or statistical model result, it is best to use the .rds file type.
The following illustrates how you save R objects with each type.
# save() can be used to save multiple objects in you global environment,
# in this case I save two objects to a .RData file
x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.RData")
As you can see, you can also use .rda or .RData to save a single object, but the benefit of .rds is that it only
saves a representation of the object and not the name, whereas .rda and .RData save both the object and
its name. As a result, with .rds the saved object can be loaded into a named object within R that is different
from the name it had when originally saved.
# x <- load("xy.RData") this will not work
load("xy.RData") # this will load x and y.
str(x) # I need to know the name of the saved object
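A minimal sketch of the .rds workflow (illustrative, not from the original script):
saveRDS(x, file = "x.rds") # saves only the object, not its name
x2 <- readRDS("x.rds")     # load it back under any name
identical(x, x2)           # TRUE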
3 Part 3: Visualization
Data visualization is an integral part of data analysis. It is used both for exploratory and for explanatory
purposes. The first gives you insight about the data, lets you generate hypotheses about the data,
and allows human intuition to participate in the selection of a suitable model for your data. The second
purpose of graphical tools is to communicate the results of your analysis to others.
Choosing the right graph for your data and illustrating it in the most informative way is a form of art and
requires a lot of experience. For inspiration one can look at the following resources:
• the R graph gallery https://www.r-graph-gallery.com/ (for all kinds of graphs)
• the vcd package for visualizing categorical data.
R has several graphics systems for producing graphs, which follow different philosophies.
1. Base R graphics
2. Trellis plots (package lattice)
3. Grammar of graphics plots (package ggplot2)
4. Grid graphics
Here we will cover 1. and 3.
3.1.1.1 Univariate continuous variables hist(x) produces a histogram.
x <- rnorm(100)
hist(x)
[Figure: histogram of x (Frequency vs. x)]
boxplot(x) produces a boxplot.
x <- rnorm(100)
boxplot(x)
[Figure: boxplot of x]
plot(x)
[Figure: index plot of x (values plotted against Index); density plot titled "density(x = rnorm(100))"]
stripchart(x) produces a 1D scatterplot.
x <- rnorm(50)
stripchart(x)
[Figure: stripchart of x]
In order to compare the distribution with a normal or any other given distribution, one can use qqnorm(x),
which plots the quantiles of x against those of the normal distribution.
y <- rt(200, df = 5)
qqnorm(y)
[Figure: normal Q-Q plot of y (sample vs. theoretical quantiles)]
qqplot(x,y) plots the quantiles of x against the quantiles of y.
y <- rt(200, df = 5)
qqplot(y, rt(300, df = 5))
[Figure: Q-Q plot of y against rt(300, df = 5)]
3.1.1.2 Univariate categorical variables barplot(x) produces a bar plot if x is a factor. This yields
the same plot as plot(x).
fac <- factor(sample(1:3, 100, replace = TRUE), labels=c("red", "green", "blue"))
plot(fac)
[Figure: bar plot of fac with categories red, green, blue]
3.1.1.3 Covariation of two continuous variables plot(x,y) produces a scatter plot if x and y are
numeric.
x <- rnorm(100)
y <- rexp(100)
plot(x,y)
[Figure: scatterplot of y against x]
For several continuous variables we can plot pairwise scatterplots of all pairs of variables.
pairs(X) produces a scatter plot matrix if X is a matrix or data frame. This is equivalent to plot(X).
df <- data.frame("x" = rnorm(100), y = rpois(100, lambda = 1), z = rexp(100))
pairs(df)
[Figure: scatterplot matrix of x, y, z]
3.1.1.4 Covariation of two discrete variables plot(x,y) produces a spine plot if x and y are factors.
fac1 <- factor(sample(1:3, 100, replace = TRUE), labels=c("red", "green", "blue"))
fac2 <- factor(sample(1:2, 100, replace = TRUE), labels=c("orange", "purple"))
plot(fac1, fac2)
[Figure: spine plot of fac1 (x axis) against fac2 (orange/purple)]
3.1.1.5 Covariation of one continuous and one discrete or categorical variable In order to
compare several distributions one can do the following:
boxplot(x ~ y) produces a boxplot of x for each level of y if x is numeric and y is a factor. This yields the
same plot as plot(x~y).
y <- factor(c(rep("red", 30), rep("blue", 30), rep("green", 30)))
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 2), rnorm(30, mean=1))
boxplot(x ~ y)
[Figure: boxplots of x for each level of y]
y <- factor(c(rep("red", 30), rep("blue", 30), rep("green", 30)))
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 2), rnorm(30, mean=1))
plot(x ~ y)
[Figure: boxplots of x by y produced by plot(x ~ y)]
If the categorical variable is to be plotted on the y axis (and should be seen as a dependent variable in the
further analysis), R plots by default a spineplot:
y <- factor(c(rep("red", 30), rep("blue", 30), rep("green", 30)))
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 2), rnorm(30, mean=1))
plot(y ~ x)
[Figure: spine plot of y against x]
3.1.1.6 Three dimensional data For three dimensional data, the base package provides image(),
contour(), and persp().
image(x,y,z) plots a grid of rectangles along the ascending x, y values and fills them with different colours
to represent the values of z.
require(grDevices)
x <- y <- seq(-4 * pi, 4 * pi, len = 27)
r <- sqrt(outer(x^2, y^2, "+"))
z <- cos(r^2) * exp(-r/6)
image(x, y, z, col = gray((0:32)/32))
129
10
5
0
y
−5
−10
−10 −5 0 5 10
x
contour(x,y,z) draws a contour plot for z.
x <- -6:16
contour(outer(x, x), method = "edge", vfont = c("sans serif", "plain"))
[Figure: contour plot]
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
[Figure: perspective plot of z]
df <- data.frame("x" = c(rnorm(50, mean = 1), rnorm(50)),
"y" = c(rep("A", 50), rep("B", 50)))
[Figure: boxplots of the two groups with title "Boxplots", axis labels "Groups" and "Values", and a subtitle]
3.1.3.1 Points and lines The functions points() and lines() add points or lines to the current plot.
The different types can also be specified using the type= argument.
The function abline() is, however, often more convenient for adding straight lines. It can be used
in the following ways (see the sketch after this list):
• abline(a,b) adds a line with intercept a and slope b.
• abline(h=y) y defines the y coordinate for a horizontal line.
• abline(v=x) x defines the x coordinate for a vertical line.
• abline(lm.object) if lm.object is a list of length 2 it adds a line using the first value as intercept and
the second as slope.
• polygon() adds a polygon to the existing plot.
• axis(side,...) can be used to modify all aspects (position, label, tickmarks, ...) of the axis. This
function is mainly used when the argument axes was set to FALSE in the high-level plotting function.
• text(x,y,labels,...) adds text to a plot at the specified coordinates, i.e., the i-th label is put at the
position (x_i, y_i).
• legend(x,y,legend,...) adds a legend to a specified position in the plot.
• fill = v colours of filled boxes.
• lty = v line styles.
• lwd = v line widths.
• col = v colours of points or lines.
• pch = v plotting characters.
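As a small illustrative sketch of these annotation functions (using the built-in cars data; not part of the original script):
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Distance")
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "blue", lwd = 2)   # regression line (intercept and slope taken from fit)
abline(h = mean(cars$dist), lty = 2) # horizontal line at the mean distance
legend("topleft", legend = c("fit", "mean"), col = c("blue", "black"), lty = c(1, 2))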
The standard colors and plotting characters can be seen in the following figure:
plot(1, 1, xlim = c(1,10), ylim = c(0,5), type = "n", axes = FALSE)
points(1:9, rep(4.5,9),cex = 1:9,col = 1:9, pch = 0:8)
text(1:9, rep(3.5,9), labels = paste(0:8), cex = 1:9, col = 1:9)
points(1:9, rep(2,9), pch = 9:17)
text((1:9) + 0.25, rep(2,9), paste(9:17))
points(1:8, rep(1,8), pch = 18:25)
text((1:8) + 0.25, rep(1,8), paste(18:25))
[Figure: standard plotting characters (pch 0-25) and colors]
Note: mathematical symbols and formulae can also be added as text; the labels are then expressions.
For details see the help for plotmath.
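For instance (an illustrative sketch, not from the original script):
plot(1:10, main = expression(hat(y) == hat(beta)[0] + hat(beta)[1] * x))
text(4, 8, expression(bar(x) == sum(frac(x[i], n), i == 1, n)))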
df <- data.frame("x"=c(rnorm(50, mean=1), rnorm(50)), "y"=c(rep("A", 50), rep("B", 50)))
[Figure: boxplots of the two groups with title "Boxplots", x label "Groups", y label "Values", and a subtitle]
Sigma1 <- matrix(c(1, 0.7, 0.7, 1), ncol = 2)
Sigma2 <- matrix(c(1, 0, 0, 1), ncol = 2)
mu1 <- c( 1, 0)
mu2 <- c(-1, 1)
n <- 100
df <- data.frame(
rbind(MASS::mvrnorm(n, mu = mu1, Sigma = Sigma1),
MASS::mvrnorm(n, mu = mu2, Sigma = Sigma2)),
"group" = c(rep("A", n), rep("B", n)))
[Figure: scatterplot of the two bivariate normal samples, Variable 1 vs. Variable 2, with a legend for Group A and Group B]
After calling identify(x, y) one could click the left mouse button, and the index of the point closest to the
cursor would appear on the plot. Given a vector of the same length as x and y containing labels for the
observations, one could show these labels with
identify(x, y, labels)
This ends as soon as all points are identified, or by pressing the right mouse button and choosing stop.
plot(x,y, pch="*")
3.1.5.2 Parameters concerning the figure margins There are two arguments to control the margins.
The argument mai sets the margins measured in inches, whereas the argument mar measures them in numbers
of text lines. The margin area itself is divided into four parts: the bottom is part 1, the left part 2, the top part 3 and
the right part 4. The different parts are addressed with the corresponding index of the margin vector.
For instance:
mai=c(1,2,3,4) (1 inch bottom, 2 inches left, 3 inches top, 4 inches right)
mar=c(1,2,3,4) (1 line bottom, 2 lines left, 3 lines top, 4 lines right)
The outer margins (by default 0) can be set using the oma and omi arguments, analogously to the mar and mai
arguments. Text can be added to the outer margins using the mtext function.
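A small illustrative sketch (not from the original script):
op <- par(oma = c(0, 0, 2, 0), # two outer margin lines at the top
          mar = c(4, 4, 2, 1)) # inner margins in text lines
plot(rnorm(20))
mtext("Text in the outer margin", side = 3, outer = TRUE)
par(op) # restore the previous settings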
3.1.5.3 Multiple plots in one figure There are two possibilities to place several plots in one figure
in base R. The first is simply setting the mfrow or mfcol parameters, the second is using the function
layout().
mfrow and mfcol allow one to put several figures into one window. Each figure still has its own plotting area and
margins, but in addition one can optionally add a surrounding overall outer margin. To do that one has to
define an array which sets the size of the multiple-figures environment. The two parameters
define such an environment; the only difference is that mfcol fills the array by columns and mfrow by rows.
par(mfrow=c(1,2))
curve(sin, -2*pi, 2*pi)
curve(tan, -2*pi, 2*pi)
137
1.0
15
10
0.5
5
tan(x)
sin(x)
0.0
0
−5
−0.5
−15
−1.0
−6 −4 −2 0 2 4 6 −6 −4 −2
x
There exists a special function to specify complex plot arrangements: layout().
layout(mat, widths = rep.int(1, ncol(mat)),
heights = rep.int(1, nrow(mat)), respect = FALSE)
It divides the device up into as many rows and columns as there are in matrix mat, with the column-widths
and the row-heights specified in the respective arguments.
x <- pmin(3, pmax(-3, stats::rnorm(50)))
y <- pmin(3, pmax(-3, stats::rnorm(50)))
xhist <- hist(x, breaks = seq(-3,3,0.5), plot = FALSE)
yhist <- hist(y, breaks = seq(-3,3,0.5), plot = FALSE)
top <- max(c(xhist$counts, yhist$counts))
xrange <- c(-3, 3)
yrange <- c(-3, 3)
nf <- layout(matrix(c(2,0,1,3),2,2,byrow = TRUE), c(3,1), c(1,3), TRUE)
layout.show(nf)
138
2
1 3
par(mar = c(3,3,1,1))
plot(x, y, xlim = xrange, ylim = yrange, xlab = "", ylab = "")
par(mar = c(0,3,1,1))
barplot(xhist$counts, axes = FALSE, ylim = c(0, top), space = 0)
par(mar = c(3,0,1,1))
barplot(yhist$counts, axes = FALSE, xlim = c(0, top), space = 0,
horiz = TRUE)
139
3
2
1
0
−1
−2
−3
−3 −2 −1 0 1 2 3
In R several graphics devices can be used at the same time. To start a new device one calls its name; e.g.,
windows() opens a new graphics window when running R under Windows. The most recently opened device is
the active one. To reactivate an older device one uses the function dev.set; dev.set(1) would for
example reactivate the first device. Plotting commands affect only the active device.
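A minimal sketch of working with several devices (illustrative, not from the original script):
pdf("plots.pdf")    # open a pdf device; it becomes the active device
plot(1:10)          # drawn into plots.pdf
dev.new()           # open a second device (a new window)
plot(10:1)          # drawn into the new device
dev.set(dev.prev()) # reactivate the previous device
dev.off()           # close the now-active device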
3.2 ggplots
According to Wickham (2010), a grammar of graphics is a tool that enables us to concisely describe the
components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”
or “lineplot”) and gain insight into the deep structure that underlies statistical graphics. The article proposes
a parameterization of the grammar of graphics, based around the idea of building up a graphic from multiple
layers of data.
Consider a simple data set:
Table 15: Simple data set to plot.
A B C D
2 3 a 1
1 2 a 2
4 5 b 6
9 10 b 10
Assume we want to create a scatterplot of A and B where variable C is also mapped to the shape of the
points.
The steps we would do to create the plot are the following:
• We would create another data set that reflects the mapping of the x-position in the graph to A,
y-position in the graph to B and shape parameter to C. x-position, y-position, and shape are examples
of aesthetics. All other variables are not needed and would be removed. See Table 16.
• We could create different plots with these aesthetics (we could use bars, lines, points etc). These are
examples of geometric objects.
• The next thing we need to do is to convert these numbers measured in data units to numbers measured
in physical units, things that the computer can display. To do that we need to know that we are going
to use linear scales and a Cartesian coordinate system. We can then convert the data units to aesthetic
units, which have meaning to the underlying drawing system. Categories are also mapped to shapes
such as circles and squares. These transformations are the responsibility of scales.
• In general, there is another step that one often uses: a statistical transformation. Here we are using the
identity transformation, but there are many others that are useful, such as binning or aggregating.
• We might want to annotate the plot by a title.
Table 16: Simple data set with variables named as aesthetics.
x y shape
2 3 a
1 2 a
4 5 b
9 10 b
3.2.1.1 Layers
3.2.1.1.1 Data and mapping Data are obviously a critical part of the plot, but it is important to
remember that they are independent from the other components: we can construct a graphic that can be
applied to multiple datasets. Data are what turns an abstract graphic into a concrete graphic. Along with
the data, we need a specification of which variables are mapped to which aesthetics. For example, we might
map weight to x position, height to y position, and age to size.
In general there are many aesthetics to map to:
• x and y axis
• color (and fill color)
• style (line style, shapes in scatter plots)
• size
• alpha (transparency used when plotting)
3.2.1.1.2 Statistical transformation A statistical transformation, or stat, transforms the data, typically
by summarizing them in some manner. For example, a useful stat is the smoother, which calculates the mean
of y, conditional on x, subject to some restriction that ensures smoothness.
3.2.1.1.3 Geometric object Geometric objects, or geoms for short, control the type of plot that you
create. For example, using a point geom will create a scatterplot, whereas using a line geom will create a line
plot. We can classify geoms by their dimensionality:
• 0d: point, text,
• 1d: path, line (ordered path),
• 2d: polygon, interval.
Geoms are mostly general purpose, but do require certain outputs from a statistic. For example, the boxplot
geom requires the position of the upper and lower fences, upper and lower hinges, the middle bar, and the
outliers. Any statistic used with the boxplot needs to provide these values. Every geom has a default statistic,
and every statistic a default geom. For example, the bin statistic defaults to using the bar geom to produce a
histogram. Overriding these defaults will still produce a valid plot, but it may violate graphical conventions.
3.2.1.1.4 Position adjustment Sometimes we need to tweak the position of the geometric elements on
the plot, when otherwise they would obscure each other.
3.2.1.2 Scales A scale controls the mapping from data to aesthetic attributes, and so we need one scale
for each aesthetic property used in a layer. Scales are common across layers to ensure a consistent mapping
from data to aesthetics. For example, the color gradient scale maps a segment of the real line to a path
through a color space.
3.2.1.3 Coordinate system A coordinate system maps the position of objects onto the plane of the plot.
Position is often specified by two coordinates (x, y), but could be any number of coordinates. The Cartesian
coordinate system is the most common coordinate system for two dimensions, whereas polar coordinates
and various map projections are used less frequently. For higher dimensions, we have parallel coordinates
(a projective geometry), mosaic plots (a hierarchical coordinate system), and linear projections onto the
plane. Coordinate systems affect all position variables simultaneously and differ from scales in that they
also change the appearance of the geometric objects. For example, in polar coordinates, bar geoms look like
segments of a circle. Additionally, scaling is performed before statistical transformation, whereas coordinate
transformations occur afterward.
3.2.1.4 Faceting The faceting specification describes which variables should be used to split up the data,
and how they should be arranged.
3.2.2 Examples
These examples have been adapted from materials by Ursula Laa.
As an example we will use the package palmerpenguins.
install.packages("palmerpenguins")
library("ggplot2")
library("palmerpenguins")
We build the plot up in layers. First we specify the data and map body_mass_g to the x aesthetic:
ggplot(penguins) +
  aes(x = body_mass_g)
[Figure: empty plot canvas with x axis body_mass_g ranging from 3000 to 6000]
Finally we specify the geom we want to draw:
ggplot(penguins) +
aes(x = body_mass_g) +
geom_histogram()
144
25
20
15
count
10
145
75
50
count
25
146
75
species
50
count
Adelie
Chinstrap
Gentoo
25
147
60
40
species
count
Adelie
Chinstrap
Gentoo
20
148
Adelie Chinstrap Gentoo
60
40
species
count
Adelie
Chinstrap
Gentoo
20
3000 4000 5000 6000 3000 4000 5000 6000 3000 4000 5000 6000
body_mass_g
Let’s now have a look at geoms other than geom_histogram.
Bar charts:
ggplot(penguins, mapping = aes(x = island, fill = species)) +
geom_bar()
[Figure: stacked bar chart of island counts, filled by species]
[Figures: plots of body_mass_g against other variables, colored by sex and by species; warnings: geom_point() removed 2 and 11 rows containing missing values or values outside the scale range]
154
25
20
15 species
count
Adelie
Chinstrap
10 Gentoo
155
6e−04
4e−04
density
2e−04
0e+00
[Figure: faceted histogram titled "My histogram" with panels Adelie/Chinstrap/Gentoo, y axis "Counts", x axis "Body mass in g"]
Common elements in facets
We might want to have a common element across all facets. For example, we might want to draw a vertical
line in the histograms in all facets corresponding to the overall mean in the data set. For this, we need to
create a new data frame which contains a row for each species, with the average body mass of the whole
data set repeated for each species. The vertical line can then be drawn using geom_vline(), which will use
this new data set:
avg_body_mass <- mean(penguins$body_mass_g, na.rm = TRUE)
dat_mean <- data.frame(body_mass_g = rep(avg_body_mass, 3),
species = c("Adelie", "Chinstrap", "Gentoo"))
ggplot(penguins) +
aes(x = body_mass_g, fill = species) +
geom_histogram(binwidth = 500) +
geom_vline(data = dat_mean, aes(xintercept = body_mass_g)) +
facet_wrap(~ species) +
theme(legend.position = "none")
[Figure: faceted histograms of body_mass_g with a vertical line at the overall mean in each facet]
Note that one can use different data sets in the different geoms.
Let’s consider another example, where the density over the whole sample is plotted in each facet. Again, we
need to create a data set which repeats the whole body-mass vector three times, once for each species.
dat_density <- data.frame(body_mass_g = rep(penguins$body_mass_g, 3),
species = rep(c("Adelie", "Chinstrap", "Gentoo"),
each = nrow(penguins)))
ggplot(penguins) +
aes(x = body_mass_g, y = stat(density)) +
geom_histogram(binwidth = 500, aes(fill = species)) +
geom_density(data = dat_density) +
facet_wrap(~ species) +
theme(legend.position = "none")
[Figure: faceted density-scaled histograms of body_mass_g with the overall density curve in each facet]
3.3.0.1 gganimate Using gganimate you can add transition_ functions to generate an animation
from your ggplot.
For example, transition_states() splits up plot data by a discrete variable and animates between the
different states.
# install.packages("gganimate")
library("gganimate")
anim <- ggplot(penguins, mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species)) +
geom_point() +
transition_states(species)
anim
It might also be useful to show what each frame in the animation represents.
anim +
ggtitle('Now showing {closest_state}',
subtitle = 'Frame {frame} of {nframes}')
Using the enter_ and exit_ functions you can also fine-tune how the data appears and disappears from the
plot.
anim +
enter_fade() +
exit_shrink()
3.3.0.2 plotly Using the ggplotly() function from the plotly package you can generate an interactive
graph, e.g. with tooltip text. The function ggplotly() will crawl the ggplot2 figure, extract and translate all
of the attributes of the ggplot2 figure into JSON (the colors, the axes, the chart type, etc), and draw the
graph with plotly.js.
# install.packages("plotly")
library("plotly")
p <- ggplot(penguins,
  mapping = aes(x = flipper_length_mm,
                y = body_mass_g,
                color = species)) +
  geom_point()
pltly <- ggplotly(p) # turn the ggplot object into an interactive plotly graph
You can customize the plots by using layout(), which specifies interaction modes, and style(), for modifying
data-level attributes.
With layout() you can modify aspects of the layout such as the default hovermode behavior, the styling of
hover labels (hoverlabel), the click+drag mode (dragmode) and/or constraining rectangular selections
(dragmode = 'select') vertically or horizontally (selectdirection), as well as add dropdowns
(https://plot.ly/r/dropdowns/), sliders (https://plot.ly/r/sliders/), and rangesliders.
pltly <- pltly %>% layout(dragmode = "pan")
pltly
With style() you can control the tooltip content, turn hovering on/off or add marker points to lines.
In this example, information will be provided when hovering over the red group, but not the other two (see
Figure below):
pltly <- pltly %>% style(hoverinfo = "none", traces = 2:3)
library("shiny")
The UI controls how the app should look, and the server function controls how it should behave. A convenient
way in RStudio is to create a new directory and an app.R file
containing a basic app in one step by clicking File | New Project, then selecting New Directory and Shiny
Web Application. This will create a template file which can be modified.
Try it out and add the following code to app.R:
library(shiny)
ui <- fluidPage(
"Hello, world!"
)
server <- function(input, output, session) {
}
shinyApp(ui, server)
As a next step, we can add inputs and outputs to the UI and fill them in the server function:
ui <- fluidPage(
  selectInput("dataset", label = "Dataset", choices = ls("package:datasets")),
  verbatimTextOutput("summary")
)
server <- function(input, output, session) {
output$summary <- renderPrint({
dataset <- get(input$dataset, "package:datasets")
summary(dataset)
})
}
shinyApp(ui, server)
4.2 Basic UI
The basic UI is built from input and output functions.
• The input functions are used to insert inputs from the user into your app.
• The output functions are used to create placeholders in the UI that are later filled by the server function.
4.2.1 Inputs
All input functions have the same first argument: inputId. This is the identifier used to connect the front
end with the back end: if your UI has an input with ID "name", the server function will access it with
input$name.
The inputId must be a simple string that contains only letters, numbers, and underscores (no spaces, dashes,
periods, or other special characters allowed!). Name it like you would name a variable in R. Also, it must be
unique. Otherwise there is no way to refer to this control in the server function.
Most input functions have a second parameter called label. This is used to create a human-readable label
for the control. The third parameter is typically value, which, where possible, lets you set the default value.
The remaining parameters are unique to the control.
The following contains an overview of the main input functions implemented in shiny.
4.2.1.1 Numeric inputs To collect numeric values, create a constrained text box with numericInput()
or a slider with sliderInput():
ui <- fluidPage(
  numericInput("num", "Number one", value = 0, min = 0, max = 100),
  sliderInput("num2", "Number two", value = 50, min = 0, max = 100),
  sliderInput("rng", "Range", value = c(10, 20), min = 0, max = 100)
)
4.2.1.2 Limited choices There are two different approaches to allow the user to choose from a prespecified
set of options: selectInput() and radioButtons().
animals <- c("dog", "cat", "mouse", "bird", "other", "I do not like animals")
ui <- fluidPage(
  selectInput("state", "What's your favourite state?", state.name),
  radioButtons("animal", "What's your favourite animal?", animals)
)
Dropdowns created with selectInput() are more suitable for longer lists, while radioButtons() will show
all possible choices.
You can also set multiple = TRUE in selectInput() to allow the user to select multiple elements.
ui <- fluidPage(
  selectInput(
    "state", "What's your favourite state?", state.name,
    multiple = TRUE
  )
)
There is no way to select multiple values with radio buttons, but for this purpose you can use
checkboxGroupInput().
ui <- fluidPage(
  checkboxGroupInput("animal", "What animals do you like?", animals)
)
4.2.1.3 Dates Collect a single day with dateInput() or a range of two days with dateRangeInput():
ui <- fluidPage(
  dateInput("dob", "When were you born?"),
  dateRangeInput("holiday", "When do you want to go on vacation next?")
)
4.2.1.4 Free text Collect small amounts of text with textInput(), passwords with passwordInput(),
and paragraphs of text with textAreaInput().
ui <- fluidPage(
  textInput("name", "What's your name?"),
  passwordInput("password", "What's your password?"),
  textAreaInput("story", "Tell me about yourself", rows = 3)
)
4.2.1.5 File upload Allow the user to upload a file with fileInput():
ui <- fluidPage(
  fileInput("upload", NULL)
)
fileInput() requires special handling on the server side, and is discussed later in the chapter.
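As a preview, a minimal sketch of the server side, assuming that the uploaded file is a CSV and that the UI additionally contains tableOutput("preview") (both assumptions):
server <- function(input, output, session) {
  output$preview <- renderTable({
    req(input$upload)  # wait until a file has been uploaded
    # input$upload is a data frame with columns name, size, type and datapath;
    # datapath is the path to a temporary copy of the uploaded file
    head(read.csv(input$upload$datapath))
  })
}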
4.2.1.6 Action buttons The user can perform an action with actionButton() or actionLink():
ui <- fluidPage(
  actionButton("click", "Click me!"),
  actionButton("drink", "Drink me!", icon = icon("cocktail"))
)
Action links and buttons are paired with observeEvent() or eventReactive() in the server function. We will learn more about them in the next part.
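As a minimal sketch of this pairing (reusing the "click" button from above), the server below simply reports how often the button has been pressed; actionButton() stores the current number of clicks in input$click:
server <- function(input, output, session) {
  observeEvent(input$click, {
    message("The button has been clicked ", input$click, " times.")
  })
}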
4.2.2 Outputs
Like inputs, outputs take a unique ID as their first argument: if the UI specification creates an output with
ID "plot", you can access it in the server function with output$plot.
Each output function on the front end is coupled with a render function in the back end. There are three
main types of output, corresponding to the three things you usually include in a report:
• text,
• tables, and
• plots.
The following sections show you the basics of the output functions on the front end, along with the
corresponding render functions in the back end.
4.2.2.1 Text Output regular text with textOutput() and fixed code and console output with
verbatimTextOutput().
ui <- fluidPage(
  textOutput("text"),
  verbatimTextOutput("code")
)
server <- function(input, output, session) {
  output$text <- renderText({
    "Hello friend!"
  })
  output$code <- renderPrint({
    summary(1:10)
  })
}
4.2.2.2 Tables There are two options for displaying data frames in tables:
• tableOutput() and renderTable() render a static table of data, showing all the data at once.
• dataTableOutput() and renderDataTable() render a dynamic table, showing a fixed number of rows
along with controls to change which rows are visible.
ui <- fluidPage(
  tableOutput("static"),
  dataTableOutput("dynamic")
)
server <- function(input, output, session) {
  output$static <- renderTable(head(mtcars))
  output$dynamic <- renderDataTable(mtcars, options = list(pageLength = 5))
}
4.2.2.3 Plots You can display any type of R graphic (base, ggplot2, or otherwise) with plotOutput()
and renderPlot():
ui <- fluidPage(
  plotOutput("plot", width = "400px")
)
server <- function(input, output, session) {
  output$plot <- renderPlot(plot(1:5), res = 96)
}
If you have an interactive graph you can use plotlyOutput() with renderPlotly():
library(shiny)
library(plotly)
ui <- fluidPage(
  plotlyOutput("interactive_plot", width = "400px")
)
server <- function(input, output, session) {
  output$interactive_plot <- renderPlotly({
    p <- ggplot(mtcars, aes(x = hp, y = mpg, col = factor(am))) +
      geom_point()
    ggplotly(p)
  })
}
4.2.2.4 Downloads One can let the user download a file with downloadButton() or downloadLink(). We will not go into details here but if interested, check https://mastering-shiny.org/action-transfer.html#action-transfer.
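For a first impression, a minimal sketch (the IDs and the file written are illustrative): a downloadButton() in the UI is paired with downloadHandler() in the server.
ui <- fluidPage(
  downloadButton("download", "Download data")
)
server <- function(input, output, session) {
  output$download <- downloadHandler(
    filename = function() "mtcars.csv",               # file name suggested to the browser
    content = function(file) write.csv(mtcars, file)  # writes the data to the file to be sent
  )
}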
4.3.1.1 Input argument The input argument is a list-like object that contains all the input data sent from the browser, named according to the inputId. For example, if the UI contains a numeric input control with an inputId of count:
ui <- fluidPage(
  numericInput("count", label = "Number of values", value = 100)
)
then you can access the value of that input with input$count. It will initially contain the value 100, and it
will be automatically updated as the user changes the value in the browser.
Unlike a typical list, input objects are read-only. If you attempt to modify an input inside the server
function, you’ll get an error (run this on your computer to see the error message):
server <- function(input, output, session) {
  input$count <- 10
}
shinyApp(ui, server)
One more important thing about input: it’s selective about who is allowed to read it. To read from an input,
you must be in a reactive context created by a function like renderText() or reactive().
This code illustrates the error you will see if you make this mistake (run it on your computer to see the error message):
server <- function(input, output, session) {
  message("The value of input$count is ", input$count)
}
shinyApp(ui, server)
4.3.1.2 Output argument The argument output is very similar to input: it is a list-like object named according to the outputId. The main difference is that we use it for sending output instead of receiving input. The output object should be used together with a render function, as in the following simple example:
ui <- fluidPage(
  textInput("name", "What's your name?"),
  textOutput("greeting")
)
server <- function(input, output, session) {
  output$greeting <- renderText({
    paste0("Hello ", input$name, "!")
  })
}
The render function sets up a special reactive context that automatically tracks what inputs the output uses.
It also converts the output of the R code into HTML suitable for display on a web page.
If we run the app and type in the name box, we see that the greeting updates automatically as we type (try it out). Note that one does not need to tell an output when to update, because Shiny does this automatically. The style of programming used in Shiny is declarative, meaning that the programmer expresses higher-level goals or describes important constraints, and relies on someone else or on the software to decide how and/or when to translate that into action.
One of the strengths of declarative programming in Shiny is that it allows apps to be extremely lazy. A
Shiny app will only ever do the minimal amount of work needed to update the output controls that you can
currently see in the UI. Shiny’s laziness has another important property. In most R code, you can understand
the order of execution by reading the code from top to bottom. That doesn’t work in Shiny, because code
is only run when needed. To understand the order of execution you need to instead look at the reactive
graph, which describes how inputs and outputs are connected. The reactive graph contains one symbol for
every input and output, and an input is connected to an output whenever the output accesses the input.
For the greeting app above the reactive graph is very simple: the name input is connected to the greeting output.
One more important component of the reactive graph is the reactive expression. One can think of reactive expressions as a tool that reduces duplication in reactive code by introducing additional nodes into the reactive graph. For the greeting app we don’t necessarily need a reactive expression, but let’s see how we can use one:
server <- function(input, output, session) {
  string <- reactive(paste0("Hello ", input$name, "!"))
  output$greeting <- renderText(string())
}
The reactive graph then gains an additional node for the reactive expression string, sitting between the name input and the greeting output.
Finally, note that due to laziness, the order of the lines in the code does not matter (which might not be intuitive, given the way code is usually run in R). The following code therefore behaves exactly like the previous one:
server <- function(input, output, session) {
  output$greeting <- renderText(string())
  string <- reactive(paste0("Hello ", input$name, "!"))
}
4.4 Layout
Layout functions provide the high-level visual structure of an app. Layouts are created by a hierarchy of
function calls, where the hierarchy in R matches the hierarchy in the generated HTML. For example, look at
the following code for UI. It should be intuitively clear what it does.
fluidPage(
  titlePanel("Hello Shiny!"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("obs", "Observations:", min = 0, max = 1000, value = 500)
    ),
    mainPanel(
      plotOutput("distPlot")
    )
  )
)
It will generate a classic app design: a title bar at the top, followed by a sidebar (containing a slider) and a main panel (containing a plot).
This section contains a brief introduction to the main layout functions in Shiny. For more information on
the customization of Shiny apps see https://mastering-shiny.org/action-layout.html#action-layout and
references therein.
4.4.2 Page with sidebar
A two-column layout with inputs on the left and outputs on the right is created with sidebarLayout(), using the following basic skeleton:
fluidPage(
  titlePanel(
    # app title/description
  ),
  sidebarLayout(
    sidebarPanel(
      # inputs
    ),
    mainPanel(
      # outputs
    )
  )
)
4.4.3 Multi-row
As usual, you start with fluidPage(). Then you create rows with fluidRow(), and columns with column().
fluidPage(
  fluidRow(
    column(4,
      ...
    ),
    column(8,
      ...
    )
  ),
  fluidRow(
    column(6,
      ...
    ),
    column(6,
      ...
    )
  )
)
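Each row is divided into 12 column units, so the width arguments of the column() calls within one fluidRow() should add up to 12. A toy filling of the skeleton above (the inputs and outputs are illustrative, not from the script):
ui <- fluidPage(
  fluidRow(
    column(4, sliderInput("n", "Observations", min = 1, max = 100, value = 10)),
    column(8, plotOutput("hist"))
  )
)
server <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)), res = 96)
}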
4.4.4 Multi-page layouts
4.4.4.1 Tabsets The simple way to break up a page into pieces is to use tabsetPanel() and tabPanel().
ui <- fluidPage(
  tabsetPanel(
    tabPanel("Import data",
      fileInput("file", "Data", buttonLabel = "Upload..."),
      textInput("delim", "Delimiter (leave blank to guess)", ""),
      numericInput("skip", "Rows to skip", 0, min = 0),
      numericInput("rows", "Rows to preview", 10, min = 1)
    ),
    tabPanel("Set parameters"),
    tabPanel("Visualise results")
  )
)
4.4.4.2 Navlists and navbars Because tabs are displayed horizontally, there is a fundamental limit to how many tabs you can use, particularly if they have long titles. navlistPanel() and navbarPage() provide two alternative layouts that let you use more tabs with longer titles. navlistPanel() is similar to tabsetPanel(), but instead of running the tab titles horizontally, it shows them vertically in a sidebar.
ui <- fluidPage(
  navlistPanel(
    id = "tabset",
    "Heading 1",
    tabPanel("panel 1", "Panel one contents"),
    "Heading 2",
    tabPanel("panel 2", "Panel two contents"),
    tabPanel("panel 3", "Panel three contents")
  )
)
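navbarPage() runs the tab titles in a horizontal navigation bar instead, and navbarMenu() adds drop-down menus for nested tabs. A minimal sketch (the panel names are illustrative):
ui <- navbarPage(
  "Page title",
  tabPanel("panel 1", "one"),
  tabPanel("panel 2", "two"),
  navbarMenu("subpanels",
             tabPanel("panel 3a", "three-a"),
             tabPanel("panel 3b", "three-b")
  )
)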
4.5 Dynamic UI
So far, the user interface has been defined statically when the app is launched, so it cannot respond to anything that happens in the app. However, we can also create dynamic user interfaces, changing the UI using code run in the server function.
There are three key techniques for creating dynamic user interfaces:
• Using the update family of functions to modify parameters of input controls.
• Using tabsetPanel() to conditionally show and hide parts of the user interface (a sketch follows after the update examples below; for more examples see https://mastering-shiny.org/action-dynamic.html).
• Using uiOutput() and renderUI() to generate selected parts of the user interface with code.
Here is a simple example, where once a minimum and a maximum have been specified by the user, the ends of the slider change accordingly:
ui <- fluidPage(
  numericInput("min", "Minimum", 0),
  numericInput("max", "Maximum", 3),
  sliderInput("n", "n", min = 0, max = 3, value = 1)
)
server <- function(input, output, session) {
  observeEvent(input$min, {
    updateSliderInput(inputId = "n", min = input$min)
  })
  observeEvent(input$max, {
    updateSliderInput(inputId = "n", max = input$max)
  })
}
Another simple example resets the slider values to zero every time the reset action button is clicked:
ui <- fluidPage(
  sliderInput("x1", "x1", 0, min = -10, max = 10),
  sliderInput("x2", "x2", 0, min = -10, max = 10),
  sliderInput("x3", "x3", 0, min = -10, max = 10),
  actionButton("reset", "Reset")
)
server <- function(input, output, session) {
  observeEvent(input$reset, {
    updateSliderInput(inputId = "x1", value = 0)
    updateSliderInput(inputId = "x2", value = 0)
    updateSliderInput(inputId = "x3", value = 0)
  })
}
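The second technique, conditionally showing and hiding parts of the UI with tabsetPanel(), can be sketched as follows: a hidden tabset whose visible panel is switched from the server (a minimal sketch; the IDs and contents are illustrative):
ui <- fluidPage(
  selectInput("controller", "Show", choices = c("panel1", "panel2")),
  tabsetPanel(
    id = "switcher",
    type = "hidden",
    tabPanelBody("panel1", "Panel 1 content"),
    tabPanelBody("panel2", "Panel 2 content")
  )
)
server <- function(input, output, session) {
  observeEvent(input$controller, {
    updateTabsetPanel(inputId = "switcher", selected = input$controller)
  })
}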
If we want to create a specific type of input depending on other inputs, we can use uiOutput() and renderUI(). Here is an example that dynamically creates an input control, with its type and label controlled by two other inputs:
ui <- fluidPage(
  textInput("label", "label"),
  selectInput("type", "type", c("slider", "numeric")),
  uiOutput("numeric")
)
server <- function(input, output, session) {
  output$numeric <- renderUI({
    if (input$type == "slider") {
      sliderInput("dynamic", input$label, value = 0, min = 0, max = 10)
    } else {
      numericInput("dynamic", input$label, value = 0, min = 0, max = 10)
    }
  })
}
4.6 Validation
The most important feedback you can give to the user is whether they have given a bad input. Thinking
through how the user might misuse your app allows you to provide informative messages in the UI, rather
than allowing errors to trickle through into the R code and generate uninformative errors. A great way to
give additional feedback to the user is via the shinyFeedback package (Merlino and Howard 2021).
First, you add useShinyFeedback() to the ui:
ui <- fluidPage(
  shinyFeedback::useShinyFeedback(),
  numericInput("n", "n", value = 10),
  textOutput("half")
)
Then, in the server function, you call one of the feedback functions: feedback(), feedbackWarning(), feedbackDanger(), or feedbackSuccess(). They all have three key arguments:
• inputId, the id of the input where the feedback should be placed.
• show, a logical determining whether or not to show the feedback.
• text, the text to display.
They also have color and icon arguments that you can use to further customise the appearance. See the
documentation for more details.
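For instance, an illustrative one-off call might look like this (the colour value shown is an assumption, and the call must sit inside a reactive context in the server):
shinyFeedback::feedbackDanger("n", show = TRUE, text = "n must be positive",
                              color = "#d9534f",
                              icon = shiny::icon("exclamation-sign", lib = "glyphicon"))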
server <- function(input, output, session) {
  half <- reactive({
    even <- input$n %% 2 == 0
    shinyFeedback::feedbackWarning("n", !even, "Please select an even number")
    input$n / 2
  })
  output$half <- renderText(half())
}
The error message is displayed, but the output is still updated. To stop inputs from triggering reactive
changes, you need a new tool: req(), short for “required”.
server <- function(input, output, session) {
  half <- reactive({
    even <- input$n %% 2 == 0
    shinyFeedback::feedbackWarning("n", !even, "Please select an even number")
    req(even)
    input$n / 2
  })
  output$half <- renderText(half())
}
shinyFeedback is great when the problem is related to a single input. But sometimes the invalid state is a
result of a combination of inputs. In this case it doesn’t really make sense to put the error next to an input
(which one would you put it beside?) and instead it makes more sense to put it in the output. You can do so
with validate(). When called inside a reactive or an output, validate(message) stops execution of the
rest of the code and instead displays message in any downstream outputs. The following code shows a simple
example where we don’t want to log or square-root negative values.
ui <- fluidPage(
  numericInput("x", "x", value = 0),
  selectInput("trans", "transformation",
              choices = c("square", "log", "square-root")
  ),
  textOutput("out")
)
server <- function(input, output, session) {
  output$out <- renderText({
    if (input$x < 0 && input$trans %in% c("log", "square-root")) {
      validate("x can not be negative for this transformation")
    }
    switch(input$trans,
           square = input$x ^ 2,
           "square-root" = sqrt(input$x),
           log = log(input$x)
    )
  })
}
References
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Chang, Winston, Javier Luraschi, and Timothy Mastny. 2020. Profvis: Interactive Visualizations for Profiling R Code. https://CRAN.R-project.org/package=profvis.
Dahl, David B., David Scott, Charles Roosen, Arni Magnusson, and Jonathan Swinton. 2019. Xtable: Export Tables to LaTeX or HTML. https://CRAN.R-project.org/package=xtable.
Dowle, Matt, and Arun Srinivasan. 2021. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Hester, Jim, and Davis Vaughan. 2021. Bench: High Precision Timing of R Expressions. https://CRAN.R-project.org/package=bench.
Leifeld, Philip. 2013. “texreg: Conversion of Statistical Model Output in R to LaTeX and HTML Tables.” Journal of Statistical Software 55 (8): 1–24. http://dx.doi.org/10.18637/jss.v055.i08.
Merlino, Andy, and Patrick Howard. 2021. ShinyFeedback: Display User Feedback in Shiny Apps. https://CRAN.R-project.org/package=shinyFeedback.
Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Ripley, Brian, and Michael Lapsley. 2021. RODBC: ODBC Database Access. https://CRAN.R-project.org/package=RODBC.
Temple Lang, Duncan. 2021. XML: Tools for Parsing and Generating XML Within R and S-PLUS. https://CRAN.R-project.org/package=XML.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2021. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
Wickham, Hadley, and Jennifer Bryan. 2022. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Maximilian Girlich. 2022. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2021. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley, and Evan Miller. 2021. Haven: Import and Export ’SPSS’, ’Stata’ and ’SAS’ Files. https://CRAN.R-project.org/package=haven.