R Markdown & Knitr Guide
Syntax Becomes
Plain text
superscript^2^
~~strikethrough~~
[link](www.rstudio.com)
# Header 1
## Header 2
### Header 3
#### Header 4
##### Header 5
###### Header 6
endash: --
emdash: ---
ellipsis: ...
image: 
***
* unordered list
* item 2
+ sub-item 1
+ sub-item 2
1. ordered list
2. item 2
+ sub-item 1
+ sub-item 2
R Markdown Reference Guide
Learn more about R Markdown at rmarkdown.rstudio.com
Learn more about Interactive Docs at shiny.rstudio.com/articles
Contents:
1. Markdown Syntax
2. Knitr chunk options
3. Pandoc options
Syntax Becomes
Make a code chunk with three back ticks followed
by an r in braces. End the chunk with three back
ticks:
```{r}
paste("Hello", "World!")
```
Chunk options
option default value description
Code evaluation
child NULL A character vector of filenames. Knitr will knit the files and place them into the main document.
code NULL Set to R code. Knitr will replace the code in the chunk with the code in the code option.
engine 'R' Knitr will evaluate the chunk in the named language, e.g. engine = 'python'. Run names(knitr::knit_engines$get()) to see supported languages.
eval TRUE If FALSE, knitr will not run the code in the code chunk.
include TRUE If FALSE, knitr will run the chunk but not include the chunk in the final document.
purl TRUE If FALSE, knitr will not include the chunk when running purl() to extract the source code.
Results
collapse FALSE If TRUE, knitr will collapse all the source and output blocks created by the chunk into a single block.
echo TRUE If FALSE, knitr will not display the code in the code chunk above its results in the final document.
results 'markup' If 'hide', knitr will not display the code's results in the final document. If 'hold', knitr will delay displaying all output pieces until the end of the chunk. If 'asis', knitr will pass through results without reformatting them (useful if results return raw HTML, etc.).
error TRUE If TRUE, knitr will display error messages in the output and keep going. If FALSE, knitr will stop knitting at the first error. (R Markdown changes this default to FALSE.)
message TRUE If FALSE, knitr will not display any messages generated by the code.
warning TRUE If FALSE, knitr will not display any warning messages generated by the code.
Code Decoration
comment '##' A character string. Knitr will prepend the string to each line of results in the final document.
highlight TRUE If TRUE, knitr will highlight the source code in the final output.
prompt FALSE If TRUE, knitr will add > to the start of each line of code displayed in the final document.
strip.white TRUE If TRUE, knitr will remove white spaces that appear at the beginning or end of a code chunk.
tidy FALSE If TRUE, knitr will tidy code chunks for display with the tidy_source() function in the formatR package.
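The options in the table can be given in an individual chunk header, or set once for the whole document. A minimal sketch of the document-wide form, assuming the knitr package is installed; the particular values are illustrative choices, not recommendations from the guide:

```r
library(knitr)

# Set document-wide defaults, typically inside a first "setup" chunk.
opts_chunk$set(
  echo    = FALSE,  # do not display source code
  message = FALSE,  # do not display messages
  warning = FALSE,  # do not display warnings
  comment = "#>"    # string placed at the start of each line of results
)

# Look up the current value of one option.
opts_chunk$get("echo")
```

An individual chunk can still override a default by naming the option in its own header, for example a header of the form {r, echo=TRUE}.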
Chunk options (Continued)
option default value description
Chunks
opts.label NULL The label of a set of options defined in knitr::opts_template to use with the chunk.
R.options NULL Local R options to use with the chunk. Options are set with options() at start of chunk. Defaults are restored at end.
ref.label NULL A character vector of labels of the chunks from which the code of the current chunk is inherited.
Cache
autodep FALSE If TRUE, knitr will attempt to figure out dependencies between chunks automatically by analyzing object names.
cache FALSE If TRUE, knitr will cache the results to reuse in future knits. Knitr will reuse the results until the code chunk is altered.
cache.comments NULL If FALSE, knitr will not rerun the chunk if only a code comment has changed.
cache.lazy TRUE If TRUE, knitr will use lazyload() to load objects in chunk. If FALSE, knitr will use load() to load objects in chunk.
cache.path 'cache/' A file path to the directory to store cached results in. Path should begin in the directory that the .Rmd file is saved in.
cache.vars NULL A character vector of object names to cache if you do not wish to cache each object in the chunk.
dependson NULL A character vector of chunk labels to specify which other chunks a chunk depends on. Knitr will update a cached chunk if its dependencies change.
Animation
aniopts 'controls,loop' Extra options for animations (see the animate LaTeX package).
interval 1 The number of seconds to pause between animation frames.
Plots
dev 'png' The R function name that will be used as a graphical device to record plots, e.g. dev='CairoPDF'.
dev.args NULL Arguments to be passed to the device, e.g. dev.args=list(bg='yellow', pointsize=10).
dpi 72 A number for knitr to use as the dots per inch (dpi) in graphics (when applicable).
external TRUE If TRUE, knitr will externalize tikz graphics to save LaTeX compilation time (only for the tikzDevice::tikz() device).
fig.align 'default' How to align graphics in the final document. One of 'left', 'right', or 'center'.
fig.cap NULL A character string to be used as a figure caption in LaTeX.
fig.env 'figure' The LaTeX environment for figures.
fig.ext NULL The file extension for figure output, e.g. fig.ext='png'.
fig.height, fig.width 7 The width and height to use in R for plots created by the chunk (in inches).
fig.keep 'high' If 'high', knitr will merge low-level changes into high level plots. If 'all', knitr will keep all plots (low-level changes may produce new plots). If 'first', knitr will keep the first plot only. If 'last', knitr will keep the last plot only. If 'none', knitr will discard all plots.
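The opts.label option refers to a named bundle of options registered in knitr::opts_template. A small sketch of how such a bundle could be defined, assuming knitr is installed; the template name "small_fig" and its values are hypothetical:

```r
library(knitr)

# Register a reusable bundle of figure options under a hypothetical name.
opts_template$set(
  small_fig = list(fig.width = 4, fig.height = 3, fig.align = "center", dpi = 150)
)

# A chunk with the header {r, opts.label = "small_fig"} would pick up these values.
opts_template$get("small_fig")
```

Grouping plot options this way keeps chunk headers short when many chunks share the same figure size.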
Templates
html_document
pdf_document
word_document
md_document
ioslides_presentation
slidy_presentation
beamer_presentation

Basic YAML
---
title: "A Web Doc"
author: "John Doe"
date: "May 1, 2015"
output: md_document
---

Template options
---
title: "Chapters"
output:
  html_document:
    toc: true
    toc_depth: 2
---

LaTeX options
---
title: "My PDF"
output: pdf_document
fontsize: 11pt
geometry: margin=1in
---

Interactive Docs
---
title: "Slides"
output:
  slidy_presentation:
    incremental: true
runtime: shiny
---
## Header 2
You can start a new slide with a horizontal rule `***` if you do not want
a header.
## Bullets
- a dash
- another dash
## Incremental bullets
ioslides
f - enable fullscreen mode
w - toggle widescreen mode
o - enable overview mode
h - enable code highlight mode
p - show presenter notes
slidy
C - show table of contents
F - toggle display of the footer
A - toggle display of current vs all slides
S - make fonts smaller
B - make fonts bigger
mainfont, sansfont, monofont, mathfont Document fonts (works only with xelatex and lualatex, see the latex_engine option)
linkcolor, urlcolor, citecolor Color for internal, external, and citation links (red, green, magenta, cyan, blue, black)
Base R Cheat Sheet
Getting Help
Vectors
Creating Vectors
c(2, 4, 6) 2 4 6 Join elements into a vector
2:6 2 3 4 5 6 An integer sequence
Programming
For Loop
for (variable in sequence){
  Do something
}
While Loop
while (condition){
  Do something
}
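The vector constructors and the two loop templates can be combined in a short worked example. This is an illustrative sketch in base R; the variable names and the tasks are made up for the example:

```r
# Create vectors
evens <- c(2, 4, 6)   # join elements into a vector
ints  <- 2:6          # integer sequence 2, 3, 4, 5, 6

# For loop: add up the square of each element of evens
total <- 0
for (v in evens) {
  total <- total + v^2
}

# While loop: count how many doublings of 1 are needed to reach at least 100
n <- 1
steps <- 0
while (n < 100) {
  n <- n * 2
  steps <- steps + 1
}
```

Many loops of the first kind can also be written without a loop, for example sum(evens^2), because arithmetic in R is vectorized.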
RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • Learn more at web page or vignette • package version • Updated: 3/15
Types
Converting between common data types in R. Can always go from a higher value in the table to a lower value.
as.logical TRUE, FALSE, TRUE Boolean values (TRUE or FALSE).
as.numeric 1, 0, 1 Integers or floating point numbers.
as.character '1', '0', '1' Character strings. Generally preferred to factors.
as.factor '1', '0', '1', levels: '1', '0' Character strings with preset levels. Needed for some statistical models.

Matrices
m <- matrix(x, nrow = 3, ncol = 3) Create a matrix from x.
m[2, ] - Select a row
m[ , 1] - Select a column
t(m) Transpose
m %*% n Matrix multiplication

Strings Also see the stringr package.
paste(x, y, sep = ' ') Join multiple vectors together.
paste(x, collapse = ' ') Join elements of a vector together.
grep(pattern, x) Find regular expression matches in x.
gsub(pattern, replace, x) Replace matches in x with a string.
nchar(x) Number of characters in a string.
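The conversion functions in the Types table can be tried on a single small vector. The sketch below uses base R only and also shows one common pitfall when converting a factor back to numbers:

```r
x <- c(1, 0, 1)

as.logical(x)      # TRUE FALSE TRUE
as.character(x)    # "1" "0" "1"
f <- as.factor(x)  # factor; R sorts the levels, giving "0" then "1"

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the original values.
codes  <- as.numeric(f)                # 2 1 2
values <- as.numeric(as.character(f))  # 1 0 1
```

Going through as.character() first recovers the original numbers, which is why the two-step conversion is the usual idiom.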
Lists
l <- list(x = 1:5, y = c('a', 'b')) A list is a collection of elements which can be of different types.
l[[2]] Second element of l.
l[1] New list with only the first element.
l$x Element named x.
l['y'] New list with only element named y.

Factors
factor(x) Turn a vector into a factor. Can set the levels of the factor and the order.
cut(x, breaks = 4) Turn a numeric vector into a factor by 'cutting' into sections.

Maths Functions
log(x) Natural log.
exp(x) Exponential.
max(x) Largest element.
min(x) Smallest element.
round(x, n) Round to n decimal places.
signif(x, n) Round to n significant figures.
cor(x, y) Correlation.
sum(x) Sum.
mean(x) Mean.
median(x) Median.
quantile(x) Percentage quantiles.
rank(x) Rank of elements.
var(x) The variance.
sd(x) The standard deviation.

Statistics
lm(y ~ x, data=df) Linear model.
glm(y ~ x, data=df) Generalised linear model.
summary Get more detailed information out of a model.
t.test(x, y) Perform a t-test for difference between means.
pairwise.t.test Perform pairwise t-tests between group levels, with correction for multiple testing.
prop.test Test for a difference between proportions.
aov Analysis of variance.

Data Frames Also see the dplyr package.
df <- data.frame(x = 1:3, y = c('a', 'b', 'c')) A special case of a list where all elements are the same length.
x y
1 a
2 b
3 c
List subsetting
df$x
df[[2]]
Matrix subsetting
df[ , 2]
df[2, 2]
Understanding a data frame
View(df) See the full data frame.
head(df) See the first 6 rows.
nrow(df) Number of rows.
ncol(df) Number of columns.
dim(df) Number of rows and columns.
cbind - Bind columns.
rbind - Bind rows.

Variable Assignment
> a <- 'apple'
> a
[1] 'apple'

The Environment
ls() List all variables in the environment.
rm(list = ls()) Remove all variables from the environment.
You can use the environment panel in RStudio to browse variables in your environment.

Distributions
          Random Variates  Density Function  Cumulative Distribution  Quantile
Normal    rnorm            dnorm             pnorm                    qnorm
Poisson   rpois            dpois             ppois                    qpois
Binomial  rbinom           dbinom            pbinom                   qbinom
Uniform   runif            dunif             punif                    qunif

Plotting
plot(x) Values of x in order.
plot(x, y) Values of x against y.
hist(x) Histogram of x.

Dates See the lubridate package.
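The difference between list-style and matrix-style subsetting is easiest to see side by side. A brief base R sketch, reusing the l and df objects defined in the sheet:

```r
l  <- list(x = 1:5, y = c("a", "b"))
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

l[[2]]   # the second element itself: c("a", "b")
l[1]     # a new list holding only the first element
l$x      # the element named x: 1 2 3 4 5

df$x      # column x as a vector
df[[2]]   # second column as a vector
df[2, 2]  # the single value in row 2, column 2
dim(df)   # number of rows, then number of columns: 3 2
```

Single brackets keep the container (a list stays a list), while double brackets and $ pull out the contents.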
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at readr.tidyverse.org • readr 2.0.0 • readxl 1.3.1 • googlesheets4 1.0.0 • Updated: 2021-08
Import Spreadsheets

with readxl

READ EXCEL FILES
read_excel(path, sheet = NULL, range = NULL) Read a .xls or .xlsx file based on the file extension. See front page for more read arguments. Also read_xls() and read_xlsx().
read_excel("excel_file.xlsx")

READ SHEETS
read_excel(path, sheet = NULL) Specify which sheet to read by position or name.
read_excel(path, sheet = 1)
read_excel(path, sheet = "s1")
excel_sheets(path) Get a vector of sheet names.
excel_sheets("excel_file.xlsx")
To read multiple sheets:
1. Get a vector of sheet names from the file path.
2. Set the vector names to be the sheet names.
3. Use purrr::map_dfr() to read multiple files into one data frame.
path <- "your_file_path.xlsx"
path %>% excel_sheets() %>%
  set_names() %>%
  map_dfr(read_excel, path = path)

OTHER USEFUL EXCEL PACKAGES
For functions to write data to Excel files, see:
• openxlsx
• writexl
For working with non-tabular Excel data, see:
• tidyxl

READXL COLUMN SPECIFICATION
Column specifications define what data type each column of a file will be imported as.
Use the col_types argument of read_excel() to set the column specification.
Guess column types
To guess a column type, read_excel() looks at the first 1000 rows of data. Increase with the guess_max argument.
read_excel(path, guess_max = Inf)
Set all columns to same type, e.g. character
read_excel(path, col_types = "text")
Set each column individually
read_excel(
  path,
  col_types = c("text", "guess", "guess", "numeric")
)
COLUMN TYPES
logical  numeric  text   date        list
TRUE     2        hello  1947-01-08  hello
FALSE    3.45     world  1956-10-21  1
• skip
• guess
• logical
• numeric
• text
• date
• list
Use list for columns that include multiple data types. See tidyr and purrr for list-column data.

with googlesheets4

READ SHEETS
read_sheet(ss, sheet = NULL, range = NULL) Read a sheet from a URL, a Sheet ID, or a dribble from the googledrive package. See front page for more read arguments. Same as range_read().

SHEETS METADATA
URLs are in the form:
https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit#gid=SHEET_ID
gs4_get(ss) Get spreadsheet meta data.
gs4_find(...) Get data on all spreadsheet files.
sheet_properties(ss) Get a tibble of properties for each worksheet. Also sheet_names().

WRITE SHEETS
write_sheet(data, ss = NULL, sheet = NULL) Write a data frame into a new or existing Sheet.
gs4_create(name, ..., sheets = NULL) Create a new Sheet with a vector of names, a data frame, or a (named) list of data frames.
sheet_append(ss, data, sheet = 1) Add rows to the end of a worksheet.

GOOGLESHEETS4 COLUMN SPECIFICATION
Column specifications define what data type each column of a file will be imported as.
Use the col_types argument of read_sheet()/range_read() to set the column specification.
Guess column types
To guess a column type, read_sheet()/range_read() looks at the first 1000 rows of data. Increase with guess_max.
read_sheet(path, guess_max = Inf)
Set all columns to same type, e.g. character
read_sheet(path, col_types = "c")
Set each column individually
# col types: skip, guess, integer, logical, character
read_sheet(ss, col_types = "_?ilc")
COLUMN TYPES
l      n     c      D           L
TRUE   2     hello  1947-01-08  hello
FALSE  3.45  world  1956-10-21  1
• skip - "_" or "-"
• guess - "?"
• logical - "l"
• integer - "i"
• double - "d"
• numeric - "n"
• date - "D"
• datetime - "T"
• character - "c"
• list-column - "L"
• cell - "C" Returns list of raw cell data.
Use list for columns that include multiple data types. See tidyr and purrr for list-column data.

CELL SPECIFICATION FOR READXL AND GOOGLESHEETS4
Use the range argument of readxl::read_excel() or googlesheets4::read_sheet() to read a subset of cells from a sheet.
read_excel(path, range = "Sheet1!B1:D2")
read_sheet(ss, range = "B1:D2")
Also use the range argument with cell specification functions cell_limits(), cell_rows(), cell_cols(), and anchored().

FILE LEVEL OPERATIONS
googlesheets4 also offers ways to modify other aspects of Sheets (e.g. freeze rows, set column width, manage (work)sheets). Go to googlesheets4.tidyverse.org to read more.
For whole-file operations (e.g. renaming, sharing, placing within a folder), see the tidyverse package googledrive at googledrive.tidyverse.org.
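The multi-sheet recipe and the range argument can be tried without a file of your own by using the example workbook that ships with readxl. This is a sketch under that assumption; it keeps the sheets as a named list (base lapply() standing in for purrr::map_dfr()) because the bundled sheets do not share columns:

```r
library(readxl)

# readxl_example() returns the path to a workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")

# 1. Get the sheet names, 2. name the vector by itself, 3. read each sheet.
sheets <- excel_sheets(path)
names(sheets) <- sheets
tables <- lapply(sheets, function(s) read_excel(path, sheet = s))

# Read only a rectangle of cells from the first sheet:
# one header row plus three data rows, three columns wide.
corner <- read_excel(path, sheet = 1, range = "A1:C4")
```

When every sheet has the same columns, row-binding the list (or using map_dfr() as the sheet suggests) gives a single data frame.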
rmarkdown : : CHEAT SHEET
[Figure: annotated screenshot of the source editor and the rendered output, including the file path to the output document]
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at rmarkdown.rstudio.com • rmarkdown 2.9.4 • Updated: 2021-08
Set Output Formats and their Options in YAML
Use the document's YAML header to set an output format and customize it with output options.
---
title: "My Document"
author: "Author Name"
output:
  html_document:
    toc: TRUE
---
Indent format 2 characters, indent options 4 characters.

OUTPUT FORMAT  CREATES
html_document  .html
pdf_document*  .pdf
word_document  Microsoft Word (.docx)
powerpoint_presentation  Microsoft Powerpoint (.pptx)
odt_document  OpenDocument Text
rtf_document  Rich Text Format
md_document  Markdown
github_document  Markdown for Github
ioslides_presentation  ioslides HTML slides
slidy_presentation  Slidy HTML slides
beamer_presentation*  Beamer slides
* Requires LaTeX, use tinytex::install_tinytex()
Also see flexdashboard, bookdown, distill, and blogdown.

IMPORTANT OPTIONS  DESCRIPTION  [applies to: HTML, PDF, MS Word, MS PPT]
anchor_sections  Show section anchors on mouse hover (TRUE or FALSE)  [HTML]
citation_package  The LaTeX package to process citations ("default", "natbib", "biblatex")  [PDF]
code_download  Give readers an option to download the .Rmd source code (TRUE or FALSE)  [HTML]
code_folding  Let readers toggle the display of R code ("none", "hide", or "show")  [HTML]
css  CSS or SCSS file to use to style document (e.g. "style.css")  [HTML]
dev  Graphics device to use for figure output (e.g. "png", "pdf")  [HTML, PDF]
df_print  Method for printing data frames ("default", "kable", "tibble", "paged")  [HTML, PDF, MS Word, MS PPT]
fig_caption  Should figures be rendered with captions (TRUE or FALSE)  [HTML, PDF, MS Word, MS PPT]
highlight  Syntax highlighting ("tango", "pygments", "kate", "zenburn", "textmate")  [HTML, PDF, MS Word]
includes  File of content to place in doc ("in_header", "before_body", "after_body")  [HTML, PDF]
keep_md  Keep the Markdown .md file generated by knitting (TRUE or FALSE)  [HTML, PDF, MS Word, MS PPT]
keep_tex  Keep the intermediate TEX file used to convert to PDF (TRUE or FALSE)  [PDF]
latex_engine  LaTeX engine for producing PDF output ("pdflatex", "xelatex", or "lualatex")  [PDF]
reference_docx/_doc  docx/pptx file containing styles to copy in the output (e.g. "file.docx", "file.pptx")  [MS Word, MS PPT]
theme  Theme options (see Bootswatch and Custom Themes below)  [HTML]
toc  Add a table of contents at start of document (TRUE or FALSE)  [HTML, PDF, MS Word, MS PPT]
toc_depth  The lowest level of headings to add to table of contents (e.g. 2, 3)  [HTML, PDF, MS Word, MS PPT]
toc_float  Float the table of contents to the left of the main document content (TRUE or FALSE)  [HTML]
Use ?<output format> to see all of a format's options, e.g. ?html_document

Render
When you render a document, rmarkdown:
1. Runs the code and embeds results and text into an .md file with knitr.
2. Converts the .md file into the output format with Pandoc.
[Diagram: .Rmd -> knitr -> .md -> pandoc -> HTML, PDF, DOC]
Save, then Knit to preview the document output. The resulting HTML/PDF/MS Word/etc. document will be created and saved in the same directory as the .Rmd file.
Use rmarkdown::render() to render/knit in the R console. See ?render for available options.

Share
Publish on RStudio Connect to share R Markdown documents securely, schedule automatic updates, and interact with parameters in real time.
rstudio.com/products/connect/
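The two rendering steps can be observed separately from the R console. The sketch below assumes knitr is installed, writes a throwaway .Rmd file to a temporary location (its content is made up for the example), and only attempts the Pandoc step when rmarkdown and Pandoc are both available:

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences

# Write a tiny R Markdown file to a temporary location.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "My Document"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd)

# Step 1: knitr runs the code and writes a Markdown file.
md <- knit(rmd, output = tempfile(fileext = ".md"), quiet = TRUE)

# Steps 1 and 2 together (needs Pandoc): render() knits, then converts.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd, quiet = TRUE)
}
```

Opening the intermediate .md file shows the code and its printed result already embedded as plain Markdown, which is what Pandoc then converts.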
Data transformation with dplyr : : CHEAT SHEET
dplyr functions work with pipes and expect tidy data. In tidy data:
• Each variable is in its own column
• Each observation, or case, is in its own row
pipes: x %>% f(y) becomes f(x, y)

Summarise Cases
Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
summarise(.data, …) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))
count(.data, …, wt = NULL, sort = FALSE, name = NULL) Count number of rows in each group defined by the variables in … Also tally().
count(mtcars, cyl)

Group Cases
Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))
Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
starwars %>%
  rowwise() %>%
  mutate(film_count = length(films))
ungroup(g_mtcars)

Manipulate Cases
EXTRACT CASES
Row functions return a subset of rows as a new table.
filter(.data, …, .preserve = FALSE) Extract rows that meet logical criteria.
filter(mtcars, mpg > 20)
distinct() Remove rows with duplicate values.
distinct(mtcars, gear)
slice(.data, …, .preserve = FALSE) Select rows by position.
slice(mtcars, 10:15)
slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
slice_sample(mtcars, n = 5, replace = TRUE)
slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
slice_min(mtcars, mpg, prop = 0.25)
slice_head(.data, …, n, prop) and slice_tail() Select the first or last rows.
slice_head(mtcars, n = 5)
Logical and boolean operators to use with filter()
==  <  <=  is.na()  %in%  |  xor()
!=  >  >=  !is.na()  !  &
See ?base::Logic and ?Comparison for help.
ARRANGE CASES
arrange(.data, …, .by_group = FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
ADD CASES
add_row() Add one or more rows to a table.
add_row(cars, speed = 1, dist = 1)

Manipulate Variables
EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.
pull(.data, var = -1, name = NULL, …) Extract column values as a vector, by name or index.
pull(mtcars, wt)
select(mtcars, mpg, wt)
relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
relocate(mtcars, mpg, cyl, .after = last_col())
Use these helpers with select() and across(), e.g. select(mtcars, mpg:cyl)
contains(match)  num_range(prefix, range)  :, e.g. mpg:cyl
ends_with(match)  all_of(x)/any_of(x, …, vars)  -, e.g. -gear
starts_with(match)  matches(match)  everything()
MANIPULATE MULTIPLE VARIABLES AT ONCE
across(.cols, .funs, …, .names = NULL) Summarise or mutate multiple columns in the same way.
summarise(mtcars, across(everything(), mean))
c_across(.cols) Compute across columns in row-wise data.
transmute(rowwise(UKgas), total = sum(c_across(1:2)))
MAKE NEW VARIABLES
Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).
mutate(.data, …, .keep = "all", .before = NULL, .after = NULL) Compute new column(s). Also add_column(), add_count(), and add_tally().
mutate(mtcars, gpm = 1 / mpg)
transmute(.data, …) Compute new column(s), drop others.
rename(.data, …) Rename columns. Use rename_with() to rename with a function.
rename(cars, distance = dist)
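Several of the verbs above are usually chained with the pipe. A short sketch using the built-in mtcars data, assuming dplyr is installed; the column names gpm, n, and avg_mpg are chosen for this example:

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(mpg > 20) %>%                         # keep rows meeting a logical condition
  mutate(gpm = 1 / mpg) %>%                    # add a new column
  group_by(cyl) %>%                            # one group per cylinder count
  summarise(n = n(), avg_mpg = mean(mpg)) %>%  # one summary row per group
  arrange(desc(avg_mpg))                       # highest average first
```

Each step receives the table produced by the previous one, so the pipeline reads top to bottom in the order the operations happen.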
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at dplyr.tidyverse.org • dplyr 1.0.7 • Updated: 2021-07
Vectorized Functions
TO USE WITH MUTATE()
mutate() and transmute() apply vectorized functions to columns to create new columns.

Summary Functions
TO USE WITH SUMMARISE()
summarise() applies summary functions to columns to create a new table.

Combine Tables
COMBINE VARIABLES
COMBINE CASES
Data tidying with tidyr : : CHEAT SHEET
Tidy data is a way to organize tabular data in a consistent data structure across packages.
A table is tidy if:
• Each variable is in its own column
• Each observation, or case, is in its own row
Tidy data lets you access variables as vectors and preserve cases in vectorized operations.

Tibbles
AN ENHANCED DATA FRAME
Tibbles are a table format provided by the tibble package. They inherit the data frame class, but have improved behaviors:
• Subset a new tibble with [, a vector with [[ and $.
• No partial matching when subsetting columns.
• Display concise views of the data on one screen.
options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf) Control default display settings.
View() or glimpse() View the entire data set.
CONSTRUCT A TIBBLE
tibble(…) Construct by columns.
tibble(x = 1:3, y = c("a", "b", "c"))
tribble(…) Construct by rows.
tribble(~x, ~y,
  1, "a",
  2, "b",
  3, "c")
Both make this tibble:
A tibble: 3 × 2
      x y
  <int> <chr>
1     1 a
2     2 b
3     3 c
as_tibble(x, …) Convert a data frame to a tibble.
enframe(x, name = "name", value = "value") Convert a named vector to a tibble. Also deframe().

Reshape Data - Pivot data to reorganize values into a new layout.
pivot_longer(data, cols, names_to = "name", values_to = "value", values_drop_na = FALSE) "Lengthen" data by collapsing several columns into two. Column names move to a new names_to column and values to a new values_to column.
pivot_longer(table4a, cols = 2:3, names_to = "year", values_to = "cases")
table4a
country  1999  2000
A        0.7K  2K
B        37K   80K
C        212K  213K
becomes
country  year  cases
A        1999  0.7K
B        1999  37K
C        1999  212K
A        2000  2K
B        2000  80K
C        2000  213K
pivot_wider(data, names_from = "name", values_from = "value") The inverse of pivot_longer(). "Widen" data by expanding two columns into several. One column provides the new column names, the other the values.
pivot_wider(table2, names_from = type, values_from = count)
table2
country  year  type   count
A        1999  cases  0.7K
A        1999  pop    19M
A        2000  cases  2K
A        2000  pop    20M
B        1999  cases  37K
B        1999  pop    172M
B        2000  cases  80K
B        2000  pop    174M
C        1999  cases  212K
C        1999  pop    1T
C        2000  cases  213K
C        2000  pop    1T
becomes
country  year  cases  pop
A        1999  0.7K   19M
A        2000  2K     20M
B        1999  37K    172M
B        2000  80K    174M
C        1999  212K   1T
C        2000  213K   1T

Split Cells - Use these functions to split or combine cells into individual, isolated values.
unite(data, col, …, sep = "_", remove = TRUE, na.rm = FALSE) Collapse cells across several columns into a single column.
unite(table5, century, year, col = "year", sep = "")
table5
country  century  year
A        19       99
A        20       00
B        19       99
B        20       00
becomes
country  year
A        1999
A        2000
B        1999
B        2000
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", …) Separate each cell in a column into several columns. Also extract().
separate(table3, rate, sep = "/", into = c("cases", "pop"))
table3
country  year  rate
A        1999  0.7K/19M
A        2000  2K/20M
B        1999  37K/172M
B        2000  80K/174M
becomes
country  year  cases  pop
A        1999  0.7K   19M
A        2000  2K     20M
B        1999  37K    172M
B        2000  80K    174M
separate_rows(data, …, sep = "[^[:alnum:].]+", convert = FALSE) Separate each cell in a column into several rows.

Expand Tables
Create new combinations of variables or identify implicit missing values (combinations of variables not present in the data).
expand(data, …) Create a new tibble with all possible combinations of the values of the variables listed in … Drop other variables.
expand(mtcars, cyl, gear, carb)
complete(data, …, fill = list()) Add missing possible combinations of values of variables listed in … Fill remaining variables with NA.
complete(mtcars, cyl, gear, carb)

Handle Missing Values
Drop or replace explicit missing values (NA).
drop_na(data, …) Drop rows containing NA's in … columns.
drop_na(x, x2)
fill(data, …, .direction = "down") Fill in NA's in … columns using the next or previous value.
fill(x, x2)
replace_na(data, replace) Specify a value to replace NA in selected columns.
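The two pivot functions are inverses of each other, which a small round trip makes concrete. The sketch below assumes tidyr (and its dependency tibble) is installed and builds a table shaped like table4a; the numeric values are illustrative stand-ins for the abbreviated figures on the sheet:

```r
library(tidyr)

# A small wide table shaped like table4a.
wide <- tibble::tibble(
  country = c("A", "B", "C"),
  `1999`  = c(700, 37000, 212000),
  `2000`  = c(2000, 80000, 213000)
)

# Lengthen: the year columns collapse into a "year" / "cases" pair.
long <- pivot_longer(wide, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")

# Widen: pivot_wider() reverses the operation.
back <- pivot_wider(long, names_from = year, values_from = cases)
```

Note that after pivot_longer() the year values are character strings, since they came from column names; convert them if numeric years are needed.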
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at tidyr.tidyverse.org • tibble 3.1.2 • tidyr 1.1.3 • Updated: 2021–08
Nested Data
A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types.
Use a nested data frame to:
• Preserve relationships between observations and subsets of data. Preserve the type of the variables being nested (factors and datetimes aren't coerced to character).
• Manipulate many sub-tables at once with purrr functions like map(), map2(), or pmap() or with dplyr rowwise() grouping.
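A brief sketch of the nested workflow, assuming tidyr is installed. It nests the built-in mtcars data by cyl and fits one model per sub-table; base lapply() and vapply() are used here in place of the purrr functions named above, and the slope column is a name invented for the example:

```r
library(tidyr)

# One row per cylinder count; all other columns move into a list-column "data".
nested <- nest(mtcars, data = -cyl)

# Fit one linear model per sub-table.
models <- lapply(nested$data, function(d) lm(mpg ~ wt, data = d))

# Store one number per model next to the data it came from.
nested$slope <- vapply(models, function(m) coef(m)[["wt"]], numeric(1))
```

Keeping the sub-tables and their results in the same row is what preserves the relationship between each subset and what was computed from it.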
Data visualization with ggplot2 : : CHEAT SHEET

Basics
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms, visual marks that represent data points.

Geoms
Use a geom function to represent data points, use the geom's aesthetic properties to represent variables. Each function returns a layer.
GRAPHICAL PRIMITIVES
a <- ggplot(economics, aes(date, unemploy))
b <- ggplot(seals, aes(x = long, y = lat))
TWO VARIABLES
both continuous
e <- ggplot(mpg, aes(cty, hwy))
continuous bivariate distribution
h <- ggplot(diamonds, aes(carat, price))
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
Stats - An alternative way to build a layer.
A stat builds new variables to plot (e.g., count, prop).
[Diagram: data + stat + geom (x = x, y = ..count..) + coordinate system = plot]
Visualize a stat by changing the default stat of a geom function, geom_bar(stat="count"), or by using a stat function, stat_count(geom="bar"), which calls a default geom to make a layer (equivalent to a geom function). Use ..name.. syntax to map stat variables to aesthetics.
i + stat_density_2d(aes(fill = ..level..), geom = "polygon")
(stat_density_2d is the stat function, geom = "polygon" is the geom to use, aes() holds the geom mappings, and ..level.. is a variable created by the stat)
c + stat_bin(binwidth = 1, boundary = 10)
x, y | ..count.., ..ncount.., ..density.., ..ndensity..
c + stat_count(width = 1)
x, y | ..count.., ..prop..
c + stat_density(adjust = 1, kernel = "gaussian")
x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = T)
x, y, fill | ..count.., ..density..
e + stat_bin_hex(bins = 30)
x, y, fill | ..count.., ..density..
e + stat_density_2d(contour = TRUE, n = 100)
x, y, color, size | ..level..

Scales - Override defaults with scales package.
Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale.
n <- d + geom_bar(aes(fill = fl))
n + scale_fill_manual(
  values = c("skyblue", "royalblue", "blue", "navy"),
  limits = c("d", "e", "p", "r"), breaks = c("d", "e", "p", "r"),
  name = "fuel", labels = c("D", "E", "P", "R"))
(scale_ + aesthetic to adjust + prepackaged scale to use; values: scale-specific arguments; limits: range of values to include in mapping; breaks: breaks to use in legend/axis; name: title to use in legend/axis; labels: labels to use in legend/axis)
GENERAL PURPOSE SCALES
Use with most aesthetics
scale_*_continuous() - Map continuous values to visual ones.
scale_*_discrete() - Map discrete values to visual ones.
scale_*_binned() - Map continuous values to discrete bins.
scale_*_identity() - Use data values as visual ones.
scale_*_manual(values = c()) - Map discrete values to manually chosen visual ones.
scale_*_date(date_labels = "%m/%d", date_breaks = "2 weeks") - Treat data values as dates.
scale_*_datetime() - Treat data values as date times. Same as scale_*_date(). See ?strptime for label formats.
X & Y LOCATION SCALES
Use with x or y aesthetics (x shown here)
scale_x_log10() - Plot x on log10 scale.
scale_x_reverse() - Reverse the direction of the x axis.

Coordinate Systems
r <- d + geom_bar()
r + coord_cartesian(xlim = c(0, 5)) - xlim, ylim
The default cartesian coordinate system.
r + coord_fixed(ratio = 1/2) - ratio, xlim, ylim
Cartesian coordinates with fixed aspect ratio between x and y units.
ggplot(mpg, aes(y = fl)) + geom_bar()
Flip cartesian coordinates by switching x and y aesthetic mappings.
r + coord_polar(theta = "x", direction = 1) - theta, start, direction
Polar coordinates.
r + coord_trans(y = "sqrt") - x, y, xlim, ylim
Transformed cartesian coordinates. Set xtrans and ytrans to the name of a window function.
π + coord_quickmap()
π + coord_map(projection = "ortho", orientation = c(41, -74, 0)) - projection, xlim, ylim
Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.).

Faceting
Facets divide a plot into subplots based on the values of one or more discrete variables.
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
t + facet_grid(cols = vars(fl))
Facet into columns based on fl.
t + facet_grid(rows = vars(year))
Facet into rows based on year.
t + facet_grid(rows = vars(year), cols = vars(fl))
Facet into both rows and columns.
t + facet_wrap(vars(fl))
Wrap facets into a rectangular layout.
Set scales to let axis limits vary across facets.
t + facet_grid(rows = vars(drv), cols = vars(fl), scales = "free")
x and y axis limits adjust to individual facets:
"free_x" - x axis limits adjust
"free_y" - y axis limits adjust
Set labeller to adjust facet label:
t + facet_grid(cols = vars(fl), labeller = label_both)
fl: c  fl: d  fl: e  fl: p  fl: r
t + facet_grid(rows = vars(fl), labeller = label_bquote(alpha ^ .(fl)))
α^c  α^d  α^e  α^p  α^r

Position Adjustments
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge")
e + stat_ellipse(level = 0.95, segments = 51, type = "t") scale_x_sqrt() - Plot x on square root scale. Arrange elements side by side.
l + stat_contour(aes(z = z)) x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max) COLOR AND FILL SCALES (DISCRETE)
s + geom_bar(position = "fill")
Stack elements on top of one
Labels and Legends
x, y, z, fill | ..value.. another, normalize height. Use labs() to label the elements of your plot.
n + scale_fill_brewer(palette = "Blues")
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean) For palette choices: e + geom_point(position = "jitter") t + labs(x = "New x axis label", y = "New y axis label",
x, y, z, fill | ..value.. RColorBrewer::display.brewer.all() Add random noise to X and Y position of title ="Add a title above the plot",
each element to avoid overplotting. subtitle = "Add a subtitle below title",
f + stat_boxplot(coef = 1.5) n + scale_fill_grey(start = 0.2, A caption = "Add a caption below plot",
x, y | ..lower.., ..middle.., ..upper.., ..width.. , ..ymin.., ..ymax.. end = 0.8, na.value = "red") e + geom_label(position = "nudge") alt = "Add alt text to the plot",
B
Nudge labels away from points. <aes> = "New <aes>
<AES> <AES> legend title")
f + stat_ydensity(kernel = "gaussian", scale = "area") x, y
| ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width.. COLOR AND FILL SCALES (CONTINUOUS) s + geom_bar(position = "stack") t + annotate(geom = "text", x = 8, y = 9, label = “A")
Stack elements on top of one another. Places a geom with manually selected aesthetics.
e + stat_ecdf(n = 40) x, y | ..x.., ..y.. o <- c + geom_dotplot(aes(fill = ..x..))
e + stat_quantile(quantiles = c(0.1, 0.9), Each position adjustment can be recast as a function p + guides(x = guide_axis(n.dodge = 2)) Avoid crowded
o + scale_fill_distiller(palette = “Blues”) with manual width and height arguments: or overlapping labels with guide_axis(n.dodge or angle).
formula = y ~ log(x), method = "rq") x, y | ..quantile..
s + geom_bar(position = position_dodge(width = 1)) n + guides(fill = “none") Set legend type for each
e + stat_smooth(method = "lm", formula = y ~ x, se = T, o + scale_fill_gradient(low="red", high=“yellow") aesthetic: colorbar, legend, or none (no legend).
level = 0.95) x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + xlim(-5, 5) + stat_function(fun = dnorm,
o + scale_fill_gradient2(low = "red", high = “blue”,
mid = "white", midpoint = 25) Themes n + theme(legend.position = "bottom")
Place legend at "bottom", "top", "left", or “right”.
n = 20, geom = “point”) x | ..x.., ..y.. n + scale_fill_discrete(name = "Title",
ggplot() + stat_qq(aes(sample = 1:100)) o + scale_fill_gradientn(colors = topo.colors(6)) r + theme_bw() r + theme_classic() labels = c("A", "B", "C", "D", "E"))
x, y, sample | ..sample.., ..theoretical.. Also: rainbow(), heat.colors(), terrain.colors(), White background Set legend title and labels with a scale function.
cm.colors(), RColorBrewer::brewer.pal() with grid lines. r + theme_light()
e + stat_sum() x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun = "mean", geom = "bar")
SHAPE AND SIZE SCALES
r + theme_gray()
Grey background
r + theme_linedraw()
r + theme_minimal()
Zooming
p <- e + geom_point(aes(shape = fl, size = cyl)) (default theme). Minimal theme. Without clipping (preferred):
e + stat_identity() p + scale_shape() + scale_size() r + theme_dark() r + theme_void() t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
e + stat_unique() p + scale_shape_manual(values = c(3:7)) Dark for contrast. Empty theme.
With clipping (removes unseen data points):
r + theme() Customize aspects of the theme such
as axis, legend, panel, and facet properties. t + xlim(0, 100) + ylim(10, 20)
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6) r + ggtitle(“Title”) + theme(plot.title.postion = “plot”) t + scale_x_continuous(limits = c(0, 100)) +
r + theme(panel.background = element_rect(fill = “blue”)) scale_y_continuous(limits = c(0, 100))
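
The difference between zooming with and without clipping can be checked by looking at the data each plot holds after it is built. The following is a minimal sketch, assuming ggplot2 is installed; the names base_plot, zoomed, clipped, x_zoomed, and x_clipped and the limits 10 to 20 are illustrative choices, not part of the cheat sheet.

```r
library(ggplot2)

# Same plot as the cheat sheet's `t`, renamed here to avoid masking base::t().
base_plot <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# Zoom without clipping: the coordinate system only crops the view.
zoomed <- base_plot + coord_cartesian(xlim = c(10, 20))

# Zoom with clipping: the x scale replaces out-of-range values with NA.
clipped <- base_plot + xlim(10, 20)

# layer_data() returns the data a layer holds once the plot has been built.
x_zoomed  <- layer_data(zoomed, 1)$x
x_clipped <- layer_data(clipped, 1)$x

sum(is.na(x_zoomed))   # points lost when zooming with coord_cartesian()
sum(is.na(x_clipped))  # points lost when zooming with xlim()
```

Because scale limits discard data before stats are computed, summaries such as stat_smooth() fits can change under xlim() but not under coord_cartesian().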
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
R Syntax Comparison : : CHEAT SHEET
Dollar sign syntax
goal(data$x, data$y)
SUMMARY STATISTICS:
one continuous variable:
mean(mtcars$mpg)

Formula syntax
goal(y~x|z, data=data, group=w)
SUMMARY STATISTICS:
one continuous variable:
mosaic::mean(~mpg, data=mtcars)

Tidyverse syntax
data %>% goal(x)
SUMMARY STATISTICS:
one continuous variable:
mtcars %>% dplyr::summarize(mean(mpg))

! Sometimes particular syntaxes work, but are considered dangerous to use, because they are so easy to get wrong. For example, passing variable names without assigning them to a named argument.

formulas in base plots
Base R plots will also take the formula syntax, although it's not as commonly used
plot(mpg~disp, data=mtcars)
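
The three syntaxes can be compared on one small task. The sketch below computes mean mpg per cylinder count from mtcars; the grouping by cyl and the variable names are illustrative assumptions, and base R's aggregate() stands in for the formula interface (the cheat sheet itself uses mosaic). The tidyverse version runs only if dplyr is installed.

```r
# Overall mean, dollar sign syntax (same call as in the cheat sheet).
overall_mean <- mean(mtcars$mpg)

# Dollar sign syntax: vectors are pulled out of the data frame explicitly.
dollar_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Formula syntax: variables are named in a formula and looked up in `data`.
formula_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Tidyverse syntax: the data frame is piped into verbs that take bare names.
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
  tidy_means <- mtcars %>%
    group_by(cyl) %>%
    summarize(mean_mpg = mean(mpg))
  print(tidy_means)
}

print(dollar_means)
print(formula_means)
```

All three return the same group means; they differ in where the variable names are resolved, which is why named arguments such as data = help avoid mistakes.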
RStudio® is a trademark of RStudio, Inc. • CC BY Amelia McNamara • [email protected] • @AmeliaMN • science.smith.edu/~amcnamara/ • Updated: 2018-01
Tabular reporting with flextable : : CHEAT SHEET
Basics
The flextable package provides a framework for easily creating tables for reporting and publications.
Functions are provided to let users create tables, modify, format and define their content.
flextable()
[Figure: data.frame -> flextable]
GENERAL FUNCTION'S STRUCTURE
function(x, i, j, part, args)
x: flextable object
i, j: row & column selectors
part: flextable part
args: specific arguments

Format
GENERAL
get_flextable_defaults() : get flextable defaults formatting properties
set_flextable_defaults() : modify flextable defaults formatting properties
init_flextable_defaults() : re-init all values with the package defaults
style(pr_t, pr_p, pr_c) : modify flextable text, paragraphs and cells formatting properties (needs officer package)
pr_t: object of class fp_text
pr_p: object of class fp_par
pr_c: object of class fp_cell
TEXT
font(ft, fontname = "Brush Script MT")
fontsize(ft, size = 7)
italic(ft, italic = TRUE)
bold(ft, bold = TRUE)
color(ft, color = "#eb5555")
highlight(ft, color = "yellow")
BORDER
brdr <- fp_border(color = "#eb5555", width = 1.5)
ft <- flextable(data)
border_outer(ft, border = brdr)
border_inner(ft, border = brdr)
border_inner_v(ft, border = brdr)
border_inner_h(ft, border = brdr)
border_remove(ft)
vline_left(ft, border = brdr)
vline_right(ft, border = brdr)
hline_top(ft, border = brdr)
hline_bottom(ft, border = brdr)
vline(ft, j = 1:2, border = brdr)
hline(ft, i = 1:2, border = brdr)

Officer
fp_text() : Text formatting properties
color, font.size, bold, italic, underlined, font.family, vertical.align, shading.color
fp_par() : Paragraph formatting properties
text.align, padding, line_spacing, border, shading.color, padding.bottom, padding.top, padding.left, padding.right, border.bottom, border.left, border.top, border.right
fp_cell() : Cell formatting properties
border, border.bottom, border.left, border.top, border.right, vertical.align, margin, margin.bottom, margin.top, margin.left, margin.right, background.color, text.direction
fp_border(): border properties object
color, style, width
update(x, args): update an object of class fp_*

Layout
[1] 2.25
$heights
[1] 1.75
$aspect_ratio
[1] 0.78
nrow_part(ft, part = "body"):
6
dim(ft):
$widths
   A    B    C
0.75 0.75 0.75
$heights
[1] 0.25 0.25 0.25 0.25 0.25 0.25 0.25
(each column width: 0.75; each row height: 0.25)
dim_pretty(ft):
$widths
[1] 0.22 0.22 0.22
$heights
[1] 0.22 0.22 0.22 0.22 0.22 0.22 0.22
autofit(ft, add_w = w, add_h = h)
w = 0, h = 0: width: 0.22, height: 0.22
w = 0.2, h = 0: width: 0.42, height: 0.22

MULTI CONTENT
FUNCTION COMPOSE
compose(x, i, j, value = …, part, use_dot)
as_paragraph( Chunk 1 Chunk 2 Image 1 )
as_chunk(props), as_sub(), as_bracket(), as_sup(), as_b(), colorize(), as_highlight(), hyperlink_text()
ft <- flextable(data)
ft <- compose(ft, value = as_paragraph(
  as_chunk("chunk"),                            # renders: chunk
  as_bracket("bracket"),                        # renders: (bracket)
  as_b("bold"),                                 # renders: bold
  as_highlight("highlight", color = "yellow"),  # renders: highlight
  as_i("italic"),                               # renders: italic
  as_sub("sub"),                                # renders: sub
  as_sup("sup")                                 # renders: sup
))
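
The pieces on this page (create, format, add borders, compose multi-content cells, inspect layout) can be combined in one short script. This is a minimal sketch, assuming the flextable and officer packages are installed; the input head(mtcars[, 1:3]), the selected cell, and the chunk text are illustrative choices only.

```r
library(flextable)

# Build a flextable from a small data frame: 6 body rows, 3 columns.
ft <- flextable(head(mtcars[, 1:3]))

# Border properties come from officer; apply them around the whole table.
brdr <- officer::fp_border(color = "#eb5555", width = 1.5)
ft <- border_outer(ft, border = brdr)

# Text formatting follows the general structure function(x, i, j, part, args).
ft <- bold(ft, bold = TRUE, part = "header")
ft <- color(ft, color = "#eb5555", part = "header")

# Replace one body cell (row 1, column "mpg") with a multi-chunk paragraph.
ft <- flextable::compose(
  ft, i = 1, j = "mpg",
  value = as_paragraph(as_chunk("mpg: "), as_b("21"))
)

# Inspect the layout.
nrow_part(ft, part = "body")
dim(ft)$widths
```

Each function returns the modified flextable, so the calls chain naturally by reassigning ft (or with a pipe).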
ArData. • ardata.fr • Learn more at ardata-fr.github.io/flextable-book/ • package version 0.6.4 • Updated: 2021-03