A short introduction to R
I. White
September 2012
Introduction
R is an open source program for data manipulation, analysis and graphical display. It is
distributed under the terms of a general public license. The R home page is
http://www.r-project.org
These notes are based on the document ‘An introduction to R’. This can be accessed from
the help menu or online (on the R home page, click ‘manuals’), and can be consulted for more
details.
The aim of this first session in the statistics lab is a gentle introduction to R. Many students
find R difficult at first. The main thing at this stage is not to panic. You will have to become
reasonably familiar with the (UK) computer keyboard.
Work through the examples and exercises at your own pace. Where the notes show R
output, confirm the result. Where the notes show no output, attempt to obtain the output
yourself.
If you run out of time, skip to the section ‘Closing R and saving the workspace’ near the
end of these notes for instructions on closing R.
Preliminaries
In your ‘Documents’ folder, create a new folder and call it week01. This is where you should
put any data files or source code. It is also where most objects and files will be saved. For
this session, you need just three files. Get these by pointing your browser at
http://homepages.ed.ac.uk/imsw/msc/data
Download brains.csv, sheep.R, and qsolve.R to the new folder. On the desktop the file
names may be displayed as brains, sheep, and qsolve, but in R the names must be given
in full.
Starting R
Follow the sequence Start, All Programs, R, R 2.13.1.
An introductory message should appear in the command window. The window has a prompt
(>) where commands should be typed. In the rest of this document, this symbol at the start
of a line represents the prompt, and should not be typed.
The working folder
The ‘working’ folder or directory is where R looks for external data files. To specify the
working folder, type setwd("m:/week01") at the command prompt. You can use single
(') or double (") quotes. On teaching lab machines, m: is a network drive mapped to your
‘Documents’ folder.
Check that the downloaded files are visible to R with
> list.files()
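As a further check, the short sketch below reports the current working folder and tests for each of the three files by name. It assumes the files were saved under the names given in the Preliminaries; the variable name needed is used only for illustration.

```r
# Show the folder that R is currently using as its working folder
getwd()

# The three files needed for this session (names as given in the Preliminaries)
needed <- c("brains.csv", "sheep.R", "qsolve.R")

# One TRUE or FALSE per file: TRUE means R can see the file in the working folder
file.exists(needed)
```

If any value is FALSE, check that the file was downloaded to the right folder and that setwd() was given the right path.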
Getting help
For help on any command, for example list.files, type
> help(list.files)
Alternatively, type ?list.files at the prompt. Sometimes quotes are required:
> help("+")
If the help information seems rather technical, the examples (usually at the end) may be
more useful.
Arrow keys
The up and down arrow keys on the keyboard scroll through previous commands, which can
be edited. Move the cursor left or right with the left and right arrow keys. Characters can be
inserted by typing the appropriate key, or removed with the Delete key. Press the ‘Enter’
key to run the edited command.
This can save a great deal of typing.
Expressions and assignments
An expression is evaluated, printed, and the value is lost. E.g.
> 2 + 2
[1] 4
(The [1] at the start of the line is not part of the result and can be ignored.)
An assignment evaluates an expression and stores the value in a named variable, using the
<- symbol (< followed immediately by -). The result is not printed. E.g.
> total <- 2 + 2
The command
> total
[1] 4
is required to print the value.
Vector arithmetic
In R, a vector is an ordered collection of numbers. To create vector x, consisting of the five
numbers 10.4, 5.6, 3.1, 6.4 and 21.7, type
> x <- c(10.4,5.6,3.1,6.4,21.7)
The function c() concatenates (collects) the five numbers. Arithmetic on vectors is
performed element by element. With x as above and
> y <- c(7,2.1,4.6,3.6,1)
the expression
> x + y
[1] 17.4 7.7 7.7 10.0 22.7
adds 10.4 + 7, 5.6 + 2.1, etc.
Vectors used in an expression need not be the same length. Short vectors are recycled
until they match the length of the longest vector. With x as above, x + 1 recycles 1 five
times.
> x + 1
[1] 11.4 6.6 4.1 7.4 22.7
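Recycling also applies when the shorter vector has more than one element. The sketch below is an illustration only (the names long and short are not used elsewhere in these notes): a vector of length 2 is recycled three times to match a vector of length 6.

```r
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(10, 20)

# short is recycled as 10, 20, 10, 20, 10, 20 before the addition
recycled <- long + short
recycled    # 11 22 13 24 15 26
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but prints a warning.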
The main arithmetic operators are +, -, * (multiply), / (divide). Note that
> 1 + 2/4
[1] 1.5
is not the same as
> (1+2)/ 4
[1] 0.75
To raise to a power,
> 2^3
[1] 8
for example raises 2 to the power 3.
With x as above, try
> x^2
> 3*x^2 - 2*x + 1
Matrices
Several vectors of the same length can be combined to form a matrix, an array of numbers
arranged in rows and columns. For example,
> m <- matrix(c(x,y), nrow=5, ncol=2)
c(x,y) is a vector of length 10 comprising the five values of x followed by the five values of
y. By default, the matrix is filled by column rather than by row, i.e., the values of x go in
column 1, the values of y in column 2.
Another way to create the same matrix is
> m <- cbind(x,y)
See help(matrix), help(cbind).
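To see the effect of filling by column or by row, the following sketch builds the matrix both ways. It uses the byrow argument of matrix(); the name m.wide is chosen only for this illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(7, 2.1, 4.6, 3.6, 1)

# Filled by column (the default): x in column 1, y in column 2
m <- matrix(c(x, y), nrow = 5, ncol = 2)
dim(m)         # 5 rows and 2 columns

# Filled by row: x in row 1, y in row 2
m.wide <- matrix(c(x, y), nrow = 2, ncol = 5, byrow = TRUE)
m.wide[1, ]    # the five values of x
```

The second matrix is the transpose of the first, which could also be obtained with t(m).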
Functions
Functions such as log, exp, cos, sqrt have their usual meaning and operate element by
element. For example, log(x) is a vector with the same number of values as x.
sort(x) sorts the elements of x into increasing order.
Functions such as length operate on vectors but produce a single value:
length(x) gives the number of elements in x
sum(x) gives the total of the elements in x
mean(x), var(x) calculate mean and variance of the values in x
Try
> log(x)
> sqrt(x)
> length(x)
> max(x)
> sum(x)
Other functions produce a display, e.g.,
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.10    5.60    6.40    9.44   10.40   21.70
Some functions operate on matrices:
> rowSums(m)
[1] 17.4 7.7 7.7 10.0 22.7
> colSums(m)
[1] 47.2 18.3
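The functions above can be combined to reproduce mean(x) and var(x) step by step. The sketch below assumes the usual sample variance (divisor n - 1), which is what var() calculates; the names n, xbar and s2 are chosen only for this illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

n    <- length(x)                      # number of values
xbar <- sum(x) / n                     # mean, calculated from the total
s2   <- sum((x - xbar)^2) / (n - 1)    # variance, using vector arithmetic

c(xbar, mean(x))    # the two values should agree
c(s2, var(x))       # likewise
```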
Sequences of numbers
We often need sequences of regularly spaced numbers.
The expression 1:5, or seq(1,5), is shorthand for the sequence
[1] 1 2 3 4 5
The rep() function expands a sequence in various ways:
> rep(1:3, times=2)
generates the sequence
[1] 1 2 3 1 2 3
and
> rep(1:3,each=2)
generates
[1] 1 1 2 2 3 3
See help(seq), help(rep) for details.
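A few further variations, given as a sketch of options described in the help pages (the names s1, s2, r1 and r2 are used only here): seq() accepts a step size (by) or a required number of values (length.out), and the times argument of rep() can be a vector giving a separate count for each element.

```r
s1 <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(0, 1, length.out = 5)      # the same five values

r1 <- rep(1:3, times = c(3, 2, 1))   # 1 1 1 2 2 3
r2 <- rep(c("a", "b"), each = 2)     # "a" "a" "b" "b"
```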
Logical vectors
The elements of logical vectors are either FALSE or TRUE.
For example, with x as above, typing x > 10 at the prompt produces
[1] TRUE FALSE FALSE FALSE TRUE
Other comparison operators are < (less than), <= (less than or equal to), etc. The expression
!(x > 10) represents the ‘negation’ of x > 10, i.e. x <= 10.
> !(x > 10)
[1] FALSE TRUE TRUE TRUE FALSE
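Logical vectors can be counted and combined. The sketch below goes slightly beyond the operators listed above by using & (and) and | (or), which also work element by element; the name big is used only for this illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

big <- x > 10
sum(big)           # 2: TRUE counts as 1 and FALSE as 0

x > 5 & x < 11     # TRUE TRUE FALSE TRUE FALSE
x < 5 | x > 20     # FALSE FALSE TRUE FALSE TRUE
```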
Missing values
A missing value in a vector is indicated by the special value NA, which acts as a ‘placeholder’.
Any arithmetic operation on NA produces NA. E.g., if vector x has one or more missing values,
the result of sum(x) is NA.
To obtain the sum of the non-missing values, use
> sum(x, na.rm=TRUE)
Try this with
> x <- c(1,2,3,NA,5)
Note (i) this overwrites the previous version of x, (ii) the matrix m, created from x and y, is
unchanged: its first column remains the value of x when m was created.
The function is.na(x) produces a logical vector with value TRUE if the corresponding
element in x is NA, and FALSE otherwise.
> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
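The same ideas apply to other summary functions. As a brief sketch, the following counts the missing values in x and shows the effect of na.rm on mean():

```r
x <- c(1, 2, 3, NA, 5)

sum(is.na(x))            # 1: the number of missing values
mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 2.75, the mean of 1, 2, 3 and 5
```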
Selecting subsets of observations
Sometimes we need to do a calculation on a subset of the observations. Subsets are selected
by using an index vector inside square brackets.
The index vector is usually a logical vector, of the same length as the vector from which
elements are selected. Values corresponding to TRUE in the index vector are selected and
those corresponding to FALSE omitted. For example,
> x[!is.na(x)]
[1] 1 2 3 5
is a vector which contains just the non-missing values of x.
Alternatively, the index vector consists of positive integers, giving the positions of elements
to be selected. In this case the index vector can be of any length and the result is of the
same length as the index vector. For example, x[1:3] selects the first three elements of x,
while x[5] selects just the fifth element.
Subsets of a matrix:
> m[2,1]
displays the number in the second row and first column of m. Also try
> m[2:3,]
(the second and third rows of m), and
> m[,1]
(the first column of m).
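One point to watch when selecting with a logical condition is that a missing value in the condition produces a missing value in the result. The sketch below uses the vector x with one NA from the previous section; which() is an additional function, not covered above, that returns the positions of the TRUE values.

```r
x <- c(1, 2, 3, NA, 5)

x[c(1, 5)]               # 1 5: elements 1 and 5
x[x > 2]                 # 3 NA 5: the NA is carried through
x[!is.na(x) & x > 2]     # 3 5: missing values excluded first
which(x > 2)             # 3 5: positions (not values) where x > 2
```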
Objects
R has various types of object: vectors, matrices, lists, data frames, functions. Names of
objects should not include spaces. R is case sensitive: Total and total are different objects.
The command
> ls()
displays the names of objects created during the R session, and
> ls.str()
gives more details about each object.
You are now approximately half-way through the material, and can tidy up by removing all
objects created so far. Check what these are with ls() or ls.str(). To remove objects
use rm:
> rm(x, y, m, total)
(Include in the list any extra objects you have created.)
Alternatively, the command
> rm(list=ls())
removes everything, so use with care. Check once more with
> ls()
When you have removed all objects the response is the slightly mysterious ‘character(0)’.
Factors
The file sheep.R contains a few lines of R code such as might be typed at the R prompt.
Take a look at the file:
> file.show("sheep.R")
You should see the following:
Breed <- factor(rep(c("Black","Welsh","Cross"),c(5,5,6)))
Cu <- c(6.5, 7.9, 7.4, 6.8, 8.1, 10.4, 9.8, 11.1, 10.6, 9.2,
6.9, 9.2, 8.4, 7.6, 9.7, 8.9)
The variable Cu represents measurements on 16 sheep. The first 5 sheep are Scottish
Blackface, the next 5 are Welsh Mountain, and the last 6 are Blackface-Welsh Cross. Breed,
also of length 16, identifies the breed for each of the 16 animals. Breed is a factor: a
numerical or character vector used to classify (categorize) the elements of another vector
(Cu).
This arrangement is often used to represent data of this form. It is easily extended to allow
for multiple classification (by breed and sex, for example) by introducing more factors.
To run the code in the file,
> source("sheep.R")
Data Cu and Breed should now be in your R workspace: check with
> ls.str()
Check the factor values:
> Breed
[1] Black Black Black Black Black Welsh Welsh Welsh Welsh Welsh
[11] Cross Cross Cross Cross Cross Cross
Levels: Black Cross Welsh
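Note that the levels are listed in alphabetical order, not in the order in which they appear in the data. The following sketch shows how to inspect the levels and the integer codes that R stores for each animal (levels(), nlevels() and as.integer() are standard functions, though not otherwise used in these notes):

```r
Breed <- factor(rep(c("Black", "Welsh", "Cross"), c(5, 5, 6)))

levels(Breed)        # "Black" "Cross" "Welsh"
nlevels(Breed)       # 3
as.integer(Breed)    # 1 1 1 1 1 3 3 3 3 3 2 2 2 2 2 2
```

Each code is the position of the animal's breed in the list of levels, so Welsh is coded 3 and Cross is coded 2.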
To split the data into subsets by breeds,
> breed.list <- split(Cu, Breed)
> breed.list
$Black
[1] 6.5 7.9 7.4 6.8 8.1
$Cross
[1] 6.9 9.2 8.4 7.6 9.7 8.9
$Welsh
[1] 10.4 9.8 11.1 10.6 9.2
The output of the split() function is a list. Components of lists usually have names. In
the example, the components are named by breed, so that, for example, breed.list$Black
is the vector of five measurements on the Blackface sheep. Note that component Black of
list breed.list is accessed as breed.list$Black.
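As a sketch of other ways to inspect a list, the following repeats the definitions from sheep.R so that it can be run on its own. The double square bracket form is an alternative to $ that takes the component name as a character string.

```r
Breed <- factor(rep(c("Black", "Welsh", "Cross"), c(5, 5, 6)))
Cu <- c(6.5, 7.9, 7.4, 6.8, 8.1, 10.4, 9.8, 11.1, 10.6, 9.2,
        6.9, 9.2, 8.4, 7.6, 9.7, 8.9)
breed.list <- split(Cu, Breed)

names(breed.list)         # "Black" "Cross" "Welsh"
length(breed.list)        # 3: the number of components
breed.list[["Welsh"]]     # same as breed.list$Welsh
breed.list$Cross[1:2]     # 6.9 9.2: first two values for the Cross sheep
```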
Calculation by subsets
One way to calculate the mean value of Cu for each breed is to begin with
> mean(Cu[Breed == "Black"])
[1] 7.34
and repeat for the other two breeds. There are two quicker ways to do this.
lapply() applies a function to each component of a list:
> lapply(breed.list, mean)
$Black
[1] 7.34
$Cross
[1] 8.45
$Welsh
[1] 10.22
Alternatively, use tapply():
> tapply(Cu, Breed, mean)
Black Cross Welsh
 7.34  8.45 10.22
This creates a subset of Cu defined by each value of Breed and applies the specified function
(mean) to each subset.
To calculate variances instead of means:
> tapply(Cu, Breed, var)
A related function is table, which counts the number of observations for each level of the
factor. For details, see help(lapply), help(tapply), help(table).
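As an illustration, the sketch below counts the animals in each breed in two ways and then finds the largest value of Cu within each breed. It repeats the definitions from sheep.R so that it can be run on its own; the names counts and largest are used only here.

```r
Breed <- factor(rep(c("Black", "Welsh", "Cross"), c(5, 5, 6)))
Cu <- c(6.5, 7.9, 7.4, 6.8, 8.1, 10.4, 9.8, 11.1, 10.6, 9.2,
        6.9, 9.2, 8.4, 7.6, 9.7, 8.9)

counts <- table(Breed)               # Black 5, Cross 6, Welsh 5
tapply(Cu, Breed, length)            # the same counts, via tapply
largest <- tapply(Cu, Breed, max)    # Black 8.1, Cross 9.7, Welsh 11.1
```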
Data frames
A data frame is a list with components which are vectors all of the same length.
A data frame may also be regarded as a matrix with columns corresponding to the
components. It is displayed in matrix form, and its rows and columns can be extracted using
matrix subsetting (described above).
Creating data frames
Data frames can be created with the function data.frame(). For example, if Breed is a
factor, and Cu is a vector, both with 16 values, as above,
> sheep <- data.frame(Breed,Cu)
creates a data frame (sheep) with two columns (components), sheep$Breed and sheep$Cu.
Applied to a data frame, the summary() function provides a summary of each column, taking
account of its type (e.g. factor or numerical variable).
> summary(sheep)
Other useful functions for data frames:
> subset(sheep, Breed == "Black")
> page(sheep, method = "print")
> ls.str(sheep)
Another way to create a data frame is to use read.csv() to read contents from an external
file. See below.
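The matrix-style subsetting mentioned earlier can be sketched as follows, using the sheep data frame. The definitions from sheep.R are repeated so that the example can be run on its own; the name high is used only for this illustration.

```r
Breed <- factor(rep(c("Black", "Welsh", "Cross"), c(5, 5, 6)))
Cu <- c(6.5, 7.9, 7.4, 6.8, 8.1, 10.4, 9.8, 11.1, 10.6, 9.2,
        6.9, 9.2, 8.4, 7.6, 9.7, 8.9)
sheep <- data.frame(Breed, Cu)

dim(sheep)                        # 16 rows and 2 columns
names(sheep)                      # "Breed" "Cu"
sheep[1:3, ]                      # the first three rows
sheep[1:3, "Cu"]                  # 6.5 7.9 7.4
high <- sheep[sheep$Cu > 10, ]    # rows with Cu above 10 (three Welsh sheep)
```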
Attaching and detaching data frames
The attach() function makes the components of a list or data frame visible under their
component name. E.g., after
> attach(sheep)
sheep$Breed can be accessed simply as Breed. However, Breed is a copy of sheep$Breed
and any change made to Breed does not affect the data frame.
To detach a list or data frame, use the detach function, e.g.
> detach(sheep)
Avoid attaching a data frame more than once. It is prudent to detach the data frame as
soon as you have finished using its components.
Many R functions have a ‘data=’ option which makes attaching and detaching data frames
unnecessary.
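As a sketch of working without attach(), the example below uses two standard functions that are not otherwise covered in these notes: aggregate(), which has a ‘data=’ option, and with(), which evaluates an expression using the columns of a data frame. The name breed.means is used only for this illustration.

```r
Breed <- factor(rep(c("Black", "Welsh", "Cross"), c(5, 5, 6)))
Cu <- c(6.5, 7.9, 7.4, 6.8, 8.1, 10.4, 9.8, 11.1, 10.6, 9.2,
        6.9, 9.2, 8.4, 7.6, 9.7, 8.9)
sheep <- data.frame(Breed, Cu)

# Mean of Cu for each breed, taking both variables from the data frame
breed.means <- aggregate(Cu ~ Breed, data = sheep, FUN = mean)
breed.means

# Overall mean of Cu, again without attaching the data frame
with(sheep, mean(Cu))
```

The means agree with those obtained from tapply() in the section on calculation by subsets.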
Reading data from files
Large data objects are usually read from external files created by an editor or spreadsheet
rather than entered at the keyboard. An entire data frame can be read directly with the
read.table() function. On this course we will usually assume that data is available in
‘comma separated’ format and use a special version of this function (read.csv). Data in
this format can be opened equally easily in R or a spreadsheet program.
Reading a data frame
The first line of the data file should have a name for each variable. Each additional line of
the file should consist of a row label followed by values for each variable. Values on each
line should be separated by commas. Character values are usually quoted, and must be if
they include spaces.
Here are the first few lines of brains.csv.
"body.wt","brain.wt"
"African Elephant",6654,5712
"Asian Elephant",2547,4603
"Giraffe",529,680
"Horse",521,655
"Cow",465,423
read.csv() can then be used to read the data frame directly:
> mammals <- read.csv("brains.csv")
creates a data frame mammals with two variables, body.wt and brain.wt, and row labels
("African Elephant", etc).
In this example the row labels are obviously informative, but often they are not and are
omitted from the data file:
"body.wt","brain.wt"
6654,5712
2547,4603
529,680
521,655
465,423
In this case, rows would be given default labels 1, 2, ...
See help(read.table).
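The handling of row labels can be tried without the full data file. The sketch below writes the first few lines shown above to a temporary file (a stand-in for brains.csv, created with tempfile()) and reads it back; the names tmp and small are used only for this illustration.

```r
# Create a small comma separated file containing three of the rows shown above
tmp <- tempfile(fileext = ".csv")
writeLines(c('"body.wt","brain.wt"',
             '"African Elephant",6654,5712',
             '"Asian Elephant",2547,4603',
             '"Giraffe",529,680'), tmp)

# The header has one fewer entry than the data lines,
# so the first column is used for the row labels
small <- read.csv(tmp)
names(small)                     # "body.wt" "brain.wt"
row.names(small)                 # the three species names
small["Giraffe", "brain.wt"]     # 680

unlink(tmp)                      # remove the temporary file
```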
Graphics
Functions such as plot() and hist() produce a graph complete with axes, labels and titles.
These commands start a new plot, erasing any existing plot.
The plot() function
The plot() function is generic. This means the type of plot produced depends on the class
of the first argument (vector, matrix, data frame, etc). In the examples which follow, x and
y are numerical vectors, f is a factor, and df is a data frame. Here we meet a new type of
R object: y ~ x is an example of an R formula.
plot(x,y)
produces a scatterplot of y (on the vertical axis) against x (on the horizontal axis).
plot(f)
generates a bar plot of factor f.
plot(f,y)
produces boxplots of y (numeric vector) for each level of f (factor).
plot(df)
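produces a scatterplot matrix showing each variable in data frame df plotted against every
other (for a data frame with exactly two columns, a single plot of the second against the first).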
> plot(y ~ x)
and
> plot(y ~ f)
are the same as plot(x,y) and plot(f,y) but have a ‘data=’ option.
hist(x) produces a histogram of the values in x.
stem(x) produces a ‘stem and leaf’ plot, similar to a histogram. See help(stem).
Extra arguments for plot()
xlab="label" and ylab="label" create labels for the x and y axes.
The type of plot is controlled by the type option, e.g.
type="p" points (the default)
type="l" lines
Low-level plotting commands
These augment a high-level plot by adding points, lines or text:
points(x,y)          Adds points to the current plot.
lines(x,y)           Adds lines to the current plot.
text(x, y, labels)   Adds text at the points given by vectors x, y. labels is a character
                     vector of the same length as x and y.
abline(a,b)          Adds a line with slope b and intercept a to the current plot.
abline(h=y)          Adds a horizontal line at y-coordinate y.
abline(v=x)          Adds a vertical line at x-coordinate x.
See help(plot.default)
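The following sketch combines a high-level plot with two of the low-level commands. It reuses the vectors x and y from the section on vector arithmetic; the axis labels and the pos argument of text() (which places each label above its point) are choices made only for this illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(7, 2.1, 4.6, 3.6, 1)

# High-level command: starts a new scatterplot with labelled axes
plot(x, y, xlab = "x values", ylab = "y values", type = "p")

# Low-level commands: add to the existing plot
abline(h = mean(y), lty = "dashed")    # horizontal line at the mean of y
text(x, y, labels = 1:5, pos = 3)      # number each point, label placed above it
```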
Example
Average body and brain weights (in kg and g respectively) for 62 species of mammal are
given in brains.csv. If you have not already done so, create a data frame called mammals
with
> mammals <- read.csv("brains.csv")
Look at the first 10 rows of the data frame:
> mammals[1:10,]
The same result is obtained with head(mammals, n=10).
Look for a relationship between brain and body weight:
> plot(brain.wt ~ body.wt, data = mammals)
Can this be improved? Try log scales on x and y axes:
> plot(brain.wt ~ body.wt, data = mammals, log="xy")
Identify some interesting points:
> attach(mammals)
> identify(body.wt, brain.wt, labels=row.names(mammals))
Type the above in the command window then left-click on a few points in the graphics
window. To finish, right-click in the graphics window and choose ‘Stop’.
> detach(mammals)
For another type of plot, try
> plot(Cu ~ Breed, data=sheep)
This produces a ‘boxplot’ for each sheep breed.
For an overview of the graphics capabilities of R, try
> demo(graphics)
Writing your own function
R does most of its calculations using functions. Like any R object, a function is printed by
typing the name of the object. Try
> sd
This is a simple function to calculate a standard deviation, making use of another function
(var) which calculates a variance. We see that this R function is designed to deal with
vectors, matrices or data frames as input. The general form of a simple function is
function(x) { function body }
where the function body is a series of one or more R expressions and assignments. The
output of the function is the value of the final expression or assignment.
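As a minimal sketch of this general form, the function below returns the difference between the largest and smallest values of a vector. The name spread is hypothetical and chosen only for this illustration.

```r
# 'spread' is a hypothetical name, used only in this example
spread <- function(x) {
  r <- range(x)     # r[1] is the minimum, r[2] is the maximum
  r[2] - r[1]       # the final expression is the value of the function
}

spread(c(10.4, 5.6, 3.1, 6.4, 21.7))    # 18.6
```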
Occasionally we need to do a non-standard calculation for which an in-built function does
not exist. As an example, the following R function takes a single vector of three coefficients
(a, b, c) as input and solves the quadratic ax^2 + bx + c = 0. The checks (if ...)
can be omitted but produce useful error messages when things go wrong. The value of the
function is a vector of length 2 containing the two solutions.
qsolve <- function (x)
{
    if (length(x) != 3)
        stop("Input must be single vector with 3 values")
    a <- x[1]
    b <- x[2]
    c <- x[3]
    test <- b^2 - 4 * a * c
    if (test < 0)
        stop("No real roots")
    (sqrt(test) * c(-1, 1) - b)/(2 * a)
}
Source this code with
> source("qsolve.R")
You should now have a new object called qsolve in your workspace. Check with ls().
View the function by typing its name:
> qsolve
Now try
> qsolve(c(1,-3,2))
[1] 1 2
The roots of the quadratic equation x^2 − 3x + 2 = 0 are x = 1 and x = 2.
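The result can be checked by substituting the roots back into the quadratic, and the error checks can be tried with an equation that has no real roots. The sketch below repeats the definition of qsolve so that it can be run on its own; try() is a standard function, not covered above, that lets the code continue after an error. The names roots and bad are used only here.

```r
qsolve <- function (x)
{
    if (length(x) != 3)
        stop("Input must be single vector with 3 values")
    a <- x[1]
    b <- x[2]
    c <- x[3]
    test <- b^2 - 4 * a * c
    if (test < 0)
        stop("No real roots")
    (sqrt(test) * c(-1, 1) - b)/(2 * a)
}

roots <- qsolve(c(1, -3, 2))
roots^2 - 3 * roots + 2       # 0 0: both roots satisfy the equation

# x^2 + 1 = 0 has no real roots, so qsolve stops with an error message
bad <- try(qsolve(c(1, 0, 1)), silent = TRUE)
inherits(bad, "try-error")    # TRUE
```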
Minimising a function
Sometimes we need to find the value of x which produces the smallest possible value of a
function f (x). For example, given the numbers
> y <- c(16,22,21,20,23,21,19,15,13,23,17,20,29,18,22,16,25)
find the value of m which minimises the sum of squares
(16 − m)^2 + (22 − m)^2 + ... + (25 − m)^2
For this particular function, the answer is the mean of the 17 numbers (20.0), but for now
pretend that we do not know this. In R, create a function SSQ and use optimize to find
the value of m required:
> SSQ <- function(m) { sum( (y-m)^2 ) }
> res <- optimize(SSQ, interval = range(y))
> res
$minimum
[1] 20
$objective
[1] 254
This finds the value of m (confusingly labelled ‘minimum’) within the range of the data for
which SSQ(m) is a minimum. The minimum value of the function – in this case the sum
of squares about the mean – is also given (labelled ‘objective’). Sums of squares play an
important role in regression and analysis of variance calculations.
Use plot to see the quadratic curve (sapply is a version of lapply that returns a vector,
where possible, instead of a list).
> mvals <- seq( min(y), max(y), length.out = 21)
> SSQvals <- sapply( mvals, SSQ )
> plot(mvals, SSQvals, type = "l")
> abline(v = res$minimum, h= res$objective, lty = "dashed")
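Since the answer is known in this case, the result from optimize can be compared with the mean of y. The sketch below repeats the definitions above so that it can be run on its own. Because optimize is a numerical search, the agreement is expected to be close but not necessarily exact.

```r
y <- c(16, 22, 21, 20, 23, 21, 19, 15, 13, 23, 17, 20, 29, 18, 22, 16, 25)
SSQ <- function(m) { sum( (y - m)^2 ) }
res <- optimize(SSQ, interval = range(y))

c(res$minimum, mean(y))            # value of m found, and the mean of y
c(res$objective, SSQ(mean(y)))     # minimum found, and SSQ evaluated at the mean
```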
Pull-down menus
In R, results are nearly always obtained by typing a command at the command window
prompt. For a small subset of commands, an alternative is provided by the drop-down menus
at the top of the screen (File, Edit, etc). The following table gives the alternative for some
of the commands used in this document.
Command window          Pull-down menu
setwd("m:/week01")      File, Change dir...
ls()                    Misc, List objects
rm(list=ls())           Misc, Remove all objects
file.show("sheep.R")    File, Display file(s)...
source("sheep.R")       File, Source R code
q()                     File, Exit
Saving output
You can save a record of the current session. Typing
> savehistory("session.R")
saves a record of all commands used in the current session (choose your own name for the
file). The file can be edited to remove commands that did not work, mistypings, etc. What
remains (if anything does) is a useful reminder of what you did and how you did it.
On Windows, a transcript of the session (commands and output) can be saved via the
pull-down menu File, Save to File....
Closing R and saving the workspace
To close R, type q() at the command prompt. You will be given the option of saving your
work. If you choose to save, the contents of the workspace (R objects created and not
deleted during this session) will be saved in a binary file (.RData) which (on Windows) the
system recognises and displays with an R icon. Clicking this icon restarts R with the saved
objects reloaded. Alternatively, specify the working folder (see ‘The working folder’ above) then type
> load(".RData")
Save work that you intend to revisit in the future. Today’s workspace does not need to be
saved. If you do save the workspace, R work on other data (e.g. on a different project) should
use a new working folder. For example, if you save this week’s work, create a new folder
(week02) before you begin next week’s work.
More ...
These notes give a very brief introduction to some basic capabilities of R. Later we will learn
about functions for regression and analysis of variance (lm), maximum likelihood estimation
(mle), and significance testing (t.test, etc). There is much more which is beyond the
scope of this course.
Some reading
Venables, W.N. and B.D. Ripley, Modern Applied Statistics with S
Thorough, wide-ranging, expensive. S is the proprietary software that inspired R. Books on
R and S are almost interchangeable.
Crawley, M.J. The R Book
Big and weighty (literally). Lacks the depth of Venables and Ripley. Copy in QGGA room.
Dalgaard, P. Introductory Statistics with R
Short introduction to R and basic statistics. Examples mostly medical.