Lecture 4: R-Blocks
Data Frames
Ivan Belik
Assembled and built based on:
https://openlibra.com/en/book/download/an-introduction-to-r-2
R: Data Frames
• A data frame is a two-dimensional data structure.
• It is similar to a list, BUT all components of a data frame have equal length.
In data frames:
• Each component forms a column
• The contents of each component (i.e., column) form the rows
• Consider the built-in R data frame “BOD”:
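A minimal sketch of what this looks like at the console. It assumes only that the datasets package (which ships with R) is attached, as it is by default.

```r
# BOD is a small built-in data frame (biochemical oxygen demand).
BOD

# Each component is a column, and all columns have the same length.
str(BOD)
lengths(BOD)   # number of elements in each column
```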
R: Data Frames
• Let’s start with loading some datasets.
• The easiest way to load data is to use the data() function.
• If entered without arguments, it will bring up a list of all datasets that come bundled with R.
R: Data Frames
• To load a data frame, simply pass its name as an argument to the data() function
• For example, to load the built-in data frame BOD:
• Running the class() function, we can verify that BOD is indeed a data frame object:
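The two steps above can be sketched as follows (base R only, no assumptions beyond the built-in BOD data set):

```r
data(BOD)      # load the built-in data set into the workspace
class(BOD)     # "data.frame"
```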
R: Data Frames
• We can find names of components with the names() function:
• “BOD” – built-in R data frame:
• “iris” – data frame from the built-in package “datasets”:
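A short sketch of names() applied to both data frames. Note that R is case sensitive, so the data set is called iris, not IRIS.

```r
data(BOD)
data(iris)

names(BOD)    # component (column) names of BOD
names(iris)   # component (column) names of iris
```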
To access all datasets known to R type
R: Data Frames
• We can also check some characteristics:
• dim() – the dimensions of the data frame (rows, columns)
• nrow() – the number of observations
• ncol() – the number of variables
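Applied to the built-in BOD data frame, these three functions can be tried as follows:

```r
data(BOD)

dim(BOD)    # number of rows, then number of columns
nrow(BOD)   # number of observations (rows)
ncol(BOD)   # number of variables (columns)
```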
R: Data Frames
• To save a data frame to a file, use the save() function:
• Note:
• Depending on your R installation, you may need to write the file path with double backslashes ( \\ ):
save(BOD, file = "C:\\Files\\BOD.Rdata")
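A portable sketch of the same idea. The temporary path below is an assumption chosen so that the example runs on any machine; replace it with your own folder.

```r
data(BOD)

# Hypothetical location: a temporary file instead of C:\Files\BOD.Rdata.
rdata_path <- file.path(tempdir(), "BOD.Rdata")

save(BOD, file = rdata_path)
file.exists(rdata_path)   # TRUE once the file has been written
```

Forward slashes ("C:/Files/BOD.Rdata") also work on Windows, which avoids the need to escape backslashes.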
R: Data Frames
• To load a saved data frame back into R, use the load() function:
• If you are a Mac user, you can find out how to specify the file path at the following link:
• http://rischanlab.github.io/SaveLoad.html
• It will be similar to the following:
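A minimal round-trip sketch. The temporary path is an assumption for illustration; on a Mac it could instead be a path starting with ~/.

```r
data(BOD)
rdata_path <- file.path(tempdir(), "BOD.Rdata")   # hypothetical path
save(BOD, file = rdata_path)

rm(BOD)                      # remove the workspace copy
loaded <- load(rdata_path)   # restores the object under its original name
loaded                       # names of the objects that were restored
head(BOD)
```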
R: Data Frames
• Frequently, you will create or obtain data using “external” software:
• One of the most common data formats is .csv
• CSV stands for comma-separated values
• The CSV format is supported by most popular data analytics tools, including R
• It can be created in Excel or even in a simple text editor
• A typical .csv file looks as follows:
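As an illustration, the sketch below writes a small CSV file and prints its raw text. The column names follow the later slides (Col1, Col2, Col3); the values are made up.

```r
# Hypothetical contents: a header line plus four observations.
csv_lines <- c(
  "Col1,Col2,Col3",
  "1,10,a",
  "2,20,b",
  "3,30,c",
  "4,40,d"
)

csv_path <- file.path(tempdir(), "my_csv.csv")   # hypothetical location
writeLines(csv_lines, csv_path)

# Show the file exactly as a text editor would display it.
cat(readLines(csv_path), sep = "\n")
```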
R: Data Frames
• Assume that we have a my_csv.csv file at the following location:
"C:\\Files\\my_csv.csv"
• We can read it into R using the read.csv() function:
Note for Mac users:
The file path can start with ~/
For example: "~/Desktop/my_csv.csv" if “my_csv.csv” is located in the Desktop folder
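A self-contained sketch of read.csv(). It first creates a small file in a temporary folder (an assumption, so that the example runs anywhere) and then reads it back as a data frame.

```r
csv_path <- file.path(tempdir(), "my_csv.csv")   # hypothetical location
writeLines(
  c("Col1,Col2,Col3", "1,10,a", "2,20,b", "3,30,c", "4,40,d"),
  csv_path
)

my_data <- read.csv(csv_path)   # the first line is used as column names
my_data
class(my_data)                  # "data.frame"
```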
R: Data Frames
• The read.csv() function can take the following arguments:
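The original argument list is not reproduced here. As a sketch, the example below uses a few commonly needed arguments of read.csv(); the file contents and names are made up for illustration.

```r
# Hypothetical file using ";" as separator and "," as decimal mark.
csv_path <- file.path(tempdir(), "semicolon.csv")
writeLines(c("id;value;label", "1;2,5;a", "2;NA;b", "3;4,0;c"), csv_path)

my_data <- read.csv(
  csv_path,
  header = TRUE,             # first line holds the column names
  sep = ";",                 # field separator
  dec = ",",                 # decimal mark
  na.strings = "NA",         # strings to treat as missing values
  stringsAsFactors = FALSE   # keep text columns as character vectors
)
str(my_data)
```

See ?read.csv for the full list of arguments and their defaults.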
R: Data Frames
• We can also read data frames directly from the Internet if we have a valid link
• For example, we can load a csv file from Statistics Norway (www.ssb.no):
https://data.ssb.no/api/v0/dataset/85430.csv?lang=en
R: Data Frames
• You can also download the csv file to a specific folder
• So that you have a physical copy on your disk
download.file(url = "https://data.ssb.no/api/v0/dataset/85430.csv?lang=en",
              destfile = "C:/Files/data_from_web.csv")
• Warnings may appear, since data from the web is frequently not well formatted
• Then, the downloaded csv file will appear in the specified location:
• "C:/Files/data_from_web.csv"
• Note: your file path will be different, since you will save to a local folder on your own PC
• Later you can load the downloaded file with the read.csv() function (see the previous slides)
R: Data Frames
• Once again, you can read a csv file from your local drive:
• R opens my_csv.csv:
• This data frame has three variables (Col1, Col2, Col3) and the corresponding data (4 observations for each variable)
R: Data Frames
• Now, we can manipulate the imported data and write all changes back to my_csv.csv
• row.names=FALSE is required to avoid adding a new column of row names to the updated csv file (see next slide)
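A sketch of this read, modify, write cycle. The file location, its contents, and the manipulation (doubling Col2) are assumptions for illustration.

```r
csv_path <- file.path(tempdir(), "my_csv.csv")   # hypothetical location
writeLines(
  c("Col1,Col2,Col3", "1,10,a", "2,20,b", "3,30,c", "4,40,d"),
  csv_path
)

my_data <- read.csv(csv_path)
my_data$Col2 <- my_data$Col2 * 2   # an example manipulation

# Write the changes back without an extra column of row names.
write.csv(my_data, file = csv_path, row.names = FALSE)

read.csv(csv_path)   # same three columns, updated values
```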
R: Data Frames
• If we omit row.names=FALSE, we will get the following result:
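The effect can be reproduced with a small made-up data frame: by default write.csv() also writes the row names, and they come back as an extra first column when the file is read again.

```r
df <- data.frame(Col1 = 1:4, Col2 = c(10, 20, 30, 40))   # hypothetical data
csv_path <- file.path(tempdir(), "with_row_names.csv")   # hypothetical location

write.csv(df, file = csv_path)   # row.names defaults to TRUE

reread <- read.csv(csv_path)
names(reread)   # an unnamed leading column appears, read back as "X"
ncol(reread)    # one column more than the original
```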
R: Data Frames
• We can create a data frame from vectors:
# 1. Create DataFrame from vectors
employee <- c('Lasse Lien', 'Eirik Knudsen', 'Ivan Belik')
salary <- c(25300, 24400, 23800)
start_date <- as.Date(c('2020-3-1','2019-3-25','2018-3-14'))
employ_data <- data.frame(employee, salary, start_date)
employ_data
• We can create a data frame from a matrix:
# 2. Create DataFrame from Matrix
x <- matrix(data=1:9, nrow=3, ncol=3)
y <- data.frame(x)
y
Data Frame Manipulations
R: Manipulating Data Frames
• Consider the built-in data frame “iris” for our future work:
> data("iris")
• The first thing to note about data frames:
• It is usually not very helpful to print the whole object to the screen
• R will literally print all the data to the screen (this is OK if you are dealing with small data frames)
• Most data frames are large, so it is important to explore their structure
R: Manipulating Data Frames
• First, the best thing to do is to use the names() and dim() functions
• names() gives you all variable names of the data frame
• dim() gives you an idea of the size of the data frame
• names(iris): 5 variables (columns) in iris
• dim(iris): 150 rows (observations) and 5 columns (variables) in iris
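These calls can be tried directly. The head() call at the end is an addition not mentioned on the slide; it shows only the first few rows instead of printing everything.

```r
data("iris")

names(iris)     # the five variable names
dim(iris)       # number of rows, then number of columns
head(iris, 3)   # first three rows only
```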
R: Extraction
• Most of the basic extraction principles that we used for matrices, namely [ ], also work for data frames.
• But you should remember that data frames are a special type of list
• This means that we can use $ to retrieve data.
• Let’s consider the built-in “iris” data frame and its columns (i.e., variables):
> data ("iris")
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
• For example, let’s try to extract the variable called "Petal.Width" (see next slide)
R: Extraction
• We know that Petal.Width is the 4th variable in the “iris” data frame
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
• You can combine the $ notation and the [ ] notation, since $ extracts a vector and vectors are indexed via [ ]
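A sketch of the equivalent ways to extract Petal.Width, followed by the combined $ and [ ] notation:

```r
data("iris")

# $ extracts one column as a vector.
petal_width <- iris$Petal.Width

# The same column via matrix-style and list-style indexing.
identical(petal_width, iris[, 4])               # TRUE
identical(petal_width, iris[["Petal.Width"]])   # TRUE

# $ returns a vector, so [ ] can index it directly.
iris$Petal.Width[1:5]   # first five values
```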
R: Extraction
• You can manipulate a data frame’s variables in the same way as vectors
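For instance, the usual vector operations apply to a column extracted with $. The variable name petal_mm is made up for illustration (iris measurements are in centimetres).

```r
data("iris")

mean(iris$Sepal.Length)                       # summary of one column
iris$Sepal.Length[iris$Sepal.Length > 7.5]    # logical indexing
petal_mm <- iris$Petal.Length * 10            # element-wise arithmetic, cm to mm
head(petal_mm)
```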
R: The with() function
• It can be inconvenient to use the $ notation when you need many variables of the same data frame
• R has a function that makes things easier
• Whenever you need to extract or index multiple variables and you don’t feel like typing dataset$variable.name each time, just use the with() function:
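A small comparison of the two styles. The ratio computed here is an arbitrary example.

```r
data("iris")

# With $ notation, the data frame name is repeated for every variable.
ratio_dollar <- mean(iris$Sepal.Length / iris$Sepal.Width)

# with() evaluates the expression inside the data frame.
ratio_with <- with(iris, mean(Sepal.Length / Sepal.Width))

identical(ratio_dollar, ratio_with)   # TRUE
```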
R: Subsetting
• The subset() function is the easiest way to select variables and observations:
• Here, we retrieve the iris rows with Sepal.Length in the open interval (7.0, 7.3)
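The selection described above can be written as:

```r
data("iris")

# Keep rows whose Sepal.Length lies strictly between 7.0 and 7.3.
long_sepals <- subset(iris, Sepal.Length > 7.0 & Sepal.Length < 7.3)
long_sepals
```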
R: Subsetting
• The subset() function can also filter on conditions that combine several variables of the data frame:
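A sketch that combines conditions on two variables and also selects columns. The particular condition and columns are chosen for illustration only.

```r
data("iris")

wide_setosa <- subset(
  iris,
  Species == "setosa" & Sepal.Width > 4.0,          # row conditions
  select = c(Sepal.Length, Sepal.Width, Species)    # columns to keep
)
wide_setosa
```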
R: Editing
• To edit your data manually as a spreadsheet, you can use the edit() function
• Type edit(iris) and you will get the following basic spreadsheet (R data editor):
• It is not very stable. If you have a choice, please use another tool (like MS Excel) for this purpose
R: Editing
• Let’s do some manual data entry.
• We will use the $ operator to create a new (6th) variable in the “iris” data frame:
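A minimal sketch. The new variable name Petal.Area and its definition are made up, and the example works on a copy (iris_ext) so that the built-in data set stays unchanged.

```r
data("iris")
iris_ext <- iris   # work on a copy

# Assigning to a name that does not exist yet creates a new column.
iris_ext$Petal.Area <- iris_ext$Petal.Length * iris_ext$Petal.Width

names(iris_ext)
ncol(iris_ext)   # 6
```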
R: Function transform()
• To edit or transform a number of variables at once, the transform() function can be used:
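As a sketch, transform() can overwrite an existing variable and add a new one in a single call. The column Petal.Ratio and the scaling factor are assumptions for illustration.

```r
data("iris")

iris_t <- transform(
  iris,
  Sepal.Length = Sepal.Length * 10,           # overwrite an existing variable
  Petal.Ratio = Petal.Length / Petal.Width    # hypothetical new variable
)
head(iris_t, 3)
```

transform() returns a modified copy; the original iris data frame is left as it was.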
R: More about objects and modes
• To detect the type of object you are dealing with, use the class() function
• First, let’s extract six components of the iris data frame
• Second, apply class() to each of them and check the result
• Factors (e.g., Species) are special vectors that carry an attribute called levels
• They are different from numeric vectors (see next slide)
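A shortened sketch with two of the components, using the names Sepal_Length and Species that appear on the following slides:

```r
data("iris")

Sepal_Length <- iris$Sepal.Length
Species <- iris$Species

class(iris)           # "data.frame"
class(Sepal_Length)   # "numeric"
class(Species)        # "factor"
```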
R: More about objects and modes
• To make it clear, print Sepal_Length and Species:
• In the given example we can see that printing each object produces different results:
R: More about objects and modes
• Sepal_Length is an ordinary numeric vector
• Running Sepal_Length returns all its elements
• Species is a factor
• A factor is a special type of vector that contains categorical data (see the red “box” below):
• Frequently, it is important to know the levels when you work on complex statistical modeling in R
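The levels of a factor can be inspected directly, for example:

```r
data("iris")
Species <- iris$Species

head(Species)      # printing a factor also shows a "Levels:" line
levels(Species)    # the distinct categories
nlevels(Species)   # how many categories there are
```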
R: Changing object types
• R can convert objects from one type to another:
• For example:
• The functions listed above are well described in the R documentation
• Check the documentation when you need to use any of these functions
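The original list of conversion functions is not reproduced here. As a sketch, a few base R conversion functions (as.character(), as.integer(), as.numeric(), as.factor()) behave as follows; the input vectors are made up.

```r
data("iris")

species_chr <- as.character(iris$Species)   # factor -> character
species_int <- as.integer(iris$Species)     # factor -> underlying level codes
num_from_chr <- as.numeric(c("1.5", "2", "3.25"))   # character -> numeric
size_fac <- as.factor(c("low", "high", "low"))      # character -> factor

class(species_chr)
class(species_int)
class(num_from_chr)
levels(size_fac)   # levels are sorted alphabetically
```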
R: Data Summaries
• Summaries can be computed with functions that work with vectors and matrices:
• You can also apply the summary() function to the entire data frame:
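A sketch with a few common summary functions from base R, followed by summary() on the whole data frame. The choice of functions is an assumption; the slide's own examples are not reproduced here.

```r
data("iris")

mean(iris$Sepal.Length)
median(iris$Sepal.Length)
sd(iris$Sepal.Length)
range(iris$Sepal.Length)   # minimum and maximum

col_means <- colMeans(iris[, 1:4])   # means of the four numeric columns
col_means

summary(iris)   # per-column summary; factors are summarized by counts
```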
R: Data Summaries
• Another useful function is table()
• It groups the data by value and shows the frequency of each distinct value:
• The str() function summarizes the structure of a dataset
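Both functions applied to iris:

```r
data("iris")

species_counts <- table(iris$Species)   # frequency of each species
species_counts

str(iris)   # compact overview: dimensions, column types, first values
```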
R: Data Summaries
• To get a more detailed summary of data frames, you can load the Hmisc package.
• You may have to install it first:
R: Data Summaries
• After installing the Hmisc package, we load it into the R session and run the describe() function (from Hmisc) for a detailed summary (including quantiles):
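A guarded sketch of these steps. The requireNamespace() check is an addition so that the example does not fail on machines where Hmisc is not installed; the installation line is commented out because it needs an Internet connection.

```r
# install.packages("Hmisc")   # run once if the package is not installed

data("iris")
hmisc_available <- requireNamespace("Hmisc", quietly = TRUE)

if (hmisc_available) {
  library(Hmisc)
  print(describe(iris))   # detailed per-variable summary
} else {
  message("Hmisc is not installed; run install.packages(\"Hmisc\") first.")
}
```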