Reading in data locally and from the web
Reading data is the gateway for any data analysis.
Data can be read from a local device or from the web.
In R, “reading” or “loading” is the process of converting data (stored as plain text, a
database, HTML, etc.) into an object (e.g., a data frame) that R can work with.
There are many ways to store data, and just as many ways to read it.
Different functions are available in R to import data from various file formats.
When loading a data set into R, we need to tell R where the file lives. The file could live
on your computer (local) or somewhere on the internet (remote).
The place where the file lives on your computer is called the “path.”
There are two kinds of paths: relative paths and absolute paths.
A relative path gives the file’s location with respect to where we currently are on the computer (the working directory).
An absolute path gives the file’s location with respect to the root (base) folder of the computer’s file system.
As per the figure,
o We are working in a file named worksheet_02.ipynb .
o If we want to read the .csv file named happiness_report.csv into R, we could do this
using either a relative or an absolute path.
Reading happiness_report.csv using a relative path:
happy_data <- read_csv("data/happiness_report.csv")
Reading happiness_report.csv using an absolute path:
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
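The difference between the two kinds of path can be sketched as follows. This is a minimal illustration, not the worksheet's actual setup: the folder worksheet_demo and the file contents are made up, and base R's read.csv() is used instead of read_csv() only to keep the sketch free of package dependencies.

```r
# Build a small folder layout in a temporary directory (hypothetical names).
project_dir <- file.path(tempdir(), "worksheet_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("country,score", "A,7.1", "B,6.4"),
           file.path(project_dir, "data", "happiness_report.csv"))

# Make the project folder the working directory, remembering the old one.
old_wd <- setwd(project_dir)

# Relative path: resolved against the current working directory.
relative_path <- file.path("data", "happiness_report.csv")

# Absolute path: the full location, starting at the file system root.
absolute_path <- normalizePath(relative_path)

# Both paths point at the same file, so both reads give the same data.
from_relative <- read.csv(relative_path)
from_absolute <- read.csv(absolute_path)
same_data <- identical(from_relative, from_absolute)

# Restore the original working directory.
setwd(old_wd)
```

A relative path stops working if the working directory changes, while an absolute path stops working if the project is moved to another computer. Relative paths are therefore usually the more portable choice inside a project folder.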
In case of remote files, a Uniform Resource Locator (URL) (web address) indicates the
location of a file/resource.
Reading tabular data from a plain text file into R
read_csv() to read in comma-separated files (csv file)
data <- read_csv("data/xyz.csv")
Here the file is named “xyz.csv” and is stored in the “data” folder.
read_tsv() to read in tab-separated files (tsv file)
data <- read_tsv("data/xyz.tsv")
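As a rough, self-contained sketch of these functions, the example below writes two small files to a temporary location and reads them back. The file contents are made up, and the show_col_types argument assumes readr 2.0 or later.

```r
library(readr)

# Write two small example files to a temporary location (made-up data).
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "a,1", "b,2"), csv_path)
writeLines(c("name\tscore", "a\t1", "b\t2"), tsv_path)

# read_csv() expects commas; read_tsv() expects tabs.
csv_data <- read_csv(csv_path, show_col_types = FALSE)
tsv_data <- read_tsv(tsv_path, show_col_types = FALSE)

# read_delim() takes the delimiter as an argument, so it can handle tabs,
# commas, and other separators such as ";".
delim_data <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)
```

All three calls return a data frame (a tibble) with the same two columns, name and score.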
Reading tabular data directly from a URL
The read_csv(), read_tsv(), and read_delim() functions can also read data directly from
a Uniform Resource Locator (URL) that points to tabular data.
url <- "https://xxx.com/data/xyz.csv"
data <- read_csv(url)
Reading tabular data from a Microsoft Excel file
read_excel() from the readxl package reads Excel spreadsheets (.xls and .xlsx files).
data <- read_excel("data/xyz.xlsx")
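An Excel workbook can hold several sheets, so it often helps to list them before reading. The sketch below assumes the readxl package is installed and uses the example workbook datasets.xlsx that ships with it, rather than a file from these notes.

```r
library(readxl)

# readxl ships with example workbooks; readxl_example() returns the path to one.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook before choosing one to read.
sheet_names <- excel_sheets(xlsx_path)

# Read a single sheet by name; by default read_excel() reads the first sheet.
mtcars_data <- read_excel(xlsx_path, sheet = "mtcars")
```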
Reading data from a database
A relational database is a common form of data storage for large data sets or for projects
with multiple users.
There are many relational database management systems, such as SQLite, MySQL,
PostgreSQL, Oracle and many more.
Reading data from a SQLite database
o A SQLite database is self-contained and is usually stored and accessed locally.
o Data is usually stored in a file with a .db extension.
o To read data into R from a database, we first need to connect to the database.
o The dbConnect() function from the DBI (database interface) package is used to
connect to the database.
library(DBI)
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/xyz.db")
o Relational databases may have many tables. In order to retrieve data from a
database, we need to know the name of the table in which the data is stored.
o We can get the names of all the tables in the database using
the dbListTables function:
tables <- dbListTables(conn_lang_data)
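The whole workflow (connect, list tables, read, disconnect) can be sketched as below. This is a minimal illustration that assumes the DBI and RSQLite packages are installed. It uses a temporary in-memory database and a made-up table called scores instead of a .db file, and the name conn is chosen only for this example.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database
# (passing a path to a .db file works the same way).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a small table so there is something to read (made-up data).
dbWriteTable(conn, "scores",
             data.frame(name = c("a", "b", "c"), score = c(1, 2, 3)))

# List the tables, then read a whole table or the result of an SQL query.
tables <- dbListTables(conn)
all_scores <- dbReadTable(conn, "scores")
high_scores <- dbGetQuery(conn, "SELECT name, score FROM scores WHERE score >= 2")

# Close the connection when finished.
dbDisconnect(conn)
```

Reading only the result of a query, rather than the whole table, keeps large tables from being loaded into memory all at once.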
Obtaining data from the web using an API
Plain text data, such as comma- or tab-separated files, can be read from a web URL using
one of the read_* functions from the tidyverse.
Many websites now offer an Application Programming Interface (API), which provides a
programmatic way to access their data.
This allows the website owner to control who has access to the data, what portion of the
data they have access to, and how much data they can access.
We can also collect data programmatically from a web page itself, in the form of Hypertext
Markup Language (HTML) and Cascading Style Sheets (CSS) code, and process it to extract
useful information (web scraping).
HTML provides the basic structure of a site and CSS helps style the content.
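Extracting information from HTML with CSS selectors might look like the sketch below. It assumes the rvest package (version 1.0 or later) is installed, and it parses a made-up HTML string rather than a real website. The class names title and price are invented for this example.

```r
library(rvest)

# A tiny HTML page as a string (made-up content), standing in for a downloaded page.
page <- read_html('<html><body>
    <h1 class="title">Listings</h1>
    <span class="price">$800</span>
    <span class="price">$950</span>
  </body></html>')

# CSS selectors pick out elements: ".price" matches every element with class "price".
price_nodes <- html_elements(page, ".price")

# Extract the text of each element and convert it to numbers.
price_text <- html_text2(price_nodes)
prices <- as.numeric(gsub("[$,]", "", price_text))
```

For a real page, read_html() would be given the page's URL instead of a string, subject to the website's terms of use.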
What is Tidy Data?
In a data science project, tidying the data is a necessary step after importing it and
before analyzing it and communicating results.
Tidy datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
o Structure is the form and shape of the data. In statistics, most datasets are rectangular
data tables (data frames) made up of rows and columns.
o Semantics is the meaning of the dataset. A dataset is a collection of values,
either quantitative or qualitative. Every value belongs to both a variable and an
observation.
Variables — all values that measure the same underlying attribute across units
Observations — all values measured on the same unit across attributes
o The 3 rules of tidy data help simplify the concept and make it more intuitive.
Each variable is a column
Each observation is a row
Each type of observational unit is a table
Messy Data
Messy data is any kind of data that does not follow the above framework.
To narrow it down, Hadley Wickham’s “Tidy Data” paper lists 5 common problems of messy data:
o Column headers are values, not variable names.
o Multiple variables are stored in one column.
o Variables are stored in both rows and columns.
o Multiple types of observational units are stored in the same table.
o A single observational unit is stored in multiple tables.
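Two of these problems can be illustrated with a short sketch. It assumes the tidyr package is installed; the data frames and column names are made up for this example, and pivot_longer() and separate() are one common way to do this, not the only one.

```r
library(tidyr)

# Messy: the column headers "2019" and "2020" are values of a variable (year).
messy <- data.frame(
  country = c("A", "B"),
  `2019` = c(7.1, 6.4),
  `2020` = c(7.3, 6.2),
  check.names = FALSE
)

# pivot_longer() moves the headers into a "year" column, one row per observation.
tidy <- pivot_longer(messy, cols = c("2019", "2020"),
                     names_to = "year", values_to = "score")

# Messy: two variables (city and region) are stored in one column.
combined <- data.frame(place = c("Vancouver_BC", "Toronto_ON"))

# separate() splits the column into one column per variable.
split_up <- separate(combined, place, into = c("city", "region"), sep = "_")
```

After tidying, each variable (country, year, score, city, region) has its own column and each row holds one observation.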
Why is Tidy Data important?
If the data set follows a standardized framework, we spend less time on data cleaning and
wrangling and more time answering the question at hand.
It is a good practice to have the data in a format which makes it reproducible and easy for
others to understand.
Another, more technical, reason is that tidy data fits naturally with the tools available
in R. Since R works with vectors of values (many R functions are vectorized) and each column
of a data frame is a vector, we are able to apply these tools directly to tidy data.
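A small sketch of what this means in practice, using base R only and a made-up data frame (the column names are chosen for this example):

```r
# A small tidy data frame: one column per variable, one row per observation
# (made-up data).
people <- data.frame(
  name = c("a", "b", "c"),
  height_m = c(1.60, 1.75, 1.80),
  weight_kg = c(60, 70, 90)
)

# Arithmetic on columns is vectorized: it is applied to every row at once.
people$bmi <- people$weight_kg / people$height_m^2

# Summary functions take a whole column (a vector) as input.
mean_height <- mean(people$height_m)
```

Because each variable is already a single column, no loops or reshaping are needed before applying such functions.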