Al. I. Cuza University of Iași
Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics
Data Analysis & Data Science with R
Data Input and Output (Import and Export)
By Marin Fotache
Scripts associated with this presentation
Scripts
03a_basics_of_data_input_output.R: http://1drv.ms/1DRMOTC
03b_intermediate_data_input_output: http://1drv.ms/1JKeKwi
PostgreSQL scripts (for creating the DB to be imported in R, see script 03a...)
00a_1_creating_tables__sales.sql: http://1drv.ms/1JKeRbh
00a_2_populating_tables__sales.sql: http://1drv.ms/1JKeYU0
01_creare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKf522
02_populare_bd_vinzari_PostgreSQL.sql: http://1drv.ms/1JKfem5
Scripts associated with this presentation (cont.)
Oracle scripts (for creating the DB to be imported in R, see script 03b...)
01-01a_creating_tables__sales.sql: http://1drv.ms/1LBNImg
01-01a_ro_creare_bd_vinzari.sql: http://1drv.ms/1AqQTvg
01-01b_populating_tables__sales.sql: http://1drv.ms/1LBNGLe
01-01b_ro_populare_bd_vinzari.sql: http://1drv.ms/1A5a60K
Web sites with R tutorials for data input/output
R Data Import/Export
http://cran.r-project.org/doc/manuals/r-release/R-data.html
Beginner's guide to R: Get your data into R
http://www.computerworld.com/article/2497164/business-intelligence/beginner-s-guide-to-r-get-your-data-into-r.html
Reading/Writing Data: Part 1
https://www.youtube.com/watch?v=aBzAels6jPk&index=9&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ
Web sites with R tutorials for data input/output (cont.)
Importing Data Into R from Different Sources
http://www.r-bloggers.com/importing-data-into-r-from-different-sources/
Data Import & Export in R
http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/index.cfm
Reading data from the new version of Google Spreadsheets
http://blog.revolutionanalytics.com/2014/06/reading-data-from-the-new-version-of-g
Loading data into statistical packages
Traditional solutions:
Direct import from external data files (Excel, CSV, text files etc.) using the package's menus
Save intermediate results from the data sources into common-format files (XML, CSV, JSON) and then import these intermediate files into the package;
Create data sources using ODBC or JDBC
Some more recent options:
Customized ETL procedures (tailored to the data source and the destination package)
Connecting to special APIs or web/data services which provide data sets in formats that are easy to import (e.g. Google Analytics)
Import data from web server logs into NoSQL data stores
Querying a database server directly from the statistical package.
Sources of Data in R (adapted from [Kabacoff, 2011])
[Diagram: data sources that can be loaded into R, including Hadoop and NoSQL data stores]
Loading data sets stored within packages
See previous presentation
Many packages (such as ggplot2) include datasets; after a package is loaded, all of its datasets are available:
> library(ggplot2)
> str(diamonds)
'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
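To see which datasets a package bundles, or to load a single one without attaching the whole package, one can use the data() function (standard base R behaviour, not shown on the original slide):
> # list the datasets shipped with ggplot2 (assumes the package is installed)
> data(package = "ggplot2")
> # load only the diamonds dataset, without library(ggplot2)
> data(diamonds, package = "ggplot2")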
Loading data objects saved in the workspace of a previous session
Save the workspace associated with the current session (the workspace contains all of the existing data objects at a given point in time):
> ws.name <- paste("work", Sys.Date(),
+    ".RData", sep="")
> save.image(file=ws.name)
Restore (load) a previously saved workspace (and all the data objects in the workspace):
> load("work2014-09-12.RData")
When more workspaces have been saved, one can choose which one to load:
> load(file.choose())
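When only a single object needs to be persisted (rather than the whole workspace), base R also offers saveRDS()/readRDS() — a minimal sketch, not part of the original slide:
> # save one object (e.g. the diamonds data frame from the previous slide)
> saveRDS(diamonds, file = "diamonds.rds")
> # read it back, possibly under a different name
> diamonds.restored <- readRDS("diamonds.rds")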
Data entered from the keyboard (1)
The simplest method of data entry: function edit() launches a text editor that allows entering your data manually
If the data frame already exists:
> student_gi <- edit(student_gi)
or
> fix(student_gi)
Data entered from the keyboard (2)
If the data frame does not exist, follow two steps:
1. Create an empty data frame (or matrix) with the variable names and types you want to have in the final dataset.
> mydata <- data.frame(age=numeric(0),
+    gender=character(0),
+    weight=numeric(0))
2. Invoke the text editor on this data object, enter data, and save the results back to the data object.
> mydata <- edit(mydata)
or
> fix(mydata)
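For very small datasets, an alternative to the interactive editor (not shown on the slide) is to type the values directly into data.frame(); the three rows below are purely illustrative:
> # a hypothetical three-row dataset entered entirely in code
> mydata <- data.frame(age    = c(25, 31, 28),
+                      gender = c("F", "M", "F"),
+                      weight = c(58, 82, 64))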
Data entered from clipboard
One can copy to the clipboard small sections of data in a table (e.g. a spreadsheet, a Web HTML table) using Ctrl-C (the copy command)
On Windows, command read.table handles clipboard data with a header row that is separated by tabs, and stores the data in a data frame (x):
> x <- read.table(file = "clipboard", sep="\t",
+    header=TRUE)
On Mac OS, the pipe("pbpaste") function is the equivalent:
copy without header
> x <- read.table(pipe("pbpaste"), sep="\t",
+    header=FALSE)
copy with header
> y <- read.table(pipe("pbpaste"), sep="\t",
+    header=TRUE)
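The reverse direction also works; a short sketch (not on the original slide) that writes a data frame back to the Windows clipboard so it can be pasted into a spreadsheet:
> # on Windows, "clipboard" can also be used as an output file
> write.table(x, file = "clipboard", sep = "\t",
+    row.names = FALSE)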
Import from local CSV/delimited text files
read.table() reads a file in table format and saves it as a data frame
> mydataframe <- read.table(file,
+    header=logical_value,
+    sep="delimiter", row.names="name")
file is a delimited ASCII file
header is a logical value indicating whether the first row contains variable names (TRUE or FALSE)
sep specifies the delimiter separating the data values
row.names is an optional parameter specifying one or more variables to represent row identifiers.
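read.csv() and read.delim() are thin wrappers around read.table() with defaults preset for comma-separated and tab-separated files; a small illustration (the file names are hypothetical):
> # comma-separated values, header=TRUE by default
> df1 <- read.csv("some_file.csv")
> # tab-separated values, header=TRUE by default
> df2 <- read.delim("some_file.txt")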
Data input from local delimited text files
Data frame births2006 is located in directory DataSets/births2006
The delimiter in the source file is Tab (\t)
The name of the file to be imported is births2006.txt
Qualifying the subdirectory differs slightly from one operating system to another
> switch(Sys.info()[['sysname']],
+   Windows = {births2006 <- read.table(
+     "births2006\\births2006.txt",
+     fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+   Linux = {births2006 <- read.table(
+     "births2006/births2006.txt",
+     fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+   Darwin = {births2006 <- read.table(
+     "births2006/births2006.txt",
+     fileEncoding = "UTF-8", header = TRUE, sep="\t")} )
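As an alternative to switching on the operating system (a sketch, not taken from the slides), file.path() builds a path with the separator appropriate for the current platform, so a single call works on Windows, Linux and Mac OS:
> # file.path() inserts the platform-appropriate separator
> births2006 <- read.table(
+    file.path("births2006", "births2006.txt"),
+    fileEncoding = "UTF-8", header = TRUE, sep="\t")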
Data input from local delimited text files (cont.)
When we are not sure about the file name, we can use function file.choose() instead of a literal file name:
> births2006.2 <- read.table(file.choose(),
+    fileEncoding = "UTF-8",
+    header = TRUE,
+    sep="\t")
Data input from local CSV files
Import one dataset from Irina Dan's Ph.D. thesis concerning a study of e-document use in companies
Data source is a delimited file - companyinfo.csv (values are separated by semicolons)
Current working directory is .../DataSets
File to be imported is located in directory .../DataSets/IrinaDan
> switch(Sys.info()[['sysname']],
+   Windows = {comp <- read.table(
+     "IrinaDan\\companyinfo.csv",
+     header=TRUE, sep=";", stringsAsFactors=FALSE)},
+   Linux = {comp <- read.table(
+     "IrinaDan/companyinfo.csv",
+     header=TRUE, sep=";", stringsAsFactors=FALSE)},
+   Darwin = {comp <- read.table(
+     "IrinaDan/companyinfo.csv",
+     header=TRUE, sep=";", stringsAsFactors=FALSE)} )
Data input from local CSV files (cont.)
Import a dataset from Dragos Cogean's Ph.D. thesis, which compares two cloud database services, MongoDB and MySQL
Data sources are tab delimited files
The .txt (tab-delimited) file resides in directory \DataSets\DragosCogean
Notice the second variant of using function switch()
> InsertMongoALL <- switch(Sys.info()[['sysname']],
+   Windows = { read.table(
+     "DragosCogean\\InsertMongo_ALL.txt",
+     fileEncoding = "UTF-8", header=TRUE,
+     sep="\t", stringsAsFactors=FALSE)},
+   Darwin = { read.table(
+     "DragosCogean/InsertMongo_ALL.txt",
+     fileEncoding = "UTF-8", header=TRUE,
+     sep="\t", stringsAsFactors=FALSE)} )
Data input from local CSV files (cont.)
Import the Toyota Corolla second-hand cars data set (located in the ...\DataSets\ToyotaCorolla directory)
Also notice the third variant of using function switch()
> ToyotaCorolla <- read.table(
+    switch(Sys.info()[['sysname']],
+      Windows = {"ToyotaCorolla\\ToyotaCorolla.csv"},
+      Darwin  = {"ToyotaCorolla/ToyotaCorolla.csv"}),
+    fileEncoding = "UTF-8", header = TRUE,
+    sep=",")
Data input from text file available on web
Heart attack data set
Description available at:
http://courses.statistics.com/software/R/tables4R.htm
The data set (as delimited text file) available at:
http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt
> heart.att <- read.table(
+    "http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt",
+    header=TRUE)
> head(heart.att)
  Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
1       1     41041   F 122    0 4752.00  10  79
2       2     41041   F 122    0 3941.00   6  34
3       3     41091   F 122    0 3657.00   5  76
4       4     41081   F 122    0 1481.00   2  80
5       5     41091   M 122    0 1681.00   1  55
Data input from CSV file available on web
Smoking data set: 356 people polled on their smoking status (Smoke) and their socioeconomic status (SES).
The data file contains only two columns, and when read, R interprets them both as factors:
> smoker <- read.csv(
+    "http://www.cyclismo.org/tutorial/R/_static/smoker.csv")
> head(smoker)
   Smoke  SES
1 former High
2 former High
3 former High
4 former High
5 former High
6 former High
Download and read
When a data set is large, instead of the direct import...
> dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
... one can proceed in two steps:
1. download the file
> download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data",
+    destfile="data.csv")
trying URL 'http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
Content type 'text/plain; charset=UTF-8' length 402355 bytes (392 Kb)
opened URL
==================================================
downloaded 392 Kb
2. import the downloaded file
> df.2 <- read.csv("data.csv")
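A variation on the two-step approach (a sketch, not from the slides): download into a temporary file so the working directory is not cluttered:
> # download to a temporary file, then read it
> tmp <- tempfile(fileext = ".csv")
> download.file("http://www.ats.ucla.edu/stat/data/hsb2.csv",
+    destfile = tmp)
> hsb2 <- read.csv(tmp)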
Importing data from Excel files
The "simplest" way to read an Excel (.xls/.xlsx) file is to save it in Excel as a text (tab delimited) or CSV file and then to read it as in the previous slides
Loading .xls/.xlsx files directly into R is possible through various packages:
RODBC
gdata
xlsReadWrite
XLConnect
xlsx
Problems (on Windows systems) when loading some packages
> install.packages("xlsx")
> library(xlsx)
Loading required package: rJava
Loading required package: xlsxjars
xlsx requires package rJava; on Windows systems that sometimes creates problems (e.g. 64-bit R on Windows 7)
On my computer (Windows 7, 64-bit), the downloaded Java runtime is in directory C:\\Program Files\\Java\\jre7\\, so before loading package rJava (or any other package which uses rJava) I need to run:
> options(java.home="C:\\Program Files\\Java\\jre7\\")
Reading data from Excel files
We'll prefer package xlsx:
> library(xlsx)
The ADL master students (2013-2014) file is located in directory .../DataSets/ADL
Path is specified differently for Windows and Mac OS systems:
> workbook <- switch(Sys.info()[['sysname']],
+    Windows = { "ADL\\ADL2013_Studenti.xlsx"},
+    Darwin  = { "ADL/ADL2013_Studenti.xlsx"})
> adl2013_stud <- read.xlsx(workbook, 1)
> str(adl2013_stud)
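A more recent alternative that avoids the rJava dependency altogether is the readxl package — a minimal sketch, assuming the same workbook path as above (readxl is not mentioned in the original slides):
> # read the first sheet of the workbook without needing Java
> library(readxl)
> adl2013_stud2 <- read_excel(workbook, sheet = 1)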
Import data from local PostgreSQL databases
The PostgreSQL database server is installed locally, on my laptop (on Windows systems, check whether the PostgreSQL service is started)
Package needed: RPostgreSQL
> install.packages("RPostgreSQL")
> library(RPostgreSQL)
Load the PostgreSQL driver
> drv <- dbDriver("PostgreSQL")
Open a connection
On Windows systems:
> con <- dbConnect(drv, dbname="bd2014", user="bd2014",
+    password="bd2014")
On Mac OS:
> con <- dbConnect(drv, port=5433, dbname="sales2014",
+    user="sales2014", password="sales2014")
Import data from local PostgreSQL databases (cont.)
Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:
> invoice_detailed <- dbGetQuery(con,
+  "SELECT i.invoiceNo, invoiceDate, i.customerId,
+     customerName, place, countyName, region,
+     comments, invoiceRowNumber, i_d.productId,
+     productName, unitOfMeasurement, category,
+     quantity, unitPrice, quantity * unitPrice AS amountWithoutVAT,
+     quantity * unitPrice * (1 + VATPercent) AS amount
+   FROM invoices i
+     INNER JOIN invoice_details i_d ON i.invoiceNo = i_d.invoiceNo
+     INNER JOIN products p ON i_d.productId = p.productId
+     INNER JOIN customers c ON i.customerid = c.customerid
+     INNER JOIN postcodes pc ON c.postCode = pc.postCode
+     INNER JOIN counties ON pc.countyCode = counties.countyCode
+   ORDER BY i.invoiceNo, invoiceRowNumber")
Import data from local PostgreSQL databases (cont.)
Examine the first rows of the resulting data frame invoice_detailed:
> head(invoice_detailed, 3)
  invoiceno invoicedate customerid customername place countyname  region comments
1      1111  2012-08-01       1001 Client 1 SRL  Iasi       Iasi Moldova     <NA>
2      1111  2012-08-01       1001 Client 1 SRL  Iasi       Iasi Moldova     <NA>
3      1111  2012-08-01       1001 Client 1 SRL  Iasi       Iasi Moldova     <NA>
  productname unitofmeasurement   category quantity unitprice amountwithoutvat amount
1   Product 1            b500ml Category A       50      1000            50000  62000
2   Product 2                kg Category B       75      1050            78750  88200
3   Product 5              unit Category A       50      7060           353000 437720
(the invoicerownumber and productid columns are omitted here)
Saving the data frame(s)
Data frame(s) will be saved (for further use) in directory .../DataSets/sales
Path qualification is different between Windows and Mac systems:
> file.name <- switch(Sys.info()[['sysname']],
+    Windows = { "sales\\invoice_detailed.RData"},
+    Darwin  = { "sales/invoice_detailed.RData"})
> save(invoice_detailed, file = file.name)
After saving, whenever needed, the data frame can be loaded into an RStudio session with the load() function
Close connections/drivers
After the import, the resources must be freed
Close all PostgreSQL connections:
for (connection in dbListConnections(drv)) {
  dbDisconnect(connection)
}
Free all the resources on the driver:
> dbUnloadDriver(drv)
Import data from a remote PostgreSQL database (!!!)
...for the moment it is impossible to access the database servers externally (i.e. from outside FEAA)
Access Oracle databases through JDBC
Package ROracle was intended to provide access to Oracle databases
Unfortunately, package ROracle is currently not available
The next example was inspired by
http://www.r-bloggers.com/connecting-r-to-an-oracle-database-with-rjdbc/
As the name suggests, the solution requires dealing with some Java "things"
Requirements: JDK/JRE previously installed
Download the ojdbc jar from www.oracle.com (in my case, ojdbc6.jar)
Set JAVA_HOME, set the maximum memory, and load the rJava library
Sys.setenv(JAVA_HOME='/path/to/java_home')
on my Mac OS:
> Sys.setenv(JAVA_HOME='/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home')
> options(java.parameters="-Xmx2g")
> install.packages("rJava")
Access Oracle databases through JDBC (cont.)
Getting some information
Java version:
> .jinit()
> print(.jcall("java/lang/System", "S", "getProperty",
+    "java.version"))
classPath (just for the record):
> .jclassPath()
Load the RJDBC package
> install.packages("RJDBC")
> library(RJDBC)
Create the connection driver
> jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver",
+    classPath="/Users/admin/Downloads/ojdbc6.jar")
Access Oracle databases through JDBC (cont.)
Open the connection
> jdbcConnection <- dbConnect(jdbcDriver,
+    "jdbc:oracle:thin:@//10.10.0.7:1521/orcl",
+    "bd2", "bd2")
Launch the Oracle query and store the result into the data frame st
> st <- dbGetQuery(jdbcConnection,
+    "SELECT * FROM studenti")
Close the connection
> dbDisconnect(jdbcConnection)
Import data from MongoDB
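The slides do not include code for this; a minimal sketch using the mongolite package (the package choice, connection URL, database and collection names are all assumptions):
> library(mongolite)
> # connect to a hypothetical collection on a local MongoDB server
> m <- mongo(collection = "inserts", db = "testdb",
+    url = "mongodb://localhost")
> # pull every document into a data frame
> insert_mongo <- m$find('{}')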
Import data from Cassandra
Import data from Hadoop
Read HTML tables from the web
Package needed: XML
> install.packages("XML")
> library(XML)
> myURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> dfHTML <- readHTMLTable(myURL, which=1, header=FALSE,
+    stringsAsFactors = FALSE)
> head(dfHTML, 3)
             V1      V2        V3
1 Participant 1 Giant A Patriot Q
2 Participant 2 Giant B Patriot R
Import XML files
Package needed: XML
> library(XML)
Web address of the xml file:
> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"
Import:
> indata <- xmlToDataFrame(url)
> head(indata, 5)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
Reading HTML pages with multiple tables
Package needed: XML
> library(XML)
The web page contains text and a number of tables
> url.1 <- 'http://en.wikipedia.org/wiki/World_population'
> tbls.1 <- readHTMLTable(url.1)
> class(tbls.1)
[1] "list"
Display how many tables are on the page
> length(tbls.1)
[1] 28
Read only the 1st table on this page (which=1)
> tbl.1.1 <- readHTMLTable(url.1, which=1,
+    header=F, stringsAsFactors = FALSE)
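Since readHTMLTable() already returned all the tables as a list, the first table can also be taken from that list instead of re-reading the page (a small note, not on the original slide; column types may differ slightly because the earlier call used default arguments):
> # the same first table, extracted from the list returned earlier
> tbl.1.1b <- tbls.1[[1]]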
Importing data from other statistical packages
Package needed: foreign
> install.packages("foreign")
> library(foreign)
Read a Stata data file (.dta)
> states <- read.dta("states.dta")
Read a local SPSS file
> spss1 <- read.spss("p004.sav",
+    use.value.labels = TRUE, to.data.frame = TRUE)
Import the SPSS file directly from a web address
> spss2 <- read.spss(
+    "http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav",
+    use.value.labels = TRUE, to.data.frame = TRUE)
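A newer alternative to foreign is the haven package — a minimal sketch reading the same local SPSS file as above (the package choice is an assumption, not part of the slides):
> library(haven)
> # read_sav() returns a data frame (tibble) with value labels preserved
> spss1.h <- read_sav("p004.sav")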
Save/export R data objects
Save a data frame as a .csv file
> write.csv(spss2, file = "spss2.csv")
Save a data frame as a tab delimited text file
> write.table(spss2, file = "spss2.txt",
+    sep = "\t", fileEncoding = "UTF-8")
Save a data frame as an Excel (xlsx) file (requires package xlsx)
> write.xlsx(spss2, file = "spss2.xlsx",
+    sheetName="spss2")
> write.xlsx(echipe.4,
+    file = "Centralizator BD2_2013_SIA1.xlsx",
+    sheetName="t4.echipe",
+    row.names=FALSE, append=TRUE, showNA=FALSE)
Save/export R data objects (cont.)
Save a data frame as a .dta file (requires package foreign)
> write.dta(spss2, file = "spss2.dta")
Save to binary R format (can save multiple datasets and R objects)
> save(invoice.details.ro,
+    file = "invoice.details.ro.RData")
> save(states, spss2, dat.xls,
+    file = "temp.RData")