RELATIONAL DATABASES
PRESIDENCY UNIVERSITY
Introduction
What is a database? A database consists of one or more tables; each table is
like a data-frame in R, where each record is like a data-frame’s row, and each field
is like a column. Related tables are linked by the use of unique identifiers or keys
(.e.g., patient IDs).
• A field is a variable/quantity of interest
• A record is a collection of fields
• A table is a collection of records which all have the same fields (with different
values)
• A database is a collection of tables
Databases versus data frames.
R jargon Database jargon
column field
row record
data frame table
types of the columns table schema
collection of data frames database
Why do we need database software? A natural question arises that why should
we work with databases when we already have the knowledge of dataframes in R.
In fact we need to understand what additional advantages we gain over dataframes
when we work with data bases.
• Size – R keeps its data frames in memory. Industrial databases can be much
bigger which can cause memory issues. Moreover the entire data may not
be useful and we may choose to work with selected subsets
• Speed – One of the issues when we work with large datasets in R is the
speed of computation. Even very simple tasks of data management can
painstaking if we work with big datasets in R. Databases are designed to
get the desired output very fast.
• Concurrency – Many users accessing the same database simultaneously –
Lots of potential for trouble (two users want to change the same record at
once).
The client-server model and SQL:. Databases live on a server, which manages
them. A database server hosts one or more databases. The server controls access
to the databases, and handles the actual business of reading out information from
the tables efficiently (or writing information to them). Users interact with the
server through a client program. Database clients are programs which interact
with the server. (Think of the difference between the Web browser program on
1
RELATIONAL DATABASES 2
your computer or phone [a client], and the Web server program which actually
hosts the website.) . Lets multiple users access the same database simultaneously.
SQL (structured query language) is the standard for database software. Most
basic actions are queries, like row/column selections on an R data frame. SQLite
is a simpler, file-based system that we will be working with.
Connecting R to SQL :. SQL is its own language, independent of R. But we’re
going to learn how to run SQL queries through R. Database management systems
(DMBS) actually implement SQL language and do the work – MySQL, SQLite, etc..
Each database management system (DBMS) has a somewhat different protocol for
actually dealing with servers. Rather than trying to master all of them, the sensible
approach in R is to use the DBI package, which provides a single uniform interface,
no matter what the DBMS. The sub-programs which handle different DBMS’s are
called drivers. Need a separate “driver” for each DBMS.
First, we need to install the packages ‘DBI‘, ‘RSQLite‘, then we load them into
our R session. Here we’ve installed the driver for one particular DBMS called
SQLite.
install.packages("DBI", dependencies = TRUE) # Install DBI
install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite
R uses persistent objects called “connections” to keep track of which database
server is being used for what task.
So the procedure goes like so:
(1) Load the driver for the DBMS you’ll be dealing with.
(2) Establish a connection to the database server.
(3) Interact with the DBMS.
(4) Close the connection.
(5) Unload the driver.
library(RSQLite)
## Warning: package ’RSQLite’ was built under R version 4.3.3
drv <- dbDriver('SQLite')
con <- dbConnect(drv, dbname="lahman2016.sqlite")
Load Driver, Establish Connection: con now is the name for the connection to the
SQLite database lahman2016.sqlite. We could also give dbConnect arguments host
(an Internet address), use (a user name) and password.
Interacting with a Database through DBI
Listing what’s available: The basic commands:
dbListTables(con) # Get tables in the database (returns vector)
[1] "AllstarFull" "Appearances" "AwardsManagers"
[4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost" "CollegePlaying"
[10] "Fielding" "FieldingOF" "FieldingOFsplit"
[13] "FieldingPost" "HallOfFame" "HomeGames"
RELATIONAL DATABASES 3
[16] "Managers" "ManagersHalf" "Master"
[19] "Parks" "Pitching" "PitchingPost"
[22] "Salaries" "Schools" "SeriesPost"
[25] "Teams" "TeamsFranchises" "TeamsHalf"
dbListFields(con, "Batting") # List fields in Batting table
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "G_batting" "AB" "R" "H" "2B" "3B"
[13] "HR" "RBI" "SB" "CS" "BB" "SO"
[19] "IBB" "HBP" "SH" "SF" "GIDP" "G_old"
dbListFields(con, "Pitching") # List fields in Pitching table
[1] "playerID" "yearID" "stint" "teamID" "lgID" "W"
[7] "L" "G" "GS" "CG" "SHO" "SV"
[13] "IPouts" "H" "ER" "HR" "BB" "SO"
[19] "BAOpp" "ERA" "IBB" "WP" "HBP" "BK"
[25] "BFP" "GF" "R" "SH" "SF" "GIDP"
batting = dbReadTable(con, "Batting")
class(batting)
[1] "data.frame"
dim(batting)
[1] 102816 24
Importing a table as a data frame: Now we could go on and perform R oper-
ations on batting, since it’s a data frame. We can use this route primarily to check
our work in SQL. But, in general, we should try to do as much in SQL as possible,
since it’s more efficient and can be simpler.
General form of SQL query: The really useful one:
dbGetQuery(con, statement)
The second argument is an SQL query, which gets issued to the database; it
returns a data frame
SELECT:. The main tool in the SQL language is SELECT, which allows you to
perform queries on a particular table in a database. It has the form:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional
Example. Suppose we want to pick out five columns from the table “Batting”, and
only look at the first 15 rows:
RELATIONAL DATABASES 4
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"LIMIT 15"))
playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aardsda01 2010 0 0 0
7 aardsda01 2012 0 0 0
8 aardsda01 2013 0 0 0
9 aardsda01 2015 1 0 0
10 aaronha01 1954 468 131 13
11 aaronha01 1955 602 189 27
12 aaronha01 1956 609 200 26
13 aaronha01 1957 615 198 44
14 aaronha01 1958 601 196 30
15 aaronha01 1959 629 223 39
To replicate this simple command on the imported data frame, we need to write
batting[1:15, c("playerID", "yearID", "AB", "H", "HR")]
playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aardsda01 2010 0 0 0
7 aardsda01 2012 0 0 0
8 aardsda01 2013 0 0 0
9 aardsda01 2015 1 0 0
10 aaronha01 1954 468 131 13
11 aaronha01 1955 602 189 27
12 aaronha01 1956 609 200 26
13 aaronha01 1957 615 198 44
14 aaronha01 1958 601 196 30
15 aaronha01 1959 629 223 39
ORDER BY :. We can use the ORDER BY option in SELECT to specify an
ordering for the rows. Default is ascending order; add DESC for descending.
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY HR DESC",
"LIMIT 10"))
playerID yearID AB H HR
1 bondsba01 2001 476 156 73
RELATIONAL DATABASES 5
2 mcgwima01 1998 509 152 70
3 sosasa01 1998 643 198 66
4 mcgwima01 1999 521 145 65
5 sosasa01 2001 577 189 64
6 sosasa01 1999 625 180 63
7 marisro01 1961 590 159 61
8 ruthba01 1927 540 192 60
9 ruthba01 1921 540 204 59
10 foxxji01 1932 585 213 58
SQL computations
SELECT, expanded: In the first line of SELECT, we can directly specify com-
putations that we want performed.
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;
Main tools for computations: MIN, MAX, COUNT, SUM, AVG
Example. Suppose we want to calculate the average number of homeruns, and
average number of hits:
dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
"FROM Batting"))
AVG(HR) AVG(H)
1 2.813599 37.13993
In order to replicate this simple command on an imported data frame, we need
to write:
mean(batting$HR, na.rm=TRUE)
[1] 2.813599
mean(batting$H, na.rm=TRUE)
[1] 37.13993
GROUP BY:. We can use the GROUP BY option in SELECT to define aggre-
gation groups.
dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
"FROM Batting",
"GROUP BY playerID",
"ORDER BY AVG(HR) DESC",
"LIMIT 10"))
playerID AVG(HR)
1 pujolal01 36.93750
2 bondsba01 34.63636
RELATIONAL DATABASES 6
3 mcgwima01 34.29412
4 kinerra01 33.54545
5 aaronha01 32.82609
6 bryankr01 32.50000
7 ruthba01 32.45455
8 sosasa01 32.05263
9 cabremi01 31.85714
10 belleal01 31.75000
(Note: the order of commands here matters; if we try switching the order of
GROUP BY and ORDER BY, we’ll get an error)
AS:. We can use AS in the first line of SELECT to rename computed columns.
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 10"))
yearID avgHR
1 1999 4.255581
2 1987 4.253817
3 2000 4.113439
4 2001 4.076176
5 2004 4.049777
6 1996 3.960096
7 1962 3.948684
8 2006 3.911402
9 1961 3.911175
10 2003 3.865627
WHERE:. We can use the WHERE option in SELECT to specify a subset of the
rows to use (pre-aggregation/pre-calculation).
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 10"))
yearID avgHR
1 1999 4.255581
2 2000 4.113439
3 2001 4.076176
4 2004 4.049777
5 1996 3.960096
6 2006 3.911402
7 2003 3.865627
8 2002 3.835481
9 1998 3.830560
RELATIONAL DATABASES 7
10 2016 3.782873
HAVING:. We can use the HAVING option in SELECT to specify a subset of
the rows to display (post-aggregation/post-calculation).
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"HAVING avgHR >= 4",
"ORDER BY avgHR DESC"))
yearID avgHR
1 1999 4.255581
2 2000 4.113439
3 2001 4.076176
4 2004 4.049777
SQL joins
SELECT, expanded: In the second line of SELECT, we can specify more than
one data table using JOIN
SELECT columns or computations
FROM tabA JOIN tabB USING(key)
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;
JOIN:.
• There are 4 types of joins:
– Inner join (or just join): retain just the rows each table that match
the condition
– Left outer join (or just left join): retain all rows in the first table, and
just the rows in the second table that match the condition
– Right outer join (or just right join): retain just the rows in the first
table that match the condition, and all rows in the second table
– Full outer join (or just full join): retain all rows in both tables
• Column values that cannot be filled in are assigned NA values
RELATIONAL DATABASES 8
Example. Suppose we want to figure out the average salaries of the players with
the top 10 highest homerun averages. Then we’d have to combine the two tables
below
dbGetQuery(con, paste("SELECT yearID, teamID, lgID, playerID, HR",
"FROM Batting",
"ORDER BY playerID",
"LIMIT 15"))
yearID teamID lgID playerID HR
1 2004 SFN NL aardsda01 0
2 2006 CHN NL aardsda01 0
3 2007 CHA AL aardsda01 0
4 2008 BOS AL aardsda01 0
5 2009 SEA AL aardsda01 0
6 2010 SEA AL aardsda01 0
7 2012 NYA AL aardsda01 0
8 2013 NYN NL aardsda01 0
9 2015 ATL NL aardsda01 0
10 1954 ML1 NL aaronha01 13
11 1955 ML1 NL aaronha01 27
12 1956 ML1 NL aaronha01 26
13 1957 ML1 NL aaronha01 44
14 1958 ML1 NL aaronha01 30
15 1959 ML1 NL aaronha01 39
dbGetQuery(con, paste("SELECT *",
"FROM Salaries",
"ORDER BY playerID",
"LIMIT 15"))
yearID teamID lgID playerID salary
1 2004 SFN NL aardsda01 300000
2 2007 CHA AL aardsda01 387500
3 2008 BOS AL aardsda01 403250
4 2009 SEA AL aardsda01 419000
RELATIONAL DATABASES 9
5 2010 SEA AL aardsda01 2750000
6 2011 SEA AL aardsda01 4500000
7 2012 NYA AL aardsda01 500000
8 1986 BAL AL aasedo01 600000
9 1987 BAL AL aasedo01 625000
10 1988 BAL AL aasedo01 675000
11 1989 NYN NL aasedo01 400000
12 2006 CIN NL abadan01 327000
13 2011 HOU NL abadfe01 418000
14 2012 HOU NL abadfe01 485000
15 2014 OAK AL abadfe01 525900
We can use a JOIN on the pair: yearID, playerID.
dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
"FROM Batting JOIN Salaries USING(yearID, playerID)",
"ORDER BY playerID",
"LIMIT 15"))
yearID playerID salary HR
1 2004 aardsda01 300000 0
2 2007 aardsda01 387500 0
3 2008 aardsda01 403250 0
4 2009 aardsda01 419000 0
5 2010 aardsda01 2750000 0
6 2012 aardsda01 500000 0
7 1986 aasedo01 600000 0
8 1987 aasedo01 625000 0
9 1988 aasedo01 675000 0
10 1989 aasedo01 400000 0
11 2006 abadan01 327000 0
12 2011 abadfe01 418000 0
13 2012 abadfe01 485000 0
14 2014 abadfe01 525900 0
15 2015 abadfe01 1087500 0
Note that here we’re missing 3 David Aardsma’s records (i.e., the JOIN discarded
3 records). We can replicate this using merge() on imported data frames:
batting = dbReadTable(con, "Batting")
salaries = dbReadTable(con, "Salaries")
merged = merge(x=batting, y=salaries, by.x=c("yearID","playerID"),
by.y=c("yearID","playerID"))
merged[order(merged$playerID)[1:15],
c("yearID", "playerID", "salary", "HR")]
yearID playerID salary HR
16701 2004 aardsda01 300000 0
19371 2007 aardsda01 387500 0
20270 2008 aardsda01 403250 0
21157 2009 aardsda01 419000 0
22037 2010 aardsda01 2750000 0
RELATIONAL DATABASES 10
23795 2012 aardsda01 500000 0
578 1986 aasedo01 600000 0
1353 1987 aasedo01 625000 0
2026 1988 aasedo01 675000 0
2733 1989 aasedo01 400000 0
18458 2006 abadan01 327000 0
22914 2011 abadfe01 418000 0
23796 2012 abadfe01 485000 0
25594 2014 abadfe01 525900 0
26476 2015 abadfe01 1087500 0
For demonstration purposes, we can use a LEFT JOIN on the pair: yearID,
playerID
dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
"FROM Batting LEFT JOIN Salaries USING(yearID, playerID)",
"ORDER BY playerID",
"LIMIT 15"))
yearID playerID salary HR
1 2004 aardsda01 300000 0
2 2006 aardsda01 NA 0
3 2007 aardsda01 387500 0
4 2008 aardsda01 403250 0
5 2009 aardsda01 419000 0
6 2010 aardsda01 2750000 0
7 2012 aardsda01 500000 0
8 2013 aardsda01 NA 0
9 2015 aardsda01 NA 0
10 1954 aaronha01 NA 13
11 1955 aaronha01 NA 27
12 1956 aaronha01 NA 26
13 1957 aaronha01 NA 44
14 1958 aaronha01 NA 30
15 1959 aaronha01 NA 39
Now we can see that we have all 9 of David Aardsma’s original records from the
Batting table (i.e., the LEFT JOIN kept them all, and just filled in an NA value
when it was missing his salary). Currently, RIGHT JOIN and FULL JOIN are not
implemented in the RSQLite package. Now, as to our original question (average
salaries of the players with the top 10 highest homerun averages):
dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
"FROM Batting JOIN Salaries USING(yearID, playerID)",
"GROUP BY playerID",
"ORDER BY Avg(HR) DESC",
"LIMIT 15"))
playerID AVG(HR) AVG(salary)
1 bryankr01 39.00000 652000
2 pujolal01 36.93750 12752527
3 bondsba01 34.63636 8556606
RELATIONAL DATABASES 11
4 mcgwima01 34.29412 4814021
5 arenano01 33.66667 2004167
6 howarry01 33.30000 15525500
7 troutmi01 33.25000 5919083
8 duvalad01 33.00000 510000
9 cartech02 32.75000 1919750
10 kingmda01 32.50000 908750
11 sosasa01 32.05263 6533474
12 cabremi01 31.85714 13457902
13 belleal01 31.75000 4996539
14 rodrial01 31.63636 18109830
15 delgaca01 31.33333 9753267