SIGMOD
[Extended Abstract]

ABSTRACT
Without any doubt, spreadsheets are the most commonly used applications for data management and analysis. Perhaps they are even among the most widely used computer applications of all kinds. However, the spreadsheet paradigm of computation still lacks sufficient theoretical analysis.
In this paper we consider the relationship of spreadsheets to database systems. We demonstrate that a spreadsheet can play the role of a relational database engine, without any use of macros or built-in programming languages, merely by using spreadsheet formulas. We achieve that by implementing all operators of relational algebra by means of spreadsheet functions. Given a definition of a database (say in SQL), it is possible to construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. From then on, when the user enters, alters or deletes data in the data worksheets, the formulas in query worksheets automatically compute the current results of the queries. Thus, the spreadsheet serves as data storage and executes SQL queries, and therefore acts as a relational database engine.
Syntactically and semantically, the paper is based on Microsoft Excel (TM) 2003 version, because so far there is no formal model of spreadsheets that might be used for that purpose. However, the presented constructions work in other spreadsheet systems, too.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - relational databases; H.4.1 [Information Systems Applications]: Office Automation - spreadsheets; K.8.1 [Personal Computing]: Application Packages - spreadsheets

General Terms
spreadsheets, relational databases

Keywords
spreadsheets, relational databases, relational algebra, SQL, end-user computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD 2010
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION
Spreadsheets are the end-user computing counterpart of databases and OLAP in enterprise-scale computing. They serve basically the same purpose, data management and analysis, but at the opposite extremes of the data quantity scale.
At the same time spreadsheets are extremely popular. Their users range from every(wo)men who manage their home budgets to business professionals and researchers who create and examine extremely sophisticated models and data. For example, the journal Science writes in its instructions for authors [4] as follows.

    In general, Science will accept the following nine categories of supporting online material:
    [. . . ]
    8. Databases – In certain cases, Science will consider linked database presentations more complex than a flat text file or table; these can include, for example, tables hyperlinked to public sequence, array, or protein databases, or collections of hypertext tables or Excel files linked to explanatory image files or tables. Such presentations may require special treatment, and should be discussed in advance with the online editor. Submission of databases such as those described above will generally only be appropriate when the data in question can not be accommodated by an established public repository such as Genbank or PDB.

In practice, Excel files are quite common as a form of supporting online material in Science. The same journal provides an example of a scientific controversy [6, 3] which finally turned out to be related to the design of a spreadsheet used for data analysis.
Despite that, and surprisingly enough, spreadsheets, and the spreadsheet paradigm in general, lack sufficient theoretical analysis. There is not even a formal model of spreadsheets that might serve as the basis of such analysis. There are only a few papers considering spreadsheets from the point of view of the functional programming paradigm [1, 2, 7, 9, 11], while
we think that spreadsheets constitute a paradigm by themselves. There is also a vast literature devoted to the practice of using spreadsheets, and even the European Spreadsheet Risks Interest Group EuSpRIG (http://www.eusprig.org) with its annual conference.
It seems therefore surprising that the computer science community has not studied this extremely popular, important and successful type of application any further.
In this paper we do not attempt to create a formal model of spreadsheets. Instead, we aim at providing strong evidence that spreadsheets are a very interesting type of software system and deserve more research. Specifically, we consider the relation of spreadsheets to database systems. It is a natural comparison, because spreadsheets indeed often play the role of small databases at the end-user level of computing.
We demonstrate that virtually any spreadsheet system is a relational database engine. We do so by implementing all operators of relational algebra using spreadsheet functions. For each query in SQL, we construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. As the user enters, alters or deletes data tuples in the data worksheets, the formulas in query worksheets automatically compute the current results of the queries. Thus, the spreadsheet serves as data storage, and executes SQL queries. It is therefore a relational database engine. Consequently, any specification of a database, written in SQL in the form of table and view definitions, can be compiled into a spreadsheet workbook which has exactly the same functionality as if the database were implemented in a classical RDBMS. Crucially, this is achieved without any use of macros written in an external programming language, like Visual Basic or the like. One might also consider our construction as an implementation of a relational database on a completely new type of (virtual) hardware.
As a model of spreadsheet syntax and semantics we take Microsoft Excel (TM) (the general reference is [8]), but our constructions work in other similar systems, like OpenOffice Calc, gnumeric or Google docs, too.

the following expressions are cell references in the R1C1 notation: RmCn, R[i]Cm, RmC[j], R[i]C[j], RCm, RC[i], RmC, R[i]C.
The number after ‘R’ refers to the row number and the number after ‘C’ to the column number. If the number is missing, it means “same row (column)” as the cell in which this expression is used. If the number is written in square brackets, it is a relative reference, and the cell to which this expression points is determined by adding the number in brackets to the row (column) number of the present cell. Numbers without brackets are absolute references and refer to a cell whose row (column) number is equal to that number. For example, R[−1]C7 denotes the cell in the row directly above the present one in column 7, while RC[3] denotes the cell in the same row as the present one and 3 columns to the right. If the R part or the C part is omitted altogether, the resulting expression denotes the whole column or row (respectively), e.g., C7 is the (whole) column number 7. For the purpose of data validation or for referencing cells in other worksheets, RC may also be used, and references the cell whose row and column numbers are equal to the address of the cell in which this expression is located. Ranges are generally composed of two cell references separated by a colon, and denote the rectangular area spanned by the two cells.
max stands for the maximal number of rows permitted in a worksheet. This number may be imposed by the spreadsheet system that is used, or by the user who decides to limit the quantity of data that can be stored, in exchange for better performance.

2.2 IF function
IF is a conditional function in spreadsheets. Its syntax is IF(condition,true_branch,false_branch). What makes it unusual is that its evaluation is lazy, i.e., after the condition is evaluated and yields either TRUE or FALSE, only one of the branches is evaluated. This makes IF very useful. It can be used to protect functions from being applied to arguments of wrong types, to trap errors, and, last but not least, to speed up execution of queries by avoiding lengthy computations in certain cases.
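The lazy behavior of IF described above can be imitated outside a spreadsheet. The following is a minimal Python sketch, not part of the paper's construction: the names lazy_if, expensive and guarded_square are hypothetical, and the branches are passed as zero-argument functions so that only the selected one is ever run.

```python
def lazy_if(condition, true_branch, false_branch):
    """Evaluate only the branch selected by the condition.

    Both branches are zero-argument callables, so the unselected
    branch is never executed (mirroring the lazy evaluation of IF).
    """
    return true_branch() if condition else false_branch()


calls = []  # records every invocation of the costly computation


def expensive(x):
    """Stand-in for a lengthy computation inside a formula."""
    calls.append(x)
    return x * x


def guarded_square(x):
    # Plays the role of IF(x="", "", expensive(x)):
    # the costly branch runs only for non-null input.
    return lazy_if(x == "", lambda: "", lambda: expensive(x))
```

In this sketch a null input never reaches the costly function, which is the same effect the text attributes to IF when it is used to skip lengthy computations or to protect functions from arguments of the wrong type.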
Consequently, the final result is the number of rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

Example 2.
=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5) is calculated as follows:
1. the first three steps of evaluation are the same as before;
2. the sequence of 0s and 1s from the previous item and the range R1C5:R5C5 are multiplied again coordinate-wise, which results in a sequence of five numbers;
3. again the sum of the above five numbers is returned.
Consequently, the final result is the sum of the values in C5, calculated over those located in rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

These two examples generalize to sum-multiplication of more than two or three arrays.
The behavior of SUMPRODUCT very much resembles the way array formulas are evaluated. In fact, the formulas
{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4))}
and
{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5)}
are exactly equivalent to our two examples.

3. ARCHITECTURE OF A DATABASE IMPLEMENTED IN A SPREADSHEET

3.1 Overview
In this paper, we disregard a number of minor issues arising in the practical implementation of database operations in a spreadsheet. First of all, there is the obvious limitation on the number and sizes of relations, views and their intermediate results, imposed by the maximal available number of worksheets, columns and rows in the spreadsheet system at hand. Next, the size of the data values (integers, strings, etc.) is also limited. The variety of data types in spreadsheets is also restricted when compared to database systems.
The overall architecture of a relational database implemented in a spreadsheet is as follows.
Given a specification of the database, an implementation of the database is created by an external program (which plays the role of a query compiler), in the form of an .xls, .xlsx, .ods, etc., file.
The whole resulting database is a workbook, consisting of one worksheet per data table and one worksheet per view in the database.
The data table worksheets are where the data is entered, updated and deleted. In the case of the (more theoretical in flavor) implementation of the relational algebra, the data table sheets do not contain any formulas and are simply the place to enter tuples into relations. In the case of the SQL implementation, the cells are equipped with data validation formulas, which perform data type verification and enforce PRIMARY KEY, FOREIGN KEY and other integrity constraints included in the CREATE TABLE statements.
The query (view) worksheets are not supposed to be edited by the user. They contain columns filled with formulas, which calculate the consecutive values of the result of the query. Besides the result columns of the query, the view worksheets can also contain a number of hidden columns, which calculate and store intermediate results emerging during query evaluation. It is important that the formulas are completely uniform in each column of the database workbook, and that they do not depend on the data which will be stored in the application. Initially all formulas compute the empty string "" value, representing unused space. When the user manually enters data into the tables, the automatic recomputation of the spreadsheet causes the results of queries to be computed and appear in the view worksheets.

4. THEORETICAL LEVEL: RELATIONAL ALGEBRA
We assume the semantics over a fixed domain of (the spreadsheet’s implementation of) integers, so that a relation is a set or multiset of tuples over the integers that are implemented in the spreadsheet software.

4.1 Compositionality
We assume the unnamed syntax for the relational algebra: relations and queries have columns, which are numbered and do not have any names. Sometimes we consider the expressions C1, C2, etc., as the names of the worksheet columns, as well as the names of the columns in relations.
The representation of a relation r of arity n is a group of n consecutive columns in a worksheet, whose rows contain the tuples in the relation. The rows in which there are no tuples of r are assumed to be filled with the empty string formula ="", evaluating to the empty string value "", which the user can replace by the new tuples of the relation. The empty string is never a component of a tuple in a relation or query. Therefore either all cells in a row contain the empty string, or none does. The rows of tables and queries evaluating to empty strings are called null rows henceforth.
The assumption that ="" formulas fill the empty rows of data tables is only for uniformity of presentation. A formula in a cell cannot evaluate to "empty cell" (because the formula occupies that cell anyway), only to the empty string. Therefore, if blank cells were used in empty rows, formulas expressing queries would have to be adapted to accept unused space in two different forms: empty cells in data tables, and empty strings in results of other queries. Moreover, blank cells are interpreted as 0 by many Excel functions, which makes formulas prepared for blank cells even more complicated.
The representation of a relational algebra query Q of arity m is a group of l + m consecutive columns in a worksheet. All its rows from 1 to max are filled with formulas (identical in all cells of each column), which calculate the tuples in Q. We assume that the formulas in the last m columns return either (a component of) a tuple in the result of Q, or the empty string value "". The additional l columns are also filled with identical formulas, which calculate intermediate results. A worksheet of this kind can be created by entering the formulas in the first row, and then filling them downward to fill the first max rows. This uniformity assumption means in particular that the formulas are completely independent of the data they will work on. However, we would like to stress that there is no reason to reject nonuniform implementations, should they appear to be more effective or permit expressing queries inexpressible in a uniform way.
In the following we will consider both set and bag (multiset) semantics of the relational algebra. In the first case, duplicate rows are not permitted in the relations and queries, in the latter they are permitted. However, even in the set semantics a spreadsheet representation of a relation may contain many null rows.
Furthermore, the representation may be loose if null rows are interspersed with the tuples, or standard if all the tuples come first, followed by the null rows.
Consequently, we have loose-set, loose-bag, standard-set and standard-bag semantics.
No matter which of the above semantics we have in mind, the result of the query appears exactly as if it were a table, and can be used as such. Now the only thing necessary to compose queries is to locate their implementations side by side in a single worksheet and change the input column numbers in the formulas computing the outermost query, to agree with the column numbers of the outputs of the argument queries (and then the output columns of the argument queries become the intermediate results columns of the composition).
Therefore, queries represented in this way are compositional.
Now it suffices to demonstrate that each of the following relational algebra operators from [5] can be implemented in a spreadsheet:

• Two operations peculiar to spreadsheets, absent in [5]:
  – Error trapping.
  – Standardization.
• Sorting.
• Duplicate removal δ r.
• Selection σ_θ r.
• Projection π_{i,j,...} r.
• Union r ∪ s.
• Difference r \ s.
• Cartesian product r × s.
• Grouping with aggregation γ_L r:
  – Grouping with SUM.
  – Grouping with COUNT.
  – Grouping with AVG.
  – Grouping with MAX and MIN.

Note that in the Google docs spreadsheet there are special built-in operators for sorting and duplicate removal. Sorting is of course present in Excel and other spreadsheet systems (duplicate removal is additionally present in Excel 2007), but cannot be invoked by a formula, and requires a sequence of clicks by the user. This cannot be accepted, as we want the queries to compute automatically.
Generally, the sets of functions present in spreadsheets are highly redundant, so the same computation can be achieved
Figure 2: The query SELECT lastname, AVG(income) FROM incomes GROUP BY lastname HAVING COUNT(*)>3, comput-
ing average family income, implemented in a spreadsheet. (Errors appearing in the worksheet are intended.)
in many different ways. In this theoretical section we choose by the formula =IF(ISERROR(F),"",F), any error produced
solutions which are common to most of (or even all) spread- by =F is replaced by the empty string, and otherwise the
sheet systems. This way we believe to consider the spread- value is the same as the value of =F.
sheet paradigm, even if its definition is not yet formulated in
the literature. 4.3.2 Standardization
4.2 Notation
This operation converts a relation from loose to standard
We use the following convention for presenting queries im-
form, moving null rows to the bottom. The relative order of
plemented in a spreadsheet:
non-null rows is preserved. We assume that columns C1 and
COLUMNS < =FORMULA
C2 contain the source data.
means that the =FORMULA is entered into the COLUMNS, which C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)
may be specified either to be a single column (e.g. C5 )
counts the non-null rows above the present row, including
or a range of a few columns (e.g. C5:C7), or a single cell
the present one. This number is the row number to which
(e.g. R1C5), and in each case belongs to the columns with
the present row should be relocated. Note that multiplica-
intermediate values. The formula
tion by 1 enforces boolean to integer conversion.
COLUMNS << =FORMULA
C4 < =MATCH(ROW(),R1C3:RmaxC3,0)
indicates that formulas located in COLUMNS calculate the out-
put of the query.
The function MATCH(ROW(),R1C3:RmaxC3,0) searches for
In all cases, we fill the first max rows of the indicated
the value of the number of the present row (computed by
columns.
ROW()) in C3 and returns the row number of the first exact
Sometimes the output columns are not specified, and then
match found. If no match is found (i.e., we are in a row
it is always indicated, that the output is computed by ap-
whose number is higher than the total number of non-null
plying another, already defined operation to some of the
rows), an error is returned.
columns with intermediate results. In any case, it is assumed
C5:C6 << =IF(ISERROR(RC4),"",
that the first max rows of the COLUMNS are filled with
INDEX(R1C[-4]:RmaxC[-4],RC4))
formulas, except when it is a single cell. max stands in the
following always for a concrete integer, which is written di- Errors are trapped, and when there is no error, INDEX
rectly into the formulas. returns the data from the suitable row of C[-4]. Thus the
Generally, we assume the arguments of the algebra oper- values from C1:C2 get relocated to their positions calculated
ators to be two- or three-ary relations or queries, the gener- in C3.
alization to higher arities is straightforward.
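The three standardization formulas (the running count in C3, the MATCH lookup in C4 and the error-trapped INDEX in C5:C6) can be traced in ordinary code. The following Python sketch is only an illustration under stated assumptions: a relation is a list of two-component tuples, a null row is the tuple ("", ""), and the names NULL and standardize are hypothetical.

```python
NULL = ("", "")  # assumed encoding of a null row


def standardize(rows):
    """Move null rows to the bottom, keeping the order of non-null rows."""
    # C3: number of non-null rows up to and including the present row.
    c3 = []
    count = 0
    for row in rows:
        if row[0] != "":
            count += 1
        c3.append(count)

    out = []
    for i in range(1, len(rows) + 1):   # i plays the role of ROW()
        try:
            # C4: MATCH(ROW(), C3, 0), the first row whose count equals i.
            src = c3.index(i)
        except ValueError:
            # No match corresponds to the spreadsheet error, which
            # IF(ISERROR(...), "", ...) turns into a null row.
            out.append(NULL)
            continue
        out.append(rows[src])            # C5:C6: INDEX(C1:C2, match)
    return out
```

For the loose input [NULL, (3, 4), NULL, (1, 2)] the running counts are 0, 1, 1, 2, so output rows 1 and 2 pick up (3, 4) and (1, 2), and the remaining rows become null rows.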
Except for standardization and sorting, in all other 4.4 Sorting
cases we assume the input to be in standard form, i.e., null Now we describe an implementation of sorting, which is a
rows at the bottom. generalization of standardization. We assume that columns
C1 and C2 contain the source data and we sort in ascending
4.3 Error trapping and standardization order by the values in C1.
In this section we describe two special purpose operators, C3 < =SUMPRODUCT((R1C1:RmaxC1<=RC1)*1)
which perform very common and useful tasks, specific to our
spreadsheet environment. This puts in RiC3 the number of entries in column C1
which are smaller than or equal to RiC1. "" compared by
4.3.1 Error trapping <= is larger than any number, so null rows do not give any
errors, and in the following are treated as the largest entries.
If we replace a formula =F, which may produce an error, C4 < =RC3-SUMPRODUCT((RC1:RmaxC1=RC1)*1)+1
ities does not exceed max.
Now in RiC4 is the number of entries in column C1 which Then use the following formulas to calculate their union
are either smaller than RiC1 or equal to it and located in the in standard bag form, which can be subsequently brought
same row or above it. This is the number of the row into to loose set form by duplicate removal and then to standard
which RiC1 should be relocated during sort. set form by standardization.
C5:C6 << =INDEX(R1C[-4]:RmaxC[-4], R1C5 < =COUNT(C1)
MATCH(ROW(),R1C4:RmaxC4,0))
This part is very similar to the standardization solution, This is the number of non-null rows of C1.
C6:C7 << =IF(ROW()<=R1C5,RC[-5],
except that there are no errors to be trapped and we combine INDEX(R1C[-3]:RmaxC[-3],ROW()-R1C5))
two formulas into one.
Sorting in descending order is done by reversing the signs If the present row number is less than or equal to R1C5 then we take
of numbers by the formula =IF(RC[-1]="","",-RC[-1]) and the same row from C1:C2, otherwise we take rows from C3:C4
sorting into ascending order, and then reversing the signs whose numbers are suitably shifted. Note that this works
again. This leaves the null rows at the bottom. In partic- when the inputs are standard (set or bag). Therefore, if
ular, if sorting is necessary there is no need to standardize the input relations are loose, they should be brought to the
first. standard form, before taking union.
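The union construction (COUNT in R1C5, then a row-number test in C6:C7) can be followed step by step in a small Python sketch. It assumes that both inputs are in standard form, have the same number of rows max, and that their cardinalities sum to at most max; the names NULL and union_bag are hypothetical.

```python
NULL = ("", "")  # assumed encoding of a null row


def union_bag(r, s):
    """Bag union of two standard-form relations stored in max rows each."""
    max_rows = len(r)
    # R1C5: number of non-null rows of the first relation (COUNT over C1).
    count_r = sum(1 for row in r if row[0] != "")

    out = []
    for i in range(1, max_rows + 1):      # i plays the role of ROW()
        if i <= count_r:
            out.append(r[i - 1])          # same row of C1:C2
        else:
            # INDEX(C3:C4, ROW()-R1C5): rows of s shifted down by count_r.
            out.append(s[i - count_r - 1])
    return out
```

Because s is in standard form, the rows copied after its last tuple are null rows, so the result is again in standard (bag) form.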
An important property of this operation is that rows with
empty string in the column on which the sort is performed,
4.9 Difference
are moved to the bottom. Consequently, sorting brings any Assume that we are given two relations located in C1:C2
query or relation to standard form. Moreover, this form and C3:C4, respectively. Then use the following formulas to
of sorting does not affect the relative order of tuples, which calculate their set difference.
C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*
have identical values in the column on which they are sorted.
(R1C4:RmaxC4=RC2))
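The sorting construction of Section 4.4 (the count in C3, the target position in C4 and the lookup in C5:C6) can be checked with the following Python sketch. It is a simulation under assumptions, not the paper's implementation: rows are two-component tuples sorted by the first component, ("", "") is a null row, and the helper leq imitates the spreadsheet rule that "" compares as larger than any number. All names are hypothetical.

```python
NULL = ("", "")  # assumed encoding of a null row


def leq(a, b):
    """Spreadsheet-style a <= b, where "" is larger than every number."""
    if a == "":
        return b == ""
    if b == "":
        return True
    return a <= b


def sort_rows(rows):
    """Stable ascending sort on the first column; null rows end up last."""
    n = len(rows)
    keys = [row[0] for row in rows]
    # C3: number of entries smaller than or equal to the present key.
    c3 = [sum(1 for k in keys if leq(k, key)) for key in keys]
    # C4: C3 minus the number of equal keys in this row or below, plus 1.
    c4 = [c3[i] - sum(1 for k in keys[i:] if k == keys[i]) + 1
          for i in range(n)]
    # C5:C6: output row i is the input row whose target position is i.
    return [rows[c4.index(i)] for i in range(1, n + 1)]
```

The values in C4 form a permutation of 1..n, which is why the lookup needs no error trapping, and ties keep their original relative order.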
4.5 Duplicate removal This calculates in RiC5 the number of times a tuple equal
Next we describe the implementation of duplicate removal, to RiC1:RiC2 appears in C3:C4.
which, among other things, converts its input data from bag C6:C7 << =IF(RC5=0,RC[-5],"")
to set semantics. For the purpose of illustration, we assume
the table to contain two columns C1:C2. Now if RiC5 is 0, we copy the row RiC1:RiC2 to the output,
C3 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2)) otherwise we replace it by a null row.
The set form of the result is inherited from the inputs, but
This causes RiC3 to contain the number of tuples from certainly may contain null rows and is therefore loose. How-
C1:C2 which are equal to RiC1:RiC2 and are located at the ever, this construction does not work for the bag format,
same level or above it. This number is 1 iff the row contains since in this case we should count the copies of identical rows
the first occurrence of this tuple. in both relations and put in the output a suitable number
C4:C5 << =IF(RC3=1,RC[-3],"") of such rows.
The more complicated construction which does work is as
Now the first occurrences of tuples are copied into C4:C5, follows:
the others are replaced by null rows. Standardization can be C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*
used to bring the result to the standard form, if desired. (R1C4:RmaxC4=RC2))
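The duplicate removal of Section 4.5 relies on SUMPRODUCT multiplying two 0/1 comparison arrays coordinate-wise and summing the result. A minimal Python sketch of that counting step follows; the tuple encoding of rows and the names NULL and remove_duplicates are assumptions made for illustration.

```python
NULL = ("", "")  # assumed encoding of a null row


def remove_duplicates(rows):
    """Keep the first occurrence of each tuple; later copies become null rows."""
    out = []
    for i, row in enumerate(rows):
        # C3: SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2)), i.e. the number of
        # rows from the top down to the present one that equal it.
        c3 = sum(int(a == row[0]) * int(b == row[1])
                 for a, b in rows[: i + 1])
        # C4:C5: copy the row only if this is its first occurrence.
        out.append(row if c3 == 1 else NULL)
    return out
```

The result is in loose set form: in this sketch a null input row is either copied unchanged or replaced by a null row, so it stays null in both cases.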
This, exactly as before, calculates in RiC5 the number of
4.6 Selection times a tuple equal to RiC1:RiC2 appears in C3:C4.
Assume that we are given a relation r located in C1:C2 C6 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))
and we want to compute σθ r, where θ is a boolean combi-
nation of equalities and inequalities concerning the values Now we calculate in RiC6 the number of times a tuple
of columns of r and constants. Then we use a spreadsheet equal to RiC1:RiC2 appears in C1:C2 in row i or above it.
formula expressing θ to substitute "" for the rows which do C7:C8 << =IF(RC5>=RC6,"",RC[-6])
not satisfy θ. This is best explained by an example: if θ
is (C1 ≤ 100 ∧ C2 > C1) ∨ C2 ≠ 175, then the selection is Now we replace by null rows the first RiC5 occurrences of
implemented by tuple RiC1:RiC2, and leave unaffected the remaining ones,
C3:C4 << =IF(OR(AND(RC1<=100,RC2>RC1), which gives the desired bag difference. The resulting relation
RC2<>175),RC[-2],"") is loose.
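The bag difference of Section 4.9 compares two counts per row: how often the tuple occurs in the second relation (C5) and which occurrence of the tuple the present row is within the first relation (C6). The following Python sketch illustrates this under the assumption that relations are lists of tuples with ("", "") as the null row; the names NULL and bag_difference are hypothetical.

```python
NULL = ("", "")  # assumed encoding of a null row


def bag_difference(r, s):
    """Bag difference r \\ s; the result is in loose form."""
    out = []
    for i, row in enumerate(r):
        # C5: number of copies of the present tuple in the second relation.
        c5 = sum(1 for t in s if t == row)
        # C6: number of copies of the tuple in r in this row or above it.
        c6 = sum(1 for t in r[: i + 1] if t == row)
        # C7:C8: the first c5 occurrences are cancelled, later ones survive.
        out.append(NULL if c5 >= c6 else row)
    return out
```

With three copies of (1, 2) in r and two in s, the first two copies in r are replaced by null rows and the third one is kept, which is the behavior the text requires of bag semantics.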
It leaves the result of the selection in a loose (set or bag, in- 4.10 Cartesian product
herited from the input) form, but, as always, can be brought Assume that we are given two relations located in C1:C2
to the standard form. and C3:C4, respectively, and that the product of their car-
dinalities does not exceed max.
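The selection example of Section 4.6, with θ = (C1 ≤ 100 ∧ C2 > C1) ∨ C2 ≠ 175, can be mirrored by a row-wise filter that writes a null row wherever θ fails. The Python sketch below is an illustration only; select, theta and NULL are hypothetical names, and null rows are skipped explicitly because Python, unlike a spreadsheet, refuses to compare "" with a number.

```python
NULL = ("", "")  # assumed encoding of a null row


def theta(c1, c2):
    """The example condition (C1 <= 100 and C2 > C1) or C2 != 175."""
    return (c1 <= 100 and c2 > c1) or c2 != 175


def select(rows, condition):
    """Replace every row that fails the condition by a null row."""
    # Null rows are passed through untouched (assumption of this sketch);
    # in the spreadsheet they would be copied as "" anyway.
    return [row if row != NULL and condition(*row) else NULL
            for row in rows]
```

The positions of the surviving tuples do not change, so the output is in loose form and can be standardized afterwards if required.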
4.7 Projection Then use the following formulas to calculate their Carte-
The case of projection is quite easy: it amounts to omit- sian product. The construction below works only for rela-
ting some columns from the input relation/query. tions in standard form, so if the inputs are loose, standard-
ization is necessary first.
4.8 Union R1C5 < =COUNT(R1C1:RmaxC1)
Assume that we are given two relations located in C1:C2
and C3:C4, respectively, and that the sum of their cardinal- R2C5 < =COUNT(R1C3:RmaxC3)
section, and is more dependent on the particular properties
We calculate the numbers of non-null rows in C1:C2 and of Excel.
C3:C4. Of the three parts of SQL: DDL, DML and DCL, the last
C6:C7 << =IF(ROW()<=R1C5*R2C5, one is irrelevant, since we construct a database for a single
INDEX(R1C[-5]:RmaxC[-5], user.
INT((ROW()-1)/R2C5)+1),"")
This creates R1C5 blocks, the i-th block being R2C5 copies 5.1 NULL values
of RiC1:RiC2. NULLs can be represented simply by the string NULL and
C8:C9 << =IF(ROW()<=R1C5*R2C5,
handled as such. This is not difficult, but rather tedious, since
INDEX(R1C[-5]:RmaxC[-5],
all the formulas, whether implementing DDL or DML state-
MOD(ROW()-1,R2C5)+1),"")
ments, must be adjusted to handle NULLs by introducing
This repeats in circular fashion the consecutive rows of conditional IFs which test if the argument is a NULL and
C3:C4 a total of R1C5 rounds. invoke either a special treatment of NULL or the standard
Note that in this case, the set or bag form of the initial formula for non-NULLs.
relations is inherited by their product.
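The Cartesian product of Section 4.10 turns the row number into a pair of indices: a block index selecting the tuple of the first relation and a position within the block selecting the tuple of the second. The Python sketch below follows that idea under stated assumptions: both inputs are in standard form with max rows each, the block index is taken to be the integer quotient of ROW()-1 by R2C5, and the names NULL4 and cartesian_product are hypothetical.

```python
NULL4 = ("", "", "", "")  # assumed encoding of a null row of arity 4


def cartesian_product(r, s):
    """Product of two standard-form relations stored in max rows each."""
    max_rows = len(r)
    n_r = sum(1 for row in r if row[0] != "")   # R1C5: tuples in C1:C2
    n_s = sum(1 for row in s if row[0] != "")   # R2C5: tuples in C3:C4

    out = []
    for i in range(1, max_rows + 1):            # i plays the role of ROW()
        if i <= n_r * n_s:
            left = r[(i - 1) // n_s]    # block index: quotient of ROW()-1
            right = s[(i - 1) % n_s]    # MOD(ROW()-1, R2C5)+1
            out.append(left + right)
        else:
            out.append(NULL4)           # unused space stays null
    return out
```

The test i <= n_r * n_s also guards the division, as the lazy IF does in the formula: when the second relation is empty the quotient is never computed.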
Figure 4: Cost of an insertion, for table from Example 3 with max equal 2500, and for foreign key table with
500 (triangles) and 1000 values (squares), respectively
Figure 5: Cost of sorting, for tables with max equal 2000 (triangles) and 5000 (squares)
Figure 6: Cost of computing query from Figure 2, for tables with max equal 1000 (diamonds), 1500 (triangles)
and 2000 (squares)