SIGMOD

This paper explores the potential of spreadsheets as relational database engines, demonstrating that they can perform data management and analysis using spreadsheet formulas without the need for macros. By implementing relational algebra operators through spreadsheet functions, users can create workbooks that function similarly to traditional databases, allowing for automatic query results as data is modified. The authors argue that spreadsheets deserve more theoretical analysis and research due to their widespread use and capabilities in end-user computing.

Spreadsheet As a Relational Database Engine

[Extended Abstract]

ABSTRACT

Without any doubt, spreadsheets are the most commonly used applications for data management and analysis. Perhaps they are even among the most widely used computer applications of all kinds. However, the spreadsheet paradigm of computation still lacks sufficient theoretical analysis.

In this paper we consider the relationship of spreadsheets to database systems. We demonstrate that a spreadsheet can play the role of a relational database engine, without any use of macros or built-in programming languages, merely by using spreadsheet formulas. We achieve that by implementing all operators of relational algebra by means of spreadsheet functions. Given a definition of a database (say in SQL), it is possible to construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. From then on, when the user enters, alters or deletes data in the data worksheets, the formulas in query worksheets automatically compute the actual results of the queries. Thus, the spreadsheet serves as data storage and executes SQL queries, and therefore acts as a relational database engine.

Syntactically and semantically, the paper is based on Microsoft Excel (TM) 2003 version, because so far there is no formal model of spreadsheets that might be used for that purpose. However, the presented constructions work in other spreadsheet systems, too.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - relational databases; H.4.1 [Information Systems Applications]: Office Automation - spreadsheets; K.8.1 [Personal Computing]: Application Packages - spreadsheets

General Terms
spreadsheets, relational databases

Keywords
spreadsheets, relational databases, relational algebra, SQL, end-user computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD 2010
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION

Spreadsheets are the end-user computing counterpart of databases and OLAP in enterprise-scale computing. They serve basically the same purpose, data management and analysis, but at the opposite extremes of the data quantity scale.

At the same time spreadsheets are extremely popular. Their users range from every(wo)men who manage their home budgets, to business professionals and researchers who create and examine extremely sophisticated models and data. For example, Science journal writes in its instructions for authors [4] as follows.

    In general, Science will accept the following nine categories of supporting online material:
    [...]
    8. Databases – In certain cases, Science will consider linked database presentations more complex than a flat text file or table; these can include, for example, tables hyperlinked to public sequence, array, or protein databases, or collections of hypertext tables or Excel files linked to explanatory image files or tables. Such presentations may require special treatment, and should be discussed in advance with the online editor. Submission of databases such as those described above will generally only be appropriate when the data in question can not be accommodated by an established public repository such as Genbank or PDB.

In practice, Excel files are quite common as a form of supporting online material in Science. The same journal provides an example of a scientific controversy [6, 3] which finally turned out to be related to the design of a spreadsheet used for data analysis.

Despite that, and surprisingly enough, spreadsheets, and the spreadsheet paradigm in general, lack sufficient theoretical analysis. There is even no formal model of spreadsheets, which might be the base of such analysis. There are only a few papers considering spreadsheets from the point of view of the functional programming paradigm [1, 2, 7, 9, 11], while
we think that spreadsheets constitute a paradigm by themselves. There is also vast literature devoted to the practice of using spreadsheets, and even the European Spreadsheet Risks Interest Group EuSpRIG http://www.eusprig.org with its annual conference.

It seems therefore surprising that the computer science community did not study any further that extremely popular, important and successful type of application.

In this paper we do not attempt to create a formal model of spreadsheets. Instead, we aim at providing strong evidence that spreadsheets are a very interesting type of software system and deserve more research. Specifically, we consider the relation of spreadsheets to database systems. It is a natural comparison, because spreadsheets indeed often play the role of small databases at the end-user level of computing.

We demonstrate that virtually any spreadsheet system is a relational database engine. We do so by implementing all operators of relational algebra using spreadsheet functions. For each query in SQL, we construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. As the user enters, alters or deletes data tuples in the data worksheets, the formulas in query worksheets automatically compute the actual results of the queries. Thus, the spreadsheet serves as data storage, and executes SQL queries. It is therefore a relational database engine. Consequently, any specification of a database, written in SQL in the form of table and view definitions, can be compiled into a spreadsheet workbook which has exactly the same functionality as if the database was implemented in a classical RDBMS. Crucially, this is achieved without any use of macros written in an external programming language, like Visual Basic or the like. One might consider our construction also as an implementation of a relational database on a completely new type of (virtual) hardware.

As a model of spreadsheet syntax and semantics we take Microsoft Excel (TM) (the general reference is [8]), but our constructions work in other similar systems, like OpenOffice Calc, gnumeric or Google docs, too.

2. TECHNICALITIES

The paper is written assuming Microsoft Excel (TM) 2003 as the target system. The newest (at the time of this writing) Excel 2007 provides a couple of new functions, which simplify some of the tasks, but are not present in other spreadsheet systems. Therefore we chose the older version.

2.1 R1C1 notation

We assume the reader to be familiar with spreadsheets. The choice of Excel is due to its popularity and the fact that it accepts the row-column R1C1-style addressing of cells and ranges, as opposed to, e.g., OpenOffice Calc, Google docs and similar tools. This notation is easier to handle in a formal description, although in everyday practice the equivalent A1 notation is dominating. The key advantage of the R1C1 notation is that the meaning of the formula is independent of the cell in which it is located.

In the R1C1 notation, both rows and columns of worksheets are numbered by integers from 1 onward (so that an Excel spreadsheet set to R1C1 notation can be easily distinguished from one in the classical A1 notation). For arbitrary nonzero integers i and j and nonzero natural numbers m, n the following expressions are cell references in the R1C1 notation: RmCn, R[i]Cm, RmC[j], R[i]C[j], RCm, RC[i], RmC, R[i]C.

The number after 'R' refers to the row number and the number after 'C' to the column number. If the number is missing, it means "same row (column)" as the cell in which this expression is used. If the number is written in square brackets, it is a relative reference and the cell to which this expression points should be determined by adding the number in brackets to the row (column) number of the present cell. Numbers without brackets are absolute references and refer to a cell whose row (column) number is equal to that number. For example, R[-1]C7 denotes a cell which is in the row directly above the present one in column 7, while RC[3] denotes a cell in the same row as the present one and 3 columns to the right. If R or C is itself omitted, the resulting expression denotes the whole column or row (respectively), e.g., C7 is the (whole) column number 7. For the purpose of data validation or for referencing cells in other worksheets, RC may also be used, and references the cell whose row and column numbers are equal to the address of the cell in which this expression is located. Ranges are composed generally from two cell references separated by a colon, and mean a rectangular area, spanned by the two cells.

max stands for the maximal number of rows permitted in a worksheet. This number may be imposed by the spreadsheet system that is used, or by the user who decides to limit the quantity of data that can be stored, in exchange for better performance.

2.2 IF function

IF is a conditional function in spreadsheets. Its syntax is IF(condition,true_branch,false_branch). What makes it unusual is that its evaluation is lazy, i.e., after the condition is evaluated and yields either TRUE or FALSE, only one of the branches is evaluated. This makes IF very useful. It can be used to protect functions from being applied to arguments of wrong types, trap errors, and, last but not least, to speed up execution of queries by avoiding lengthy computations in certain cases.

2.3 SUMPRODUCT function

We will often use a special function called SUMPRODUCT. It is one of the few functions which can operate on lists of data elements rather than on single ones. Its uses will be generally modifications of the following two examples.

Example 1.
=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)) is calculated as follows:

1. each cell in the range R1C1:R5C1 is compared with R1C3, and this yields a sequence of five booleans;
2. each cell in the range R1C2:R5C2 is compared with R1C4, and this yields another sequence of five booleans;
3. the two sequences from previous items are multiplied coordinate-wise, which results in automatic data type conversion from booleans to integers (with 1 corresponding to TRUE and 0 to FALSE), and then normal multiplication;
4. SUMPRODUCT then adds the five numbers up and produces a single number as a result.
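
The four evaluation steps listed for Example 1 can be traced outside the spreadsheet. The following is a minimal Python sketch, not part of the original construction: the function name sumproduct_match, the optional weights argument (a further numeric range multiplied in as one more factor) and the sample data are assumptions made purely for illustration.

```python
def sumproduct_match(col1, col2, key1, key2, weights=None):
    """Mimic =SUMPRODUCT((col1=key1)*(col2=key2)[*weights]) on Python lists.

    All names here are hypothetical; they are not spreadsheet functions.
    """
    # Step 1: compare every cell of the first range with the first key.
    flags1 = [cell == key1 for cell in col1]
    # Step 2: compare every cell of the second range with the second key.
    flags2 = [cell == key2 for cell in col2]
    # Step 3: coordinate-wise product; booleans are converted to 1 and 0.
    products = [int(a) * int(b) for a, b in zip(flags1, flags2)]
    # Optional further factor: multiply by a numeric range, row by row.
    if weights is not None:
        products = [p * w for p, w in zip(products, weights)]
    # Step 4: add the numbers up to a single result.
    return sum(products)


# Sample data standing in for R1C1:R5C1 and R1C2:R5C2 (made up).
col1 = [1, 2, 1, 3, 1]
col2 = [7, 7, 8, 7, 7]

count = sumproduct_match(col1, col2, 1, 7)
weighted = sumproduct_match(col1, col2, 1, 7, weights=[10, 20, 30, 40, 50])
```

With this sample data, count is 2 (rows 1 and 5 hold the pair 1, 7) and weighted is 60 (the sum of the weights 10 and 50 found in those two rows).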
Consequently, the final result is the number of rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

Example 2.
=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5) is calculated as follows:

1. the first three steps of evaluation are the same as before;
2. the sequence of 0s and 1s from the previous item and the range R1C5:R5C5 are multiplied again coordinate-wise, which results in a sequence of five numbers;
3. again the sum of the above five numbers is returned.

Consequently, the final result is the sum of values in C5, calculated over those which are located in rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

These two examples generalize to sum-multiplication of more than two or three arrays.

The behavior of SUMPRODUCT very much resembles the way array formulas are evaluated. In fact, the formulas

{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4))}

and

{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5)}

are exactly equivalent to our two examples.

3. ARCHITECTURE OF A DATABASE IMPLEMENTED IN A SPREADSHEET

3.1 Overview

In this paper, we disregard a number of minor issues arising in practical implementation of database operations in a spreadsheet. First of all, there is the obvious limit on the number and sizes of relations, views and their intermediate results, imposed by the maximal available number of worksheets, columns and rows in the spreadsheet system at hand. Next, the size of the data values (integers, strings, etc.) is also limited. The variety of data types in spreadsheets is also restricted when compared to database systems.

The overall architecture of a relational database implemented in a spreadsheet is as follows.

Given a specification of the database, an implementation of the database is created by an external program (which plays the role of query compiler), in the form of an .xls, .xlsx, .ods, etc., file.

The whole resulting database is a workbook, consisting of one worksheet per data table and one worksheet per view in the database.

Figure 1: The idea of a database implementation in a spreadsheet

The data table worksheets are where the data is entered, updated and deleted. In the case of the (more theoretical in flavor) implementation of the relational algebra, the data table sheets do not contain any formulas and are simply the place to enter tuples into relations. In the case of SQL implementation, the cells are equipped with data validation formulas, which perform data type verification, enforce PRIMARY KEY, FOREIGN KEY and other integrity constraints included in the CREATE TABLE statements.

The query (view) worksheets are not supposed to be edited by the user. They contain columns filled with formulas, which calculate the consecutive values of the result of the query. Besides the result columns of the query, the view worksheets can also contain a number of hidden columns, which calculate and store intermediate results emerging during query evaluation. It is important that the formulas are completely uniform in each column of the database workbook, and they do not depend on the data which will be stored in the application. Initially all formulas compute the empty string "" value, representing unused space. When the user manually enters data into the tables, the automatic recomputation of the spreadsheet causes the results of queries to be computed and appear in the view worksheets.

4. THEORETICAL LEVEL: RELATIONAL ALGEBRA

We assume the semantics over a fixed domain of (the spreadsheet's implementation of) integers, so that a relation is a set or multiset of tuples over the integers that are implemented in the spreadsheet software.

4.1 Compositionality

We assume the unnamed syntax for the relational algebra: relations and queries have columns, which are numbered and do not have any names. Sometimes we consider the expressions C1, C2, etc., as the names of the worksheet columns, as well as the names of the columns in relations.

The representation of a relation r of arity n is a group of n consecutive columns in a worksheet, whose rows contain the tuples in the relation. The rows in which there are no tuples of r are assumed to be filled with the empty string formula ="", evaluating to the empty string value "", which the user can replace by the new tuples of the relation. The empty string is never a component of a tuple in a relation or query. Therefore either all cells in a row contain the empty string, or none does. The rows of tables and queries evaluating to empty strings are called null rows henceforth.

The assumption that ="" formulas fill the empty rows of data tables is only for uniformity of presentation. A formula in a cell can not evaluate to "empty cell" (because the formula occupies that cell anyway), only to empty string. Therefore, if blank cells were used in empty rows, formulas expressing queries would have to be adapted to accept unused space in two different forms: empty cells in data tables, and empty strings in results of other queries. Moreover, blank cells are interpreted as 0 by many Excel functions, which makes formulas prepared for blank cells even more complicated.

The representation of a relational algebra query Q of arity m is a group of l + m consecutive columns in a worksheet. All its rows from 1 to max are filled with formulas (identical in all cells of each column), which calculate the tuples in Q. We assume that the formulas in the last m columns should return either (a component of) a tuple in the result of Q, or the empty string value "". The additional l columns are also filled with identical formulas, which calculate intermediate results. A worksheet of this kind can be created by entering the formulas in the first row, and then filling them downward to fill the first max rows. This uniformity assumption means in particular that the formulas are completely independent of the data they will work on. However, we would like to stress that there is no reason to reject nonuniform implementations, should they appear to be more effective or permit expressing queries inexpressible in a uniform way.

In the following we will consider both set and bag (multiset) semantics of the relational algebra. In the first case, duplicate rows are not permitted in the relations and queries, in the latter they are permitted. However, even in the set semantics a spreadsheet representation of a relation may contain many null rows.

Furthermore, the representation may be loose if null rows are interspersed with the tuples, or standard if all the tuples come first, followed by the null rows.

Consequently, we have loose-set, loose-bag, standard-set and standard-bag semantics.

No matter which of the above semantics we have in mind, the result of the query appears exactly as if it were a table, and can be used as such. Now the only thing necessary to compose queries is to locate their implementations side by side in a single worksheet and change input column numbers in the formulas computing the outermost query, to agree with the column numbers of the outputs of the argument queries (and then the output columns of the argument queries become the intermediate results columns of the composition).

Therefore, queries represented in this way are compositional.

Now it suffices to demonstrate that each of the following relational algebra operators from [5] can be implemented in a spreadsheet:

• Two operations peculiar to spreadsheets, absent in [5]:
  – Error trapping.
  – Standardization.
• Sorting.
• Duplicate removal δ r.
• Selection σ_θ r.
• Projection π_{i,j,...} r.
• Union r ∪ s.
• Difference r \ s.
• Cartesian product r × s.
• Grouping with aggregation γ_L r:
  – Grouping with SUM.
  – Grouping with COUNT.
  – Grouping with AVG.
  – Grouping with MAX and MIN.

Note that in Google docs spreadsheet there are special built-in operators for sorting and duplicate removal. Sorting is of course present in Excel and other spreadsheet systems (duplicate removal is additionally present in Excel 2007), but can not be invoked by a formula, and requires a sequence of clicks by the user. This can not be accepted, as we want the queries to compute automatically.

Generally, the sets of functions present in spreadsheets are highly redundant, so the same computation can be achieved
in many different ways. In this theoretical section we choose solutions which are common to most of (or even all) spreadsheet systems. In this way we believe that we address the spreadsheet paradigm itself, even though its definition has not yet been formulated in the literature.

Figure 2: The query SELECT lastname, AVG(income) FROM incomes GROUP BY lastname HAVING COUNT(*)>3, computing average family income, implemented in a spreadsheet. (Errors appearing in the worksheet are intended.)

4.2 Notation

We use the following convention for presenting queries implemented in a spreadsheet:

COLUMNS < =FORMULA

means that the =FORMULA is entered into the COLUMNS, which may be specified either to be a single column (e.g. C5) or a range of a few columns (e.g. C5:C7), or a single cell (e.g. R1C5), and in each case belongs to the columns with intermediate values. The formula

COLUMNS << =FORMULA

indicates that formulas located in COLUMNS calculate the output of the query.

In all cases, we fill the first max rows of the indicated columns.

Sometimes the output columns are not specified, and then it is always indicated that the output is computed by applying another, already defined operation to some of the columns with intermediate results. In any case, it is assumed that the first max rows of the COLUMNS are filled with formulas, except when it is a single cell. max stands in the following always for a concrete integer, which is written directly into the formulas.

Generally, we assume the arguments of the algebra operators to be binary or ternary relations or queries; the generalization to higher arities is straightforward.

Except for standardization and sorting, in all other cases we assume the input to be in standard form, i.e., null rows at the bottom.

4.3 Error trapping and standardization

In this section we describe two special purpose operators, which perform very common and useful tasks, specific to our spreadsheet environment.

4.3.1 Error trapping

If we replace a formula =F, which may produce an error, by the formula =IF(ISERROR(F),"",F), any error produced by =F is replaced by the empty string, and otherwise the value is the same as the value of =F.

4.3.2 Standardization

This operation converts a relation from loose to standard form, moving null rows to the bottom. The relative order of non-null rows is preserved. We assume that columns C1 and C2 contain the source data.

C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)

counts the non-null rows above the present row, including the present one. This number is the row number to which the present row should be relocated. Note that multiplication by 1 enforces boolean to integer conversion.

C4 < =MATCH(ROW(),R1C3:RmaxC3,0)

The function MATCH(ROW(),R1C3:RmaxC3,0) searches for the number of the present row (computed by ROW()) in C3 and returns the row number of the first exact match found. If no match is found (i.e., we are in a row whose number is higher than the total number of non-null rows), an error is returned.

C5:C6 << =IF(ISERROR(RC4),"",INDEX(R1C[-4]:RmaxC[-4],RC4))

Errors are trapped, and when there is no error, INDEX returns the data from the suitable row of C[-4]. Thus the values from C1:C2 get relocated to their positions calculated in C3.

4.4 Sorting

Now we describe an implementation of sorting, which is a generalization of standardization. We assume that columns C1 and C2 contain the source data and we sort in ascending order by the values in C1.

C3 < =SUMPRODUCT((R1C1:RmaxC1<=RC1)*1)

This puts in RiC3 the number of entries in column C1 which are smaller than or equal to RiC1. "" compared by <= is larger than any number, so null rows do not give any errors, and in the following are treated as the largest entries.

C4 < =RC3-SUMPRODUCT((RC1:RmaxC1=RC1)*1)+1
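
The two helper columns C3 and C4 of the sorting construction can be checked on a small column. The following is a minimal Python sketch, not part of the original construction: the function name sort_ranks and the sample column are hypothetical, and the comparison helper assumes the spreadsheet rule stated in the text that "" compares as larger than any number.

```python
def sort_ranks(col, null=""):
    """Compute the helper columns C3 and C4 of the sorting construction.

    col is a Python list standing in for R1C1:RmaxC1 (hypothetical data).
    """
    def leq(a, b):
        # Spreadsheet-style a <= b: the empty string ranks above every number.
        if a == null:
            return b == null
        return b == null or a <= b

    n = len(col)
    # C3: number of entries that are <= the entry of the present row.
    c3 = [sum(1 for x in col if leq(x, col[i])) for i in range(n)]
    # C4: C3 minus the number of equal entries in the present row or below,
    # plus 1. This counts the smaller entries plus the equal entries located
    # in the present row or above it.
    c4 = [c3[i] - sum(1 for x in col[i:] if x == col[i]) + 1 for i in range(n)]
    return c3, c4


column = [5, "", 3, 5, 1, ""]
c3, c4 = sort_ranks(column)

# Placing every entry at the (1-based) position given by C4 sorts the column.
relocated = [None] * len(column)
for value, target in zip(column, c4):
    relocated[target - 1] = value
```

On this sample, c3 is [4, 6, 2, 4, 1, 6] and c4 is [3, 5, 2, 4, 1, 6]. Since c4 is a permutation of 1..6, relocated becomes [1, 3, 5, 5, "", ""]: the two equal entries keep their relative order and the null rows end up at the bottom.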
Now in RiC4 is the number of entries in column C1 which are either smaller than RiC1 or equal to it and located in the same row or above it. This is the number of the row into which RiC1 should be relocated during sort.

C5:C6 << =INDEX(R1C[-4]:RmaxC[-4],MATCH(ROW(),R1C4:RmaxC4,0))

This part is very similar to the standardization solution, except that there are no errors to be trapped and we combine two formulas into one.

Sorting in descending order is done by reversing the signs of numbers by the formula =IF(RC[-1]="","",-RC[-1]) and sorting into ascending order, and then reversing the signs again. This leaves the null rows at the bottom. In particular, if sorting is necessary there is no need to standardize first.

An important property of this operation is that rows with the empty string in the column on which the sort is performed are moved to the bottom. Consequently, sorting brings any query or relation to standard form. Moreover, this form of sorting does not affect the relative order of tuples which have identical values in the column on which they are sorted.

4.5 Duplicate removal

Next we describe the implementation of duplicate removal, which, among other things, converts its input data from bag to set semantics. For the purpose of illustration, we assume the table to contain two columns C1:C2.

C3 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))

This causes RiC3 to contain the number of tuples from C1:C2 which are equal to RiC1:RiC2 and are located at the same level or above it. This number is 1 iff the row contains the first occurrence of this tuple.

C4:C5 << =IF(RC3=1,RC[-3],"")

Now the first occurrences of tuples are copied into C4:C5, the others are replaced by null rows. Standardization can be used to bring the result to the standard form, if desired.

4.6 Selection

Assume that we are given a relation r located in C1:C2 and we want to compute σ_θ r, where θ is a boolean combination of equalities and inequalities concerning the values of columns of r and constants. Then we use a spreadsheet formula expressing θ to substitute "" for the rows which do not satisfy θ. This is best explained on an example: if θ is (C1 ≤ 100 ∧ C2 > C1) ∨ C2 ≠ 175, then the selection is implemented by

C3:C4 << =IF(OR(AND(RC1<=100,RC2>RC1),RC2<>175),RC[-2],"")

It leaves the result of the selection in a loose (set or bag, inherited from the input) form, but, as always, can be brought to the standard form.

4.7 Projection

The case of projection is quite easy: it amounts to omitting some columns from the input relation/query.

4.8 Union

Assume that we are given two relations located in C1:C2 and C3:C4, respectively, and that the sum of their cardinalities does not exceed max. Then use the following formulas to calculate their union in standard bag form, which can be subsequently brought to loose set form by duplicate removal and then to standard set form by standardization.

R1C5 < =COUNT(C1)

This is the number of non-null rows of C1.

C6:C7 << =IF(ROW()<=R1C5,RC[-5],INDEX(R1C[-3]:RmaxC[-3],ROW()-R1C5))

If the present row number does not exceed R1C5 then we take the same row from C1:C2, otherwise we take rows from C3:C4 whose numbers are suitably shifted. Note that this works when the inputs are standard (set or bag). Therefore, if the input relations are loose, they should be brought to the standard form before taking union.

4.9 Difference

Assume that we are given two relations located in C1:C2 and C3:C4, respectively. Then use the following formulas to calculate their set difference.

C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*(R1C4:RmaxC4=RC2))

This calculates in RiC5 the number of times a tuple equal to RiC1:RiC2 appears in C3:C4.

C6:C7 << =IF(RC5=0,RC[-5],"")

Now if RiC5 is 0, we copy the row RiC1:RiC2 to the output, otherwise we replace it by a null row.

The set form of the result is inherited from the inputs, but certainly may contain null rows and is therefore loose. However, this construction does not work for the bag format, since in this case we should count the copies of identical rows in both relations and put in the output a suitable number of such rows.

The more complicated construction which does work is as follows:

C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*(R1C4:RmaxC4=RC2))

This, exactly as before, calculates in RiC5 the number of times a tuple equal to RiC1:RiC2 appears in C3:C4.

C6 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))

Now we calculate in RiC6 the number of times a tuple equal to RiC1:RiC2 appears in C1:C2 in row i or above it.

C7:C8 << =IF(RC5>=RC6,"",RC[-6])

Now we replace by null rows the first RiC5 occurrences of tuple RiC1:RiC2, and leave unaffected the remaining ones, which gives the desired bag difference. The resulting relation is loose.

4.10 Cartesian product

Assume that we are given two relations located in C1:C2 and C3:C4, respectively, and that the product of their cardinalities does not exceed max.

Then use the following formulas to calculate their Cartesian product. The construction below works only for relations in standard form, so if the inputs are loose, standardization is necessary first.

R1C5 < =COUNT(R1C1:RmaxC1)

R2C5 < =COUNT(R1C3:RmaxC3)
We calculate the numbers of non-null rows in C1:C2 and C3:C4.

C6:C7 << =IF(ROW()<=R1C5*R2C5,INDEX(R1C[-5]:RmaxC[-5],INT((ROW()-1)/R2C5)+1),"")

This creates R1C5 blocks, the i-th block being R2C5 copies of RiC1:RiC2.

C8:C9 << =IF(ROW()<=R1C5*R2C5,INDEX(R1C[-5]:RmaxC[-5],MOD(ROW()-1,R2C5)+1),"")

This repeats in circular fashion the consecutive rows of C3:C4 a total of R1C5 rounds.

Note that in this case, the set or bag form of the initial relations is inherited by their product.

4.11 Grouping with aggregation

In the following, we assume always the relation to be located in C1:C3, grouping done over C1:C2 and aggregation over C3.

4.11.1 GROUP BY with SUM

C4 < =SUMPRODUCT((R1C1:RmaxC1=RC1)*(R1C2:RmaxC2=RC2)*R1C3:RmaxC3)

This array formula computes in RiC4 the sum of all RjC3 over all j such that RjC1:RjC2 is equal to RiC1:RiC2.

Now we do duplicate elimination over C1:C2 and C4 and that is the desired result.

4.11.2 GROUP BY with COUNT

C4 < =SUMPRODUCT((R1C1:RmaxC1=RC1)*(R1C2:RmaxC2=RC2))

This is quite similar to the previous case, except that C4 computes counts of rows rather than sums.

4.11.3 GROUP BY with AVG

One has to compute GROUP BY with SUM and GROUP BY with COUNT side-by-side and return the copy of C1:C2 plus the sum column divided by the count column.

4.11.4 GROUP BY with MAX and MIN

Let us consider MAX, the other being handled symmetrically. First, the whole relation is sorted into descending order by C3. On the result, elimination of duplicates is performed, which however considers two rows identical already when they agree on C1:C2. Our implementation of this operation eliminates all occurrences of a tuple except the very first one. In this case, the one left is accompanied by the

section, and is more dependent on the particular properties of Excel.

Of the three parts of SQL: DDL, DML and DCL, that last one is irrelevant, since we construct a database for a single user.

5.1 NULL values

NULLs can be represented simply by the string NULL and handled as such. This is not difficult, but rather tedious, since all the formulas, whether implementing DDL or DML statements, must be adjusted to handle NULLs by introducing conditional IFs which test if the argument is a NULL and invoke either a special treatment of NULL or the standard formula for non-NULLs.

5.2 DDL

Let's discuss DDL, i.e., mainly CREATE TABLE statements. We adopt the option to distinguish the data table from its input area. So for each CREATE TABLE statement we create a separate data table and a separate input table. The latter is indeed a query table (see below for details), which filters tuples which do not satisfy integrity constraints included in the DDL statement and displays a warning message for the user. The former then fetches the rows which satisfy the integrity constraints (by merely looking if there is a warning message or not), and does standardization.

We assume the user to enter data elements adding them at the bottom. If elements are removed (simply using the DEL key), no new elements are added at their positions. Updates are performed by removing the old version of the tuple and immediately adding the new one at the bottom.

Function TYPE allows one to distinguish text, booleans and numbers, which, together with length function LEN for strings and inequalities for numbers, allow one to enforce data type declarations. There are a few limitations to this rule, e.g., the empty string "" plays a special role in our implementation of relational algebra operators, and so does the string "NULL", which imposes a (mild) restriction on what kind of strings can be used. For the DATE statement, however, one has to use formatting instead, which enforces numbers to be interpreted and displayed as dates.

UNIQUE and PRIMARY KEY are enforced by the duplicate elimination operator described above, which rejects tuples which have already appeared before.

FOREIGN KEY statements are enforced by a semijoin query, which can be constructed using already described algebra operators. It does not seem that there is an easy method to implement policies concerning behavior of the database when one deletes a foreign key for a tuple, except the CASCADE option. This one is completely automatic: when the foreign key disappears, the tuples which reference it become illegal and disappear from the data table (because the formulas which transfer them to the data table return "" in the
maximal value of C3, as desired.
absence of the foreign key), even though they remain in the
4.12 Summary input table (where a warning appears).
Concerning INDEX, indexes can not be created in the usual
At this point we have already achieved the main goal of
sense, but there is a simple method which helps in some situ-
this paper. We have demonstrated that spreadsheets can
ations when index does. It amounts to creating a copy of the
implement and execute all relational algebra queries.
relation sorted by the column with the INDEX. Experiments
show, that searching with MATCH function is faster on sorted
5. PRACTICAL LEVEL: SQL columns, which already speeds up queries. Furthermore, one
This part is devoted to the discussion of the implementa- can create a separate table with the unique values from this
tion issues of SQL-92. It is less detailed than the previous column, along with the numbers of their occurrences in the
original sorted table. This can considerably speed up, e.g., the computation of equijoins.

Example 3. Let us consider the following DDL statement:

CREATE TABLE Orders(
    Id INT UNSIGNED NOT NULL PRIMARY KEY,
    ModelID INT NOT NULL REFERENCES Models (ModelID),
    Version SMALLINT,
    ModelDescrip VARCHAR(40));

We assume the following:

• the input worksheet for the above table is OrdersInput, and the worksheet of the data table is OrdersData;
• worksheet ModelsData keeps in column C1 the primary key referenced above;
• size limits for INT and SMALLINT are M and N, respectively;
• the limit for the number of rows in data tables is max.

The following formula is placed in column C5:

=IF(AND(RC1="",RC2="",RC3="",RC4=""),"",
IF(OR(RC1="",RC2="",RC3="",RC4=""),"Invalid data",
IF(
AND(
IF(TYPE(RC1)=1,INT(RC1)=RC1,FALSE),
RC1>=0,RC1<=M,
COUNTIF(R1C1:RC1,RC1)=1,
COUNTIF(ModelsData!R1C1:RmaxC1,RC2)=1,
IF(TYPE(RC3)=1,
AND(INT(RC3)=RC3,ABS(RC3)<=N),
RC3="NULL"),
TYPE(RC4)=2,LEN(RC4)<=40),
"","Invalid data")))

The formula is one big IF, which behaves as follows (numbers in parentheses refer to the lines of the formula above): it returns an empty string when the row is a null row (1); otherwise an error message if at least one (but not all) of its fields is "" (2); and otherwise again "" if all of the following conditions hold:

(5-6) The first column contains a number whose integer part is equal to itself, is nonnegative and does not exceed M (note that we used IF: this function has lazy evaluation in Excel, hence the function INT is never applied to non-numbers and does not give any error message).

(7) RC1 appears for the first time in its column (COUNTIF is a single-column equivalent of the SUMPRODUCT formulas we have used elsewhere).

(8) RC2 appears exactly once in the table ModelsData in the first column (assumed to contain the primary key of that relation) and in this branch of the initial IFs it is not "" (in an extremely rare case the foreign key column might contain exactly one "" value). Note that it is not necessary to verify that RC2 is a nonnegative integer in the specified range, because it is enforced in the foreign key table, so the count takes care of it.

(9-11) If RC3 is a number then it must be an integer whose absolute value does not exceed N, and if it is not a number, then it is the string "NULL".

(12) RC4 is a text, whose length is in the specified range. Function LEN accepts numbers as inputs, so there is no need to protect its uses by IF.

Then the OrdersData worksheet contains formulas

=IF(OrdersInput!RC5="Invalid data","",OrdersInput!RC)

in all its four columns.

As can be seen from the example, the actual translation of the CREATE TABLE statements can be quite complicated. The main reason is that now we have many data types and some of the functions must be prevented from being applied to arguments of the wrong type, NULLs may show up, etc.

5.3 DML

As we have already demonstrated, spreadsheets have the full power of executing relational algebra queries, i.e., all SELECT queries of SQL-92 can be evaluated. Note that Example 3 is really an example of a (rather simple) query, too. Besides data type verification, that query does a semijoin to check the FOREIGN KEY statement and duplicate elimination to satisfy the PRIMARY KEY declaration.

As the user communicates with the database directly through its input tables, there is no need to implement INSERT, DELETE and UPDATE statements, although individual inserts can easily be combined with the DDL declarations and executed by the compiler when creating the spreadsheet implementation of the database.

6. PERFORMANCE

Unfortunately, Excel and other spreadsheets have not been designed to serve as database engines, so we can not expect very good performance of our implementations. Array and aggregation formulas generally do linear scans of their arguments, and they are used in a linear number of cells. Recomputation of cells, whether invoked automatically or manually, always applies to all of them, so they produce quadratic algorithms. Of course, there are still possibilities to get some improvement (at least of the constants), by using dynamic algorithms, which compute the values in cells accessing only a few neighboring cells, or by exploiting the lazy evaluation of IF statements to prune the computation trees significantly. This area is largely unexplored, as is the whole problem of optimizing queries to be executed in a spreadsheet.

Apart from reducing the cost of operations, the other important possibility is to reduce recomputation. Namely, some of the systems (including Excel again) permit references not only to other worksheets, but to other workbooks (i.e., files), too. This gives the possibility to locate each query in a different file and open it only when it is necessary. It is then recomputed, but other queries are not. In particular, when working with data tables no queries need to be open. This kind of architecture is shown in Figure 1 at the beginning of the paper. A similar solution is to work with tables and queries located in worksheets of one workbook, but with automatic recomputation turned off. Instead, as an act of computing a query, the user manually
orders recomputation only of the currently active worksheet (Shift-F9 in Excel).

Below we present a few performance tests. All of them were conducted using Excel 2003 running on one core of Intel(R) Core(TM)2 Duo CPU at 2.40GHz, in a laptop with 2 GB RAM and Windows XP Professional SP3. We do not claim that the findings of this section carry over to other spreadsheet systems. All charts appear at the end of the paper.

6.1 Impact of optimization

We discuss only two simple optimization techniques, which do not go beyond improving particular operations, and do not consider at all the choice of better logical query plans. The first of them is avoiding computations on null rows by utilizing the lazy evaluation of IF. It is assumed to be used in all experiments below. Other experiments, whose results are not shown here, indicate that it reduces time cost significantly when there are many null rows in the tables, and does not create any significant overhead when there are only few of them.

The other possible optimization is to choose between array formulas, built-in aggregating functions and dynamic algorithms. We illustrate this on the example of the standardization operator.

The implementation described in Section 4 is as follows:

C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)

C4 < =MATCH(ROW(),R1C3:RmaxC3,0)

C5:C6 << =IF(ISERROR(RC4),"",
    INDEX(R1C[-4]:RmaxC[-4],RC4))

However, we have at least three other options to count the number of non-null rows above or at the level of the present row, used in C3.

The array-formula solution is to use

C3 < {=SUM((R1C1:RC1<>"")*1)}

The dynamic programming solution is

R1C3 < =IF(R1C1="",0,1)

R2C3:RmaxC3 < =IF(RC1="",R[-1]C,R[-1]C+1)

The aggregate solution is

C3 < =ROW()-COUNTIF(R1C1:RC1,"")+1

It is very instructive to compare their performance in query tables with max ranging between 50 and 5000, full of data in each case (other experiments indicate that the cost of this query does not depend significantly on the number of null rows in the table) in Figure 3. Remember that the complete implementation we test contains three other aggregating functions in each row, which remain there in all cases.

The results suggest that SUMPRODUCT is just an alias for an array formula, since their performances are precisely identical, at least in this context.

6.2 Performance tests

6.2.1 Insertions

All tests assume a relation in standard form, and the implementations were optimized by using IF to test whether the rows are null, to avoid high computation costs on them.

The first test was conducted on a table from Example 3. As we can see in Figure 4, a user of an average computer who is willing to wait 1 second for a result of his actions can store more than 2500 tuples in a table with integrity constraints of medium complexity. The cost depends on the size of the table with the foreign keys.

6.2.2 Sorting

Sorting is more time-consuming, and the cost depends on how large the table is and how many tuples are already stored in it (we assume no integrity constraints on the table), as illustrated in Figure 5. However, the value of max does not influence the performance of the operation very significantly. Remember that a table with max means a table filled with max rows of formulas designed for sorting up to max rows, which are recalculated no matter how many tuples are in the relation at the moment. Again, the 1 second limit is located at about 2500 tuples.

6.2.3 Average family income query

The average family income query from Figure 2 is the last example. Besides the IF optimization, the implementation also uses the fastest standardization from Section 6.1, based on dynamic programming. Still, the query proves to be more time-consuming than the previous operations. Again, the value of max does not change the cost very much.

6.3 Summary

The general observation is that all the costs are indeed O(n^2), but the good news is that the constants are rather small and the times remain reasonable for a few thousand tuples. Moreover, it seems that the main factor is the quantity of data, rather than the size of the initial table.

7. CONCLUSIONS, FURTHER RESEARCH

We have demonstrated that relational algebra can be naturally expressed in a spreadsheet, thus showing the power of the spreadsheet paradigm, which subsumes on the theoretical level the paradigm of relational databases. This can be understood as an implementation of a relational database on a completely new type of (virtual) hardware. Of course, in practice the effectiveness of this database is low.

This immediately raises a number of new questions and problems.

• Can a small database be practically implemented in a spreadsheet, yielding a really useful application? Our performance tests suggest that it might be possible for storing a few thousand tuples of data.

• Can a database project written in SQL and compiled to a spreadsheet serve (and be useful) as a rapidly created prototype of that database? The advantage of this solution is that the spreadsheet would then provide an instant, friendly user interface for experiments and demonstrations.

• Develop a methodology to optimize SQL queries executed in a spreadsheet.

• Can spreadsheets execute queries not expressible in SQL-92? In particular, can spreadsheets execute recursive queries, like those WITH ...SELECT in SQL-99,
or those in Datalog? It seems that the answer is negative, but a proof of impossibility requires a formal model of spreadsheets, which does not exist so far.

• Our implementations of SQL-92 queries use uniform spreadsheets, in which all rows of a query table are identical. Could nonuniformity help express more queries?

• Can spreadsheets naturally implement other models of databases, like semi-structured or object-relational ones?

• What is the ultimate limit of the spreadsheet paradigm of computation?

8. ACKNOWLEDGMENTS

I would like to thank several people, however in this blind version all I can say is "Wovon man nicht sprechen darf, darüber muß man schweigen" ("Whereof one may not speak, thereof one must be silent"; compare [10, Proposition 7]).

9. REFERENCES

[1] R. Abraham and M. Erwig. Type inference for spreadsheets. In PPDP ’06: Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, pages 73–84, New York, NY, USA, 2006. ACM.
[2] M. M. Burnett, J. W. Atwood, R. W. Djang, J. Reichwein, H. J. Gottfried, and S. Yang. Forms/3: A first-order visual language to explore the boundaries of the spreadsheet paradigm. J. Funct. Program., 11(2):155–206, 2001.
[3] E. J. Chesler, S. L. Rodriguez-Zas, J. S. Mogil, A. Darvasi, J. Usuka, A. Grupe, S. Germer, D. Aud, J. K. Belknap, R. F. Klein, M. K. Ahluwalia, R. Higuchi, and G. Peltz. In silico mapping of mouse quantitative trait loci. Science, 294(5551):2423, 2001. In Technical Comments.
[4] Science. Preparing Your Supporting Online Material. http://www.sciencemag.org/about/authors/prep/prep_online.dtl, accessed 20/10/2009.
[5] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice-Hall, 2000.
[6] A. Grupe, S. Germer, J. Usuka, D. Aud, J. K. Belknap, R. F. Klein, M. K. Ahluwalia, R. Higuchi, and G. Peltz. In silico mapping of complex disease-related traits in mice. Science, 292(5523):1915–1918, 2001.
[7] S. P. Jones, A. Blackwell, and M. Burnett. A user-centred approach to functions in Excel. In ICFP ’03: Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, pages 165–176, New York, NY, USA, 2003. ACM.
[8] Microsoft Corporation. Excel Home Page - Microsoft Office Online. http://office.microsoft.com/en-us/excel/default.aspx, accessed 20/10/2009.
[9] D. Wakeling. Spreadsheet functional programming. J. Funct. Program., 17(1):131–143, 2007.
[10] L. Wittgenstein. Logisch-philosophische Abhandlung, Tractatus logico-philosophicus. Suhrkamp, Frankfurt am Main, 1998. Kritische Edition.
[11] A. G. Yoder and D. L. Cohn. Real spreadsheets for real programmers. In H. E. Bal, editor, Proceedings of the IEEE Computer Society 1994 International Conference on Computer Languages, May 16-19, 1994, Toulouse, France, pages 20–30, 1994.
Figure 3: Costs of four solutions of standardization, for table with max between 50 and 5000
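
To make the difference between the counting variants compared in Figure 3 more concrete, the following Python sketch paraphrases two of them outside a spreadsheet: the rescanning count (the behaviour of the SUMPRODUCT and array SUM formulas in column C3) and the running total (the dynamic programming formula). It also mimics the MATCH/INDEX step that moves non-null entries to the top. The function names are hypothetical, and a Python list stands in for a spreadsheet column; this is an illustration under those assumptions, not the implementation measured in the paper.

```python
from itertools import accumulate


def count_by_rescan(col):
    """Each row rescans rows 1..i, like SUMPRODUCT((R1C1:RC1<>"")*1).

    The total work over all rows is quadratic in the column length.
    """
    return [sum(1 for v in col[: i + 1] if v != "") for i in range(len(col))]


def count_by_running_total(col):
    """Each row looks only at the row above, like IF(RC1="",R[-1]C,R[-1]C+1).

    The total work over all rows is linear in the column length.
    """
    return list(accumulate(0 if v == "" else 1 for v in col))


def standardize(col):
    """Move non-null entries to the top and pad with "" (MATCH + INDEX step)."""
    counts = count_by_running_total(col)
    out = []
    for k in range(1, len(col) + 1):
        try:
            pos = counts.index(k)  # MATCH(ROW(), counts, 0): first exact match
        except ValueError:
            out.append("")  # the ISERROR branch of the formula
        else:
            out.append(col[pos])  # the INDEX branch of the formula
    return out


column = ["a", "", "b", "", "", "c"]
print(count_by_rescan(column))
print(count_by_running_total(column))
print(standardize(column))
```

Both counting functions return the same list; they differ only in how much of the column each row has to read, which is the effect the chart is about.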

Figure 4: Cost of an insertion, for table from Example 3 with max equal 2500, and for foreign key table with
500 (triangles) and 1000 values (squares), respectively
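
The insertion cost in Figure 4 is the cost of re-evaluating the column C5 integrity formula of Example 3. As a reading aid, the logic of that formula can be paraphrased in Python as below. The names `check_row`, `is_number` and `is_whole` are hypothetical, rows are modelled as 4-tuples, and Python's short-circuit `and` plays the role that the lazy IF plays in Excel. The sketch follows the prose explanation of the formula and is not a substitute for it.

```python
INVALID = "Invalid data"


def is_number(v):
    # Rough analogue of TYPE(v)=1; booleans are a separate type in Excel.
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def is_whole(v):
    # Rough analogue of IF(TYPE(v)=1, INT(v)=v, FALSE).
    return is_number(v) and float(v).is_integer()


def check_row(row, earlier_ids, model_ids, M, N):
    """Return "" for an acceptable (or entirely empty) row, else INVALID.

    row         -- (Id, ModelID, Version, ModelDescrip)
    earlier_ids -- Id values of the rows above (for the PRIMARY KEY test)
    model_ids   -- column C1 of ModelsData (for the FOREIGN KEY test)
    M, N        -- size limits for INT and SMALLINT
    """
    if all(v == "" for v in row):
        return ""  # formula line 1: null row
    if any(v == "" for v in row):
        return INVALID  # formula line 2: partially filled row
    rid, model, version, descr = row
    if not (is_whole(rid) and 0 <= rid <= M):  # lines 5-6
        return INVALID
    if rid in earlier_ids:  # line 7: first occurrence only
        return INVALID
    if model_ids.count(model) != 1:  # line 8: exactly one match
        return INVALID
    if is_number(version):  # lines 9-11
        if not (is_whole(version) and abs(version) <= N):
            return INVALID
    elif version != "NULL":
        return INVALID
    if not (isinstance(descr, str) and len(descr) <= 40):  # line 12
        return INVALID
    return ""


models = [10, 20]
print(check_row((1, 10, 3, "basic"), [], models, M=1000, N=100) == "")
print(check_row((1, 30, 3, "basic"), [], models, M=1000, N=100))
```

The data worksheet then copies a row only when the check returns "", which corresponds to the OrdersData formula of Example 3.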
Figure 5: Cost of sorting, for tables with max equal 2000 (triangles) and 5000 (squares)
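
Sorting matters beyond ORDER BY because Section 4.11.4 builds GROUP BY with MAX on top of it: sort by C3 in descending order, then keep only the first row of each C1:C2 group. A minimal Python sketch of that two-step idea follows. The function name `group_max` and the tuple representation of rows are assumptions made for illustration, and Python's built-in sort stands in for the spreadsheet sorting formulas, which are not shown in this part of the paper.

```python
def group_max(rows, use_min=False):
    """GROUP BY (c1, c2) with MAX(c3), by sorting and keeping first occurrences.

    rows    -- iterable of (c1, c2, c3) tuples
    use_min -- sort ascending instead, which yields MIN(c3) symmetrically
    """
    # Step 1: sort the whole relation by the aggregated column.
    ordered = sorted(rows, key=lambda r: r[2], reverse=not use_min)
    # Step 2: duplicate elimination that compares only (c1, c2) and keeps
    # the very first occurrence, which carries the extreme value of c3.
    seen = set()
    result = []
    for c1, c2, c3 in ordered:
        if (c1, c2) not in seen:
            seen.add((c1, c2))
            result.append((c1, c2, c3))
    return result


relation = [("a", 1, 5), ("a", 1, 9), ("b", 2, 3), ("a", 1, 7), ("b", 2, 4)]
print(group_max(relation))
print(group_max(relation, use_min=True))
```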

Figure 6: Cost of computing query from Figure 2, for tables with max equal 1000 (diamonds), 1500 (triangles)
and 2000 (squares)
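
The quadratic growth visible in these charts comes from formulas such as the SUMPRODUCT of Section 4.11.1, where every row rescans the whole relation. The Python sketch below mirrors that pattern for GROUP BY with SUM, COUNT and AVG (Sections 4.11.1 to 4.11.3): a per-row rescan followed by duplicate elimination. The function name and the tuple representation are hypothetical, and the sketch makes no claim about how the query of Figure 2 itself is laid out.

```python
def group_sum_count_avg(rows):
    """GROUP BY (c1, c2) with SUM, COUNT and AVG of c3, spreadsheet style.

    rows -- list of (c1, c2, c3) tuples with numeric c3
    """
    per_row = []
    for c1, c2, _ in rows:
        # Like SUMPRODUCT((C1=RC1)*(C2=RC2)*C3): scan all rows for this row.
        total = sum(v for a, b, v in rows if (a, b) == (c1, c2))
        # Like SUMPRODUCT((C1=RC1)*(C2=RC2)): same scan, counting matches.
        count = sum(1 for a, b, v in rows if (a, b) == (c1, c2))
        # AVG is the sum column divided by the count column.
        per_row.append((c1, c2, total, count, total / count))
    # Duplicate elimination keeping the first occurrence of each result row.
    return list(dict.fromkeys(per_row))


relation = [("x", 1, 10), ("y", 2, 4), ("x", 1, 20), ("y", 2, 6), ("x", 1, 30)]
for line in group_sum_count_avg(relation):
    print(line)
```

With n rows the two scans inside the loop each touch n rows, which gives the O(n^2) behaviour summarised in Section 6.3; a hash-based grouping would avoid it, but that is not what the spreadsheet formulas do.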
