Spreadsheet Visualization Simplified
All content following this page was uploaded by Yirsaw Ayalew on 26 May 2014.
Faculty of Science
Department of Computer Science
By
Supervised By
June 2008
Dedication
I would like to dedicate this work to my parents: my late father, Mr. Freinderson
Ishmael Kankuzi and my mother, Mrs Ireen NyaMayuni Kankuzi.
Approval
This dissertation has been examined as meeting the requirements for the partial
fulfillment of Master of Science Degree in Computer Science.
—————————– ———————–
Supervisor Date
—————————– ———————–
Internal Examiner Date
—————————– ———————–
External Examiner Date
—————————– ———————–
Head of Department Date
—————————– ———————–
Dean, School of Graduate Studies Date
Acknowledgements
Firstly, I would like to thank God for giving me strength and courage in the course
of carrying out this work! A big thank you also goes to my supervisor, Dr Yirsaw
Ayalew, who tirelessly guided me in the course of this work. I also thank Dr Ayalew
for introducing me to academic research as well as an exciting world of spreadsheet
research. My other vote of thanks goes to Dr Stephen Kobourov of the University of
Arizona, who co-supervised me in the initial stages of this work and also
provided me with the open-source code of the Graphael graph drawing software.
My heartfelt thanks also go to Mr. Y. Alide and Dr. P.C. Chamdimba, both from
the University of Malawi, for all the encouragement and support. May God richly
bless you. I would also like to thank all friends and relatives who gave me support
in the course of the work.
Finally, I also thank God for the ‘insights’ in the course of this work such that
a number of research papers have been published out of this research work.
This document has been produced with TeXnicCenter, free and open-source software
for the LaTeX typesetting system. I am also grateful to its developers.
Declaration
I hereby declare that this is my original work, except where due reference is made,
and that this dissertation has not been submitted for any degree award in any other
university.
Signed: ———————————–
Bennett Freinderson Kankuzi (STUDENT)
Abstract
Spreadsheet systems are widely used and highly popular end-user systems. They are
highly popular because of the simplicity with which one can create spreadsheets. How-
ever, despite this simplicity in creating spreadsheets, they are generally difficult to
understand and comprehend. The need for understanding spreadsheets arises when
one wants to debug a spreadsheet, as well as when one wants to maintain or even just
to understand a spreadsheet created by others. One contributing factor to the difficulty
in understanding spreadsheets is the invisibility of the data dependencies
which are associated with cell formulas.
This research work aims to provide a graph-based visualization approach that can
simplify understanding and debugging of spreadsheets based on the MCL (Markov
Clustering) algorithm. The MCL algorithm helps in visualizing spreadsheet data-
flow graphs by generating clusters of cells. Navigation through graph clusters is pro-
vided through complementary techniques of compound fisheye views and treemaps.
More importantly, our experiments show that graph-based visualization using the
MCL algorithm generates clusters which match the corresponding logical areas of a
spreadsheet. Identified MCL clusters are then dynamically highlighted in the original
spreadsheet using different cell background colours. Hence, instead of looking at the
whole spreadsheet at once, the user focuses his/her attention on one highlighted
logical area at a time. The spreadsheet comprehension process is therefore properly
guided since the focus area matches what the user might perceive to be a logical
unit.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 The end-user programming paradigm . . . . . . . . . . . . . . 1
1.1.2 Challenges in end-user programming . . . . . . . . . . . . . . 3
1.1.3 Popularity of spreadsheet systems . . . . . . . . . . . . . . . . 5
1.1.4 Importance of spreadsheets . . . . . . . . . . . . . . . . . . . 7
1.1.5 Impact of errors in spreadsheets . . . . . . . . . . . . . . . . . 8
1.1.6 Classification of errors in spreadsheets . . . . . . . . . . . . . 9
1.2 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Objectives of our research . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Overview of the rest of the Dissertation . . . . . . . . . . . . . . . . . 18
2 Related Work 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Spreadsheet error prevention techniques . . . . . . . . . . . . . . . . . 20
2.3 Spreadsheet error detection techniques . . . . . . . . . . . . . . . . . 22
2.4 Spreadsheet visualization techniques . . . . . . . . . . . . . . . . . . 25
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Graph-based Visualization 38
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 The need for graph clustering . . . . . . . . . . . . . . . . . . . . . . 39
3.3 An overview of clustering algorithms . . . . . . . . . . . . . . . . . . 40
3.3.1 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 Construction algorithms . . . . . . . . . . . . . . . . . . . . . 43
3.3.3 Hierarchical algorithms . . . . . . . . . . . . . . . . . . . . . . 43
3.3.4 Graph theoretical algorithms . . . . . . . . . . . . . . . . . . . 44
3.4 Choice of clustering algorithm . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 An overview of the MCL algorithm . . . . . . . . . . . . . . . 45
3.5 Choice of graph drawing software . . . . . . . . . . . . . . . . . . . . 48
3.5.1 Experiments with the ZGRViewer graph drawing software . . 49
3.5.2 Experiments with the Graphael graph drawing software . . . . 51
5.4 Summary of experiment results . . . . . . . . . . . . . . . . . . . . . 80
6 Implementation 81
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Software architecture of the visualization tool . . . . . . . . . . . . . 81
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7 Discussion 87
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2 Spreadsheet understanding and comprehension . . . . . . . . . . . . . 87
7.3 The spreadsheet debugging process . . . . . . . . . . . . . . . . . . . 88
7.4 Spreadsheet maintenance . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.5 Addressing HCI aspects . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8 Conclusion 93
8.1 A summary of the research work . . . . . . . . . . . . . . . . . . . . . 93
8.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Bibliography 102
Glossary 103
Appendix A 106
Appendix B 107
List of Figures
4.1 The sample Project Accounting spreadsheet . . . . . . . . . . . . . . 53
4.2 Formula view of the Project Accounting spreadsheet. . . . . . . . . . 53
4.3 An illustration of a cluster tree . . . . . . . . . . . . . . . . . . . . . 54
4.4 A top-most level view of the cluster tree of the Project Accounting
spreadsheet data-flow graph as displayed using Graphael. . . . . . . . 54
4.5 Second level view of the cluster tree. . . . . . . . . . . . . . . . . . . 56
4.6 An MCL cluster containing cells D6, F6, G6, H6 . . . . . . . . . . . . 56
4.7 An MCL cluster containing cells F10, G10 and H10 . . . . . . . . . . 57
4.8 Treemap and cluster tree with Γ = 1.1 . . . . . . . . . . . . . . . . . 58
4.9 Treemap and cluster tree with Γ = 1.5 . . . . . . . . . . . . . . . . . 59
4.10 Treemap and cluster tree with Γ = 2.0 . . . . . . . . . . . . . . . . . 60
4.11 Treemap and cluster tree with Γ = 2.5 . . . . . . . . . . . . . . . . . 61
4.12 Treemap and cluster tree with Γ = 3.0 . . . . . . . . . . . . . . . . . 62
4.13 Treemap and cluster tree with Γ = 5.0 . . . . . . . . . . . . . . . . . 62
4.14 Treemap and cluster tree with Γ = 7.0 . . . . . . . . . . . . . . . . . 63
4.15 The Project Accounting spreadsheet showing highlighted MCL clus-
ters (when Γ = 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.16 The formula view of the Project Accounting spreadsheet with high-
lighted MCL clusters (when Γ = 2) . . . . . . . . . . . . . . . . . . . 64
4.17 The Consolidated Balance Sheet spreadsheet from the EUSES spread-
sheet corpus [25] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.18 The formula view of the Consolidated Balance Sheet spreadsheet . . . 67
4.19 A treemap and cluster tree for the Consolidated Balance Sheet de-
picting a cluster with cell members, F34, F35, F36, F37, F38, F39
and F40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.20 The Consolidated Balance Sheet with highlighted (shaded) MCL clus-
ters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.21 Formula view of the Consolidated Balance Sheet with highlighted
(shaded) MCL clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 The formula view of the Project Accounting spreadsheet. . . . . . . . 72
5.3 Microsoft Excel displays an error message for a cell in MCL cluster
number 5 in the Project Accounting spreadsheet. . . . . . . . . . . . 73
5.4 A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet
research website [43] . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 The IPO spreadsheet with highlighted MCL clusters. . . . . . . . . . 77
5.6 The formula view of the IPO spreadsheet . . . . . . . . . . . . . . . . 77
5.7 IPO spreadsheet with a Microsoft Excel warning message . . . . . . . 78
List of Tables
4.1 MCL clusters for the Project Accounting spreadsheet with Γ = 1.1 . . 58
4.2 MCL clusters for the Project Accounting spreadsheet with Γ = 1.5 . . 59
4.3 MCL clusters for the Project Accounting spreadsheet with Γ = 2.0 . 60
4.4 MCL clusters for the Consolidated Balance Sheet spreadsheet . . . . 68
List of Algorithms
Chapter 1
Introduction
1.1 Background
1.1.1 The end-user programming paradigm
Computer end-users may be defined as people for whom conventional computer pro-
gramming is not their main job although they use computers as part of their daily
lives [10]. However, it is now commonplace to see computer end-users (hereafter
animations, web applications, simulations, just to mention but a few. Although end-
users are not professional programmers they might be experts in their professional
domains. Some of these end users are educators, scientists, engineers, business pro-
spreadsheet for recording student grades for a particular course. End-users may also
lem in comparison to manually solving the problem. For example, a mathematician
may write some program code using a mathematical software application to find a
solution to a complex differential equation. In all these cases, their main goal would
program code [37]. Pre-packaged software applications may not be suitable in these
individual and worse still, they cannot be customized to every individual’s needs
[40]. This need has led to the birth of the end-user programming paradigm.
The growing popularity of the end-user programming paradigm can be
attributed to the tools that have been developed to empower this kind of computer
user. For example, the development of the spreadsheet paradigm has led to many
users developing their own spreadsheets, hence doing some “programming”. An end-
• It should provide rapid feedback to the user as he/she works in the environment
model without violating the intended computational semantics
Statistically, it was estimated that, in the year 2005, there were 55 million end-user
programmers in the United States alone. This was about 20 times greater than the
estimated number of professional programmers [14, 55, 67]. These estimates clearly
indicate that a sizable amount of software produced in the whole world is developed
as their primary job function but rather to support their quest for achieving their
main goal such as accounting, doing office work, developing a web page, etc [40]
users are very prone to errors. This is because the programs are not developed
let alone try to learn the formal syntax and semantics of a particular programming
• Design barriers: the end-user programmer might not know what he/she wants
• Selection barriers: the end-user programmer might know what he/she wants
the computer to do but does not know how to choose an appropriate tool for
the task
tools for a particular task but he/she does not know how to make the tools
• Use barriers: the end-user programmer might know what tools to use for a
particular task but does not know how to use those tools
knows how to use a particular tool but unfortunately the tool does not do
• Information barriers: the end-user programmer might think that they know
Another major challenge in end-user programming is the reality that end-user needs
vary so widely that one cannot come up with general design tools and languages
that can fit every end-user programmer’s needs [37]. It is also a major challenge
to make users understand the importance of the programs they develop [37]. This
is particularly true for non-trivial programs that have long life spans and
might therefore need long-term maintenance. A case in point is spreadsheets
that are not simple throw-away calculations but are continuously evolving as
how to develop tools that can capture the evolving program’s history and design [37].
In trying to address some of the challenges outlined above, some researchers have
ment environments which can help end-users achieve their goals through the use of
metaphors such as forms and spreadsheets [56]. Coupled with the statistics on the
number of end user programmers, it is easy to see the need for more research in
end-user programming.
used for trivial as well as non-trivial applications in private and public enterprises
[9, 45, 67]. They are used for a variety of important tasks such as mathematical
analysis and decision making. The provision of computational techniques that match
user’s tasks makes spreadsheet programming easier. There is also a trend in the
systems are widely used by end-user programmers not only due to their simplicity
but also due to their features which facilitate programming. The suppression of low-level
details of programming, the immediate visual feedback and the availability of
high-level task-specific functions are among the most commonly cited features
[35].
Spreadsheet systems allow computations to be defined by cells and their formu-
las. A cell’s value is defined solely by the formula explicitly given to it by the user
provide for copying of contiguous regions of cells from one physical area to another.
References between the cells may be either absolute or relative in either their hor-
izontal or vertical index. All copies of an absolute reference will refer to the same
row, column or cell whereas a relative reference refers to a cell with a given offset
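The behaviour of absolute and relative references under copying can be illustrated with a small sketch. The helper names (`copy_formula`, `col_to_num`, `num_to_col`) and the use of the A1-style `$` notation are assumptions made for illustration and are not taken from this dissertation. The regular expression is deliberately simple and would, for instance, mistake a function name such as LOG10 for a cell reference.

```python
import re

# Matches A1-style references such as B2, $B2, B$2 or $B$2.
REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def col_to_num(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def num_to_col(n):
    """Convert a 1-based column index back to a column label."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def copy_formula(formula, row_offset, col_offset):
    """Return the formula text as it would read after being copied by the given offsets."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:  # relative column: moves with the copy
            col = num_to_col(col_to_num(col) + col_offset)
        if not row_abs:  # relative row: moves with the copy
            row = str(int(row) + row_offset)
        return f"{col_abs}{col}{row_abs}{row}"
    return REF.sub(shift, formula)


# Copying one row down: B2 is relative and shifts, $C$1 is absolute and does not.
print(copy_formula("=B2*$C$1", 1, 0))  # prints =B3*$C$1
```

A mixed reference such as `$A1` keeps its column but shifts its row, which is the case of a reference that is absolute in only one of its two indices.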
[11]:
• Spreadsheets are usually associated with first-order functions only. Other tra-
a first-order function, the arguments are objects like numbers and class objects
The spreadsheet paradigm also differs from the procedural programming paradigm
in several ways:
• Spreadsheet programs are modeless in the sense that they do not require the
user to separately code, compile, link, and execute the spreadsheet program
ple, when a formula for a particular cell changes, the results are immediately
reflected [52].
tabular layout while the code for procedural programs is represented in a linear
fashion [6].
• From the point of view of a user, a spreadsheet program does not have clear
separation between input, computational code and output. This is not the
Despite the fact that some spreadsheet programs (hereafter spreadsheet program
calculations, many spreadsheets have been quite useful for business as well as personal
endeavours [67]. There are some large periodically used spreadsheets that are
ware [16]. This shows that end-user programming, with spreadsheet programming
as an example, cannot be regarded as a trivial subject.
Panko also noted that another study found out that information generated from
This shows how critical non-trivial spreadsheets can be to the running of a business
enterprise. Errors in spreadsheets may therefore lead to erroneous decision making.
Spreadsheets have also been used in science and engineering disciplines such as
physics and chemistry, just to mention a few, because they are more usable than
procedural programs [16]. Another reason for spreadsheet usage in science and engi-
neering is the fact that spreadsheets already incorporate a way of displaying graphs
Errors in spreadsheet programs are non-trivial and costly [27, 45]. Despite this ob-
servation, there has not been quantitative data on the impact of spreadsheet errors.
For example, it is documented on the website that in 2004, some city officials, in one
of the cities in the United States, miscalculated the amount of sales taxes generated
at one of the city’s parks during the first couple of months of its operation. The mistake
inflated the figures by tens of thousands of dollars, which in turn meant the total
sales estimates were overblown by millions of dollars. The mistake was attributed
It is also documented that some candidates for police officer jobs were told that
they had passed an admission test when in fact they had failed. The reason for
this mishap was that the spreadsheet which the examiners had used to record the
It is also documented that mis-stated earnings caused the stock price of
an online retailer to fall by 25 percent in a day and the Chief Executive Officer had
input in a spreadsheet was the cause of the mis-statement. These are just some
of the stories that underscore the fact that spreadsheet errors are non-trivial and
costly.
Data from spreadsheet field studies and laboratory experiments indicate that errors
in spreadsheets are virtually unavoidable. Panko [45] has tabulated data indicating error
rates in spreadsheets as reported by the authors of various field audits and laboratory
experiments. The most important result of these studies is that spreadsheet
error rates are high enough to tell us that most non-trivial spreadsheets will contain
errors.
Several classification schemes have been identified to categorize these errors depend-
ing on the context in which a researcher is doing the analysis [7]. Panko [45] identified
three categories of spreadsheet errors namely mechanical, logical and omission er-
rors. Mechanical errors are simple slips such as mistyping a number or pointing to a
wrong cell when entering a formula. Logical errors are defined as errors that occur
when a spreadsheet developer has a wrong algorithm for a particular formula cell.
On the other hand, omission errors are defined as errors that occur when a spread-
sheet developer does not have complete understanding of the problem at hand and
a spreadsheet. On the other hand, qualitative errors emanate from factors such as
poor spreadsheet design which may later cause problems in data entry or even lead
to incorrect data modifications and hence generate quantitative errors. This scheme
further categorizes quantitative errors into mechanical, omission and logical errors
There is another spreadsheet error classification scheme proposed by Ayalew et al.
[7]. Unlike the other classification schemes given above, this scheme categorizes
errors not by their cause but by the spreadsheet concept with which the errors seem
to be associated. It has three categories of errors, namely physical
area related errors, logical area related errors and general errors.
Physical area related errors are defined as those errors that normally deal with missing
values in a physical area or values of the wrong type somewhere in the physical
area. This kind of error leads to several side-effects, such as affecting the results
if new values are added to the area. According to this classification scheme, physical
area related errors include what are termed “reference to a blank cell/reference to
a cell with value of wrong type” errors, “incorrect physical area specification” errors,
A logical area is defined as an area that represents some kind of cohesion between
cells. It usually originates from copying from the same source multiple times. Ex-
amples of logical area errors include overwriting a formula with a constant value and
General errors have been defined as those errors that are not explicitly associated
with a physical or logical area and are usually made during formula definition. An
error might occur due to typographical errors or an inability to formulate the necessary
mathematical expression for a formula. An error might also occur due to incorrect
Despite the simplicity in creating spreadsheets, they are generally difficult to under-
stand and comprehend [17]. The need to understand a spreadsheet may arise if one
Most spreadsheets are created by end-users and they contain errors which the de-
velopers themselves may not easily notice [45]. Unfortunately, most spreadsheet
errors are not trivial considering the fact that key decisions, for example in business
firms, are based on information extracted from spreadsheets [27, 45]. Therefore, it is
important to help spreadsheet developers expose these errors or even prevent them
of the spreadsheet may be necessary since spreadsheets may need to be maintained just
like any conventionally evolving application software [16]. However, for one to make
arrangement of numeric values with some accompanying explanatory text. Usually
this does not suffice for a third party to clearly comprehend and understand what
lated mainly with numerical values although every spreadsheet has a formula view
as well as an underlying data-flow graph [34] (see illustration in Fig. 1.1). A data-
“hidden” from the spreadsheet developer. It is therefore not surprising that most
spreadsheet developers view a spreadsheet as a word processor for numbers and not
necessarily as the complex data-flow graph that a spreadsheet really is [15]. Despite
clarify the data dependencies among the cells [17]. In other words, visualizing and
clarifying the inherent data-flow graph can help users understand a spreadsheet as
well as aid in the spreadsheet debugging process. This is so, because human beings
process and understand visual representations of data much faster and in a more
Like the numerical view of a spreadsheet, the formula view of a spreadsheet also
has some disadvantages. For example, the formulas which compute the values of
cells are hidden. It is possible to see either the formulas or the values but not both
at the same time. For a single cell, it is possible to see both at the same time but
this does not give much information about the overall structure of the spreadsheet.
In some cases, this locality to a single cell may help by narrowing the point of focus
instead of dealing with the spreadsheet as a whole, but it also makes it difficult to get a sense
of the general structure of the whole spreadsheet [32, 42]. As a result, it is difficult
to identify where data comes from and where it goes unless one makes a detailed
Therefore, it is against this background that this research work was embarked on
with the aim of developing a tool for visualizing spreadsheet data-flow graphs that
debugging.
(i) We want to generate the data-flow graph of a given spreadsheet with nodes rep-
cells which can make the visualization (the generated data-flow graph) to be
since normally the number of nodes and edges in the generated graph becomes
(ii) We would like to deal with the problem of visualizing large graphs through
the generated clusters shall also be an important aspect of this work. More
match with logical areas of the given spreadsheet. A logical area in a spread-
sheet may be defined as a group of cells in a spreadsheet that from the spread-
sheet creator/user perspective form a logical unit due to the semantics of the
spreadsheet.
(iii) We would like to separate the graph-based visualization from the spreadsheet so
(iv) We would like to generate our visualization dynamically so that we are able to
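As an illustration of objective (i), the following minimal sketch derives a data-flow graph from cell formulas, with one node per cell and one edge per reference. It is not the tool developed in this work: the function names, the dictionary representation of a spreadsheet and the regular expression are assumptions made for the example, and cell ranges such as A1:A3 are not expanded into their member cells.

```python
import re

# Matches A1-style references, ignoring any $ markers for absolute references.
CELL_REF = re.compile(r"\$?([A-Z]+)\$?(\d+)")


def build_dataflow_graph(cells):
    """Map each cell address to the set of cells that its formula reads from."""
    graph = {}
    for address, content in cells.items():
        refs = set()
        # Only strings starting with "=" are treated as formulas.
        if isinstance(content, str) and content.startswith("="):
            for col, row in CELL_REF.findall(content):
                refs.add(f"{col}{row}")
        graph[address] = refs
    return graph


def edges(graph):
    """List edges (source, target): data flows from a referenced cell to the formula cell."""
    return sorted((src, dst) for dst, srcs in graph.items() for src in srcs)


# A hypothetical four-cell spreadsheet.
sheet = {"A1": 10, "A2": 20, "A3": "=A1+A2", "B1": "=A3*2"}
print(edges(build_dataflow_graph(sheet)))
# prints [('A1', 'A3'), ('A2', 'A3'), ('A3', 'B1')]
```

Even this small example shows why clustering is needed: every formula cell contributes one edge per referenced cell, so the number of edges grows quickly with the size of the spreadsheet.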
1.4 Research Methodology
the context of this research work. An experiment shall involve the running of a
computer program multiple times while varying either program inputs or program
parameters and observing the program outcomes. Based on the observation of the
tions in experiments are very important because they can lead to new useful and
unexpected insights that can also open new areas of investigation [59]. We use
different spreadsheets
algorithm.
phenomenon within its real-life context [58]. In software engineering, case studies
are useful for the industrial evaluation of different software engineering tools and
methods [58]. For example, different software tools may be evaluated on how their
features may be suitable in accomplishing a particular task. Hence to avoid bias and
to ensure internal validity, a valid basis is identified to assess the results of the case
study [58]. However, case studies have the disadvantage that their results may not be
generalized easily [59]. In this work, we use spreadsheets sourced from the EUSES
Spreadsheet Corpus [25] and the Spreadsheet Research website [43]. This is because
firms.
a glance. Prototyping may also offer a demonstration that theoretical ideas can be
put into a “real-life” software tool or product. In other words, prototypes provide
proof-of-concepts and they may also provide incentives to study a research question
further [59]. However, it is important to note that prototypes do not provide solid
Java based graph drawing software. Programming in the Microsoft Excel spread-
sheet system is done in Visual Basic for Applications (VBA). We also modify the
source code of the graph drawing software to suit the requirements of our applica-
tion.
1.5 Overview of the rest of the Dissertation
related research works by other researchers in this research area. Our graph-based
the MCL algorithm on spreadsheets using the Graphael graph drawing software is
given in Chapter 4. We demonstrate how clusters identified using the MCL algo-
tool is presented in Chapter 6. A discussion of the results from this research work
as well as a discussion of some issues that emerged from this research work is given
Chapter 2
Related Work
2.1 Introduction
Considering the importance of spreadsheets, several research works have been un-
This growing research direction is being embodied in a new and growing discipline
known as end-user software engineering [13, 39, 51, 56]. Some of the research ques-
• How can software engineering life cycle models be used in spreadsheet devel-
opment?
• How can improved programming practices such as teamwork and code inspec-
tion help in creating error-free spreadsheets? Some work in this area includes
that of Panko and Sprague [44, 46] which explored the benefits of code
• Development of tools and techniques that can help in testing, debugging and
using “interval testing” and slicing [8], using type inference to identify pro-
Several research endeavours have already reported on techniques that can be used
to prevent errors from happening in spreadsheets. The rationale for this research
path is that it is easier to prevent errors in spreadsheets than to correct them.
of preventing errors in spreadsheets. The basis of this proposal was that a lack
flow diagrams would help the designer structure the spreadsheet solution model to
a problem. Spreadsheet flow diagrams could also assist in communicating the struc-
ture of a spreadsheet model to others and they could also serve as a documentation
Some researchers have also proposed data control techniques as one way of preventing
errors from occurring in spreadsheets (e.g. Panko [45]). Some proposed data
• protection of cells and worksheets from unauthorized use. For example, cell
protection can allow users to change only pre-specified input cells so that if a
user attempts to “hardwire” a formula cell, they will be prevented from doing
so. In hardwiring a formula cell, a user moves the cursor to a formula cell and
enters a number in it. This usually happens when a user does not realize that
the cell is a formula cell and thinks that they should just enter a value
in the cell.
• provision of data entry validation through the re-keying of input data. This
method is also used in traditional data processing, where it is called data verification.
This method easily prevents errors from occurring since it is easy to
check if two input areas are the same and if not, it is also easy to determine
Erwig et al. [23] developed a system called Gencel in which spreadsheet templates
using the Visual Template Specification Language (ViTSL) are used to generate
spreadsheets which are free from reference, range or type errors. With this technique,
spreadsheet templates are created and verified by domain experts and later on can
the template. This concept was extended to include the automatic generation of
terval testing” and slicing. In this technique, each formula cell has a user-specified
value interval and a system-generated value interval. When the user-specified in-
terval and the system-generated interval for a cell do not agree with the actual
fault tracing strategy is then used to identify the most influential faulty cell from
the cells perceived by the system to contain faults. This is based on the number of
Rothermel et al. [50] also developed a spreadsheet testing methodology which they
termed “What You See Is What You Test” (WYSIWYT) to help users test spread-
sheets. Since testing and debugging are closely interrelated, we find it worthwhile
and coverage criteria to give the user feedback on how well tested a spreadsheet
is. The WYSIWYT testing methodology has been integrated with another spread-
sheet testing technique known as the “Help Me Test” (HMT) [24] technique into the
test cases for the user as he/she actively works on the spreadsheet. Forms/3 is a
The Forms/3 spreadsheet language also allows users to define assertions on the ex-
pected cell values [12]. To promote the usage of assertions by end-user programmers,
Randolph et al. [48] developed a spreadsheet verification tool based on the WYSI-
WYT methodology. Their main emphasis was to use the WYSIWYT methodology
Abraham and Erwig [2] developed an automated reasoning system for spreadsheets
called UCheck. UCheck infers header unit information for cells in a spreadsheet.
Based on the header unit information, the system identifies cells in the spreadsheet
that contain erroneous formulas. They extended the UCheck system to produce a
system known as UFix [4] in order to improve on the way error messages are re-
ported to users hence improving the spreadsheet debugging process. Abraham and
Erwig also developed a type system and a type inference algorithm for spreadsheets
Abraham and Erwig [3] also developed a spreadsheet debugger known as GoalDebug
to mark cells with incorrect outputs and specify the expected output. The GoalDebug
system then generates a list of change suggestions, any one of which when
applied would result in the expected output being computed in the marked cell. The
generated change suggestions are ranked based on a set of heuristics before being
presented to the user. The generated change suggestions can be automatically applied,
hence eliminating errors that can be introduced by end users through
test spreadsheets [14]. This technique has also been used to test other end-user de-
A metamorphic relation is any relation among program inputs and the outcomes of
the target program using isomorphic test cases are supposed to match, otherwise
the tested program is at fault. A good metamorphic relation can be identified eas-
ily by a program tester who has black-box knowledge of the problem domain and
Various spreadsheet visualization tools have also been proposed for different pur-
these spreadsheet visualization tools are based on the data-flow graph behind the
modelled as objects and connections between those objects [60]. In a graph, the
objects are represented by nodes, and edges are used to represent connections
(relationships) between those objects. Automatic generation of graph drawings has been
• the World Wide Web: visualization of site maps and construction of browsing
Our research work focusses on the visualization of spreadsheet structures through
either get precedents or dependents of a particular cell. Arrows are then drawn
linking the precedents or dependents to the selected cell. Fig. 2.1 shows a Microsoft
Excel spreadsheet with arrows depicting the data-flow graph as generated by the
tracer tool. The formula view of the spreadsheet is given in Fig. 2.2. One problem
with this tool is that one cannot get the data-flow graph for the whole
spreadsheet in a single request. Therefore one cannot have a global view of the
overall data-flow graph in a single step.
Figure 2.1: A Microsoft Excel spreadsheet with data-flow graph arrows. Sourced
from [53].
Another major drawback with this kind of
This clutters the spreadsheet view and as a result reduces readability and compre-
Figure 2.2: Formula view of the Microsoft Excel spreadsheet depicted in Fig. 2.1.
Davis [17] produced two spreadsheet visualization tools: the arrow tool and the
online data dependency diagram. The arrow tool is similar to the earlier versions
of the Microsoft Excel (MS Excel 97) precedents/dependents tracer tool with the
exception that the arrow tool coloured precedent and dependent cells in addition
spreadsheet display hence bringing in problems associated with the Microsoft Ex-
flow-chart like diagrams (see Fig. 2.3). Distinctive symbols are used to represent
parameters of formulas. Arrows are used to show data dependencies amongst the
the spreadsheet display as in the other tools explained in the preceding paragraphs.
Instead, the tool displays the spreadsheet in a window on one side of the screen and
the diagram in a separate window on the other side as in Fig. 2.3. However, it has to
be noted that the visualization is statically generated. As a result, the tool’s author
as a practical spreadsheet auditing tool because one could produce it when needed.
Figure 2.3: A spreadsheet with its corresponding online data dependency diagram
Davis continues to state that the visualization was statically generated because, at
the time the visualization was proposed, there were no sufficiently good graph drawing
algorithms. This is no longer the case, and we would therefore like to exploit
the availability of such robust graph drawing algorithms for automatic (dynamic)
On a related note, Vemuri et al. [64] conducted an experimental study on the use-
their study did not conclude that online data-dependency diagrams were useful, their
Figure 2.4: An animated presentation of fluid-like flow of data in a spreadsheet by
Igarashi et al.
Igarashi et al. [34] also developed a visualization tool that depicts a fluid-like flow
visualization tool is the visualization of the hidden data-flow structure behind the
tabular layout of a spreadsheet. Transient local views are used to visualize data-flow
structures associated with individual cells while it is possible to view the data-flow
structure of the entire spreadsheet at once. A user is also able to navigate through
the data flow structure interactively and it is possible to construct formulas using
graphical editing techniques, hence the provision of visual editing. However, the main
drawback of the tool is that it does not scale: there is noticeable degradation in
performance on spreadsheets containing more than 400 used cells. This limits the
application of the technique to larger spreadsheets.
logical areas or semantic units in a spreadsheet are highlighted and data-flow be-
tool is given in Fig. 2.5 and the corresponding formula view of the spreadsheet is given
Highlighted areas in the visualization describe the plan structure of the spreadsheets
and deviations from this structure show clearly in the visualization hence helping in
the spreadsheet debugging process. Both tools have the disadvantage that they are
single step.
On the same line, Sajaniemi further suggested that spreadsheet visualization tools
Figure 2.6: A formula view of the spreadsheet given in Fig. 2.5.
display clutters the view of the spreadsheet. Therefore our work shall avoid
like to use tools that require less user intervention. Our visualization tool shall
Ayalew et al. [7] proposed a graphical spreadsheet visualization model that is not
only based on a data-flow graph but also on visualizing logical and physical areas
proposed that the visualization should allow zooming into specific areas of the gen-
erated graph without losing the global view of the graph using fisheye views. Their
• shortening the trial and error process to develop solutions for real-world prob-
based on model properties such as data-flow, physical and logical areas and
spreadsheet into logical areas known as equivalence classes. The equivalence classes
are then highlighted in the original spreadsheet as in Fig. 2.7. The toolkit has three
components:
• A dependency viewer that displays the data flow graph between the dependencies
of the cells that are in the equivalence class that is currently highlighted
lighting the cells that are in the equivalence class (logical area) that is currently
selected in the structure browser.
With large spreadsheets (e.g. having more than 5000 used cells), the number of
equivalence classes becomes too large and hence they devised a further abstraction
generated graph and data-flow between cells in different semantic classes is repre-
sented by directed edges as in Fig. 2.8. These graphs are not dynamically generated
since information about a spreadsheet (e.g. cell dependencies) is extracted and pro-
Ballinger et al. [9] developed a spreadsheet visualization tool that would first stat-
ically extract artefacts from spreadsheets and then convert this information into
flow diagram is depicted in Fig. 2.9. Ballinger et al. also used a hyperbolic viewer
to view the generated spreadsheet data-flow graphs in an attempt to deal with the
Figure 2.8: A data-flow graph of semantic classes as proposed by Clermont et al.
problem of cluttering in graphs with a large number of nodes and edges (see Fig.
2.10). Unfortunately, hyperbolic viewing does not provide for views in which the
current view displays nodes which match with logical areas in the corresponding
spreadsheet. We want to produce graph views which match with logical areas in the
spreadsheet.
2.5 Summary
The aforementioned spreadsheet visualization tools and techniques indeed offer very
useful insights about data-flow as well as data patterns in spreadsheets which would
not have been possible by just analysing the “data value” view of a given spreadsheet.
Figure 2.9: A sample spreadsheet data-flow graph by Ballinger et al.
However, as already pointed out, there are some drawbacks with the aforementioned
approaches. For example, in some of the approaches (e.g. the Microsoft
precedents/dependents tracer tool, the arrow tool [17] and the S2 and S3 [54] vi-
sualization), the generated arrows with highlighted areas are superimposed on the
spreadsheet display which introduces cluttering on the display. In the other ap-
large spreadsheets (hence large graphs) is not adequately handled. For instance,
the fluid visualization tool of Igarashi et al. [34] can only handle spreadsheets with
not more than 400 used cells. Hyperbolic viewing of spreadsheet data-flow graphs
as in the work of Ballinger et al. [9] has the problem that the viewing context
generated by the fisheye views employed does not necessarily match with logical
areas in the corresponding spreadsheet. Another drawback with some of the ap-
proaches is that the visualizations produced are statically generated hence cannot
work of Clermont et al. [16, 38] and that of Ballinger et al. [9] where information about a
spreadsheet (e.g. cell dependencies) is extracted and processed separately from the
spreadsheet. Online data dependency diagrams as proposed by Davis [17] are also
To sum up, we can identify the following problems with the aforementioned spread-
• Some of the approaches do not adequately handle the visualization of large
spreadsheets
Chapter 3
Graph-based Visualization
3.1 Introduction
A graph consists of two finite sets, V and E. Each element of V is called a vertex
or a node. The elements of set E are called edges and these are unordered pairs
and edges can be identified as relations between the objects [29]. In our approach,
cells and the set of edges represents the dependencies between spreadsheet cells as
Using different graph drawing techniques, one can generate the data-flow graph
of any given spreadsheet. However, the graph becomes difficult to comprehend and
navigate because of the clutter that arises from the large number of nodes.
Grouping the nodes into clusters is a viable solution to this problem.
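As an illustration only, the mapping from spreadsheet formulas to a data-flow graph can be sketched in Python as follows. The sketch is not taken from any of the tools discussed in this work: the function name build_dataflow_graph and the regular expression are assumptions made for this example, and only simple A1-style references are recognised (no ranges, sheet names or function names that end in digits). The two formulas used are ones that appear in spreadsheets discussed later in this work.

```python
import re

# Assumption: references look like I13 or $E$14; ranges and sheet names are ignored.
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")


def build_dataflow_graph(formulas):
    """Return (nodes, edges); an edge (p, c) means that cell c reads cell p."""
    nodes, edges = set(), set()
    for cell, formula in formulas.items():
        nodes.add(cell)
        for col, row in CELL_REF.findall(formula):
            precedent = col + row
            nodes.add(precedent)
            edges.add((precedent, cell))
    return nodes, edges


# Hypothetical input format: a mapping from cell address to formula text.
formulas = {"I14": "=I13+E14-F14", "B20": "=B6*B17"}
nodes, edges = build_dataflow_graph(formulas)
for precedent, dependent in sorted(edges):
    print(precedent, "->", dependent)
```

Each formula cell becomes a node with one incoming edge per referenced cell, so the number of nodes and edges grows quickly with the size of the spreadsheet, which is the source of the clutter described above.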
3.2 The need for graph clustering
Consider the spreadsheet given in Fig. 3.1 whose formula view and corresponding
data-flow graph are given in Fig. 3.2 and Fig. 3.3 respectively. The spreadsheet is
used to track income and expenditure on several projects being run by a company.
It is worth noting that the data-flow graph of the spreadsheet is very difficult
to comprehend: readability and navigation are problematic, to mention just
a few issues. This is a general problem of visualizing large graphs [1, 28] since large
graphs contain a large number of nodes. As already stated above, graph clustering
at a time.
The process of coming up with clusters is known as cluster analysis and it can
be broken down into a series of steps [29]. However, when applying a cluster analysis
• When are two entities said to be similar? This is the classification criteria that
• What is the basis for evaluating the classification criteria? This is important
ters which are “natural” as much as possible.
In our case, we need a clustering algorithm that would find clusters that corre-
spond to the logical areas in the given spreadsheet. In other words, the clustering
There are many clustering algorithms. However, clustering algorithms can roughly
• optimization algorithms
• construction algorithms
• hierarchical algorithms
Figure 3.2: The formula view of the Project Accounting spreadsheet
Figure 3.3: A data flow graph of the given Project Accounting spreadsheet generated
by the Graphael graph drawing software.
This categorization is not exhaustive as there are some algorithms which might not
fall in the listed categories. There are also other algorithms which are a hybrid of
other algorithms. However, it is important to note that the algorithms may either
split into non-overlapping subsets. Thus, every element of the sample is attached to
exactly one cluster. On the other hand, for overlapping clusters, a sample of objects
vised algorithms are provided a priori knowledge while this is not the case with
at the optimal value of the quality function. The main drawback with optimiza-
tion algorithms is the computation time needed to find an optimum of the quality
function.
3.3.2 Construction algorithms
clusters. These representative elements are called kernels. Using an iterative process,
all elements which are geometrically nearest to each kernel are attached to the
kernel’s group. The process stops if the clusters become too heterogeneous. During
the iterative process, elements which get closer to the geometric centre of a new
group than to the centre of the group they previously have been attached to are
drogram) whereby each level in the hierarchy contains the same clusters as the first
lower level except for two clusters which are joined to form one cluster. Hierarchical
clusters which are at a lower level in the hierarchy. The starting points are
single-membered clusters which are at the lowest level of the hierarchy. On the other hand,
the clustering process by having all entities contained in one cluster. Thereafter,
in each iterative step, a cluster is split into two clusters until the lowest level of
Graph theoretical algorithms work on graphs whereby nodes represent entities and
edges represent entity relations. These algorithms do not start from the individual
nodes but they try to find subgraphs which will form clusters. Examples of sub-
graphs include connected components and spanning trees. The algorithms used to
find these subgraphs are based on graph theory. Some graph theoretical algorithms
reduce the number of nodes in a graph by merging them into aggregate nodes which
can be interpreted as nodes or can be used as input for a new iteration resulting in
In our case, we need a clustering algorithm that would find clusters that correspond
form a logical unit due to the semantics of the spreadsheet [35, 36]. The seman-
tics of a spreadsheet define what the spreadsheet is all about (the meaning of the
the spreadsheet data-flow graph.
Based on our experiments, we found out that the Markov Clustering (MCL) algo-
rithm [61, 62] finds “natural” clusters in spreadsheet data-flow graphs. We present
in a graph are characterised by the presence of many edges between the members
of that cluster and one expects that random walks on the graph will infrequently
go from one natural cluster to another [61, 62]. Due to its ability to find natural
clusters, the MCL algorithm has also been used in many advanced applications. For
example, the algorithm has been reliably used for the assignment of proteins into
graph. A column stochastic matrix is a matrix whose column vectors are probabili-
The first step of the algorithm is to associate a given input graph with some column
stochastic matrix, M , such that entry Mij will indicate the probability of moving
from node j to node i in the input graph (note that we start columnwise). Then two
operations known as expansion and inflation are performed iteratively starting with
the associated stochastic matrix thus simulating random walks through the input
graph.
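The association of a graph with a column stochastic matrix can be illustrated with a small Python sketch. This is a minimal example under stated assumptions rather than the construction used in any particular MCL implementation: the function name column_stochastic and the three-node path graph are hypothetical, edges are unweighted, and every node is assumed to have at least one edge so that no column sums to zero.

```python
def column_stochastic(adjacency):
    """Normalise each column of a square adjacency matrix so that it sums to 1."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [[adjacency[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical undirected path graph 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
M = column_stochastic(adjacency)

# M[i][j] is the probability of moving from node j to node i.
print(M[1][0])  # node 0 has a single neighbour, node 1
print(M[0][1])  # node 1 has two neighbours, nodes 0 and 2
```

Reading the matrix column by column gives, for each node j, the probability distribution over the nodes that a random walker standing on j can move to next.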
An expansion operation is carried out by taking the power of the associated stochas-
tic matrix using the normal matrix product. An inflation operation involves tak-
ing the Hadamard power of the matrix result from the expansion operation. The
Hadamard power of a matrix is computed by raising each matrix entry to the given power.
The Hadamard power of the matrix is specified using what is known as the inflation
we have a stochastic matrix again. Expansion and inflation are then
repeated jointly and iteratively. The iterative process is stopped after we get
Expansion computes random walks of higher length paths. That is, given any pair
having a higher length path between the two nodes. But since we have more higher
length paths within clusters than between different clusters, node pairs in the same
cluster will have large probabilities since there are so many ways of going from one
node to the other. The probabilities of random walks with higher length paths are
further boosted by applying the inflation operation. Thus inflation boosts the
probabilities of intra-cluster walks and demotes inter-cluster walks. Intra-cluster walks are
The process of jointly iterating expansion and inflation results in a very sparse
stochastic matrix which is interpreted as the separation of the input graph into dif-
graphical representation of the MCL cluster separation process is given in Fig. 3.4.
Figure 3.4: An example MCL cluster separation process from van Dongen [61].
• the number of higher-length paths between two arbitrary nodes in the cluster
is larger than between an arbitrary pair of nodes from different clusters
• if one takes a random walk through a dense cluster then the random walker
will likely not leave the cluster until many of its nodes have been visited.
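The interplay of expansion and inflation described above can be sketched in plain Python. This is an illustrative sketch, not the implementation found in Graphael or in van Dongen's MCL software. The following choices are assumptions made for this example: expansion always squares the matrix, a self-loop is added to every node before normalisation, iteration stops when the matrix no longer changes, and clusters are read off by grouping each node with the nodes that still receive flow from it. The example graph (two triangles joined by one edge) is hypothetical.

```python
def normalize_columns(m):
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix using the normal matrix product."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, gamma):
    """Inflation: Hadamard power gamma, then rescale columns to sum to 1."""
    return normalize_columns([[x ** gamma for x in row] for row in m])


def mcl(adjacency, gamma=2.0, max_iter=100, tol=1e-9):
    n = len(adjacency)
    # Assumption: add self-loops before building the stochastic matrix.
    m = normalize_columns([[adjacency[i][j] + (1 if i == j else 0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), gamma)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    return m


def clusters_from(m, eps=1e-6):
    """Group node j with every node i that still attracts flow from j."""
    n = len(m)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adjacency[a][b] = adjacency[b][a] = 1

clusters = clusters_from(mcl(adjacency, gamma=2.0))
print(clusters)
```

On this small graph the single edge between the two triangles carries little flow compared with the edges inside each triangle, so repeated expansion and inflation drive the inter-triangle probabilities towards zero and the two triangles are reported as separate clusters.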
The basic MCL algorithm is given in Algorithm 1 below. It is important to note
that the inflation operator can be altered using the parameter Γ. Increasing this
parameter has the effect of making the inflation operator stronger, and this increases
note that the MCL algorithm has been proven to converge quadratically. In practice,
grams of abstract graphs and networks. It is tedious to draw such kind of graphs
by hand. Therefore, automatic drawing of these kinds of graphs is done using graph
drawing software. Graph drawing software usually has a variety of graph layout
algorithms. Different graph drawing software has been used in a wide variety of im-
We investigated two open-source Java-based graph drawing programs in this work.
These are ZGRViewer [47] and the Graphael [26, 30] programs. Each of the programs
We used open-source programs because they were not only free in terms of monetary
costs but most importantly because we were able to modify the source code of the
ware
ically aimed at displaying graphs expressed in the DOT graph modelling language
using the GraphViz [20, 31] graph drawing library. ZGRViewer is designed to han-
dle large graphs, and offers a zoomable user interface (ZUI), which enables smooth
shot of the same spreadsheet data-flow graph is given in Fig. 3.6. Despite the
fact that ZGRViewer is able to efficiently handle large graphs through smooth and
some shortcomings:
• The whole context of the graph is lost as one zooms in to get a detailed view
Figure 3.5: A screenshot of the ZGRViewer graph drawing software displaying an
unzoomed data-flow graph of a spreadsheet.
• To deal with the problem of visualizing large graphs, graph clustering becomes
a potential solution. Graph clustering allows us to view a subset of the whole
them.
ware
The Graphael program has a number of graph clustering algorithms, including
a geometric graph clustering algorithm as well as the MCL algorithm
Needless to say, the geometric clustering algorithm does not produce clusters that
match with logical areas in the corresponding spreadsheet. On the other hand, our
experiments with the MCL algorithm showed that the clusters produced would in
most cases match with logical areas in the corresponding spreadsheet. A detailed
Chapter 4
4.1 Introduction
We also determined whether the “natural clusters” match with logical areas in the
Graphael
For the Project Accounting spreadsheet given in Fig. 4.1, with its corresponding
formula view given in Fig. 4.2, we generated its corresponding data-flow graph using
the Graphael program. The spreadsheet is used to track income and expenditure
for some projects being run by some company. To avoid the problem of graph clut-
the generated clusters are hierarchically arranged in a cluster tree. An illustration
of a cluster tree is given in Fig. 4.3. The leaves of a cluster tree are the actual
nodes of the generated graph while the rest of the higher-level nodes of the cluster
tree represent clusters. The root of the tree is the highest-level cluster of the graph.
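A cluster tree of this kind can be represented with a small recursive data structure. The following Python sketch is illustrative only and is not taken from the Graphael source code: the class name ClusterNode, its methods and the labels "root", "cluster A" and "cluster B" are hypothetical. The leaf cells are those of two clusters that are shown later in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

    def leaves(self):
        """Return the labels of all graph nodes (cells) below this node."""
        if self.is_leaf():
            return [self.label]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def height(self):
        """Number of levels below this node; a leaf has height 0."""
        if self.is_leaf():
            return 0
        return 1 + max(child.height() for child in self.children)


cluster_a = ClusterNode("cluster A", [ClusterNode(c) for c in ("D6", "F6", "G6", "H6")])
cluster_b = ClusterNode("cluster B", [ClusterNode(c) for c in ("F10", "G10", "H10")])
root = ClusterNode("root", [cluster_a, cluster_b])

print(root.leaves())
print(root.height())
```

Navigating from the root towards the leaves corresponds to moving from the highest-level cluster down to the individual spreadsheet cells.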
In the Graphael program, navigation through the cluster tree is achieved by using
compound fisheye views and treemaps [1]. Fisheye views are a graph visualization
technique which allows one to view a graph as a whole at once while at the same
time providing the ability to the viewer to see detailed parts of the graph without
Figure 4.3: An illustration of a cluster tree
Figure 4.4: A top-most level view of the cluster tree of the Project Accounting
spreadsheet data-flow graph as displayed using Graphael.
losing the overall context of the graph. Compound fisheye views is a fisheye view
cluster while at the same time showing any relationships between the cluster mem-
bers and the rest of the clusters in the cluster tree. On the other hand, treemaps
decomposition. In our case, the cluster tree is also displayed using nested rectangles.
Using the Graphael program, the cluster tree of the spreadsheet data-flow graph
is visualized using two windows which are displayed side by side as in Fig. 4.4.
The right-side window is the cluster window while the left-side window is a treemap
window. The cluster window displays the root node of the cluster tree which is
tree navigation aid because it not only helps in determining the level we are at while
navigating the cluster tree in the cluster window but it also indicates the number
of nodes which are in a selected cluster. We know the level we are at when using a
treemap window by counting the number of thickened rectangular borders from the
Clicking on the root node of the cluster tree as depicted in the cluster window
in Fig. 4.4 leads to the display of the nodes (clusters) at the next lower level of the
cluster tree as depicted in Fig. 4.5. On the other hand, right-clicking on any node
in the currently displayed cluster leads to viewing of nodes which are at the next
In our case, a look at the corresponding treemap in Fig. 4.5 shows that the next
lower-level nodes are leaf nodes. Therefore, clicking on any node (cluster) in Fig.
4.5 should lead to leaf nodes in that particular cluster. For example, in Fig. 4.6,
we have a cluster containing cells D6, F6, G6 and H6 and these are depicted by la-
belled nodes. The unlabelled nodes indicate clusters which are not currently under
selection. Fig. 4.7 depicts a cluster with cells F10, G10 and H10.
Compound fisheye views help us to know the relationship between cluster members
currently being viewed in relation to other cluster members and unselected clusters.
For example, in Fig. 4.6, cluster member G6 is linked to three nodes: F6, D6 and
an unlabelled cluster. We can view these details without losing the overall context
Figure 4.7: An MCL cluster containing cells F10, G10 and H10
algorithm
The size of MCL clusters is dependent on the value of the inflation operator [62].
According to the MCL algorithm, the inflation operator, Γ, has to be greater than
1 (Γ > 1). As Γ values get larger, we would expect tighter (smaller-sized) clusters.
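To see why larger Γ values are expected to give tighter clusters, the effect of inflation on a single column of transition probabilities can be sketched as follows. The function name inflate_column and the probability values are hypothetical and chosen only for illustration; the Γ values are among those used in the experiments reported in this chapter.

```python
def inflate_column(column, gamma):
    """Raise each probability to the power gamma and rescale to sum to 1."""
    powered = [p ** gamma for p in column]
    total = sum(powered)
    return [p / total for p in powered]


column = [0.5, 0.3, 0.2]  # hypothetical transition probabilities out of one node
for gamma in (1.1, 2.0, 7.0):
    print(gamma, [round(p, 3) for p in inflate_column(column, gamma)])
```

As Γ grows, the largest probability in the column takes an ever larger share after rescaling, so weaker connections are suppressed more aggressively and the graph is separated into more, smaller clusters.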
We then set out an experiment to find the best value of the inflation operator, Γ,
that gives us MCL clusters that closely match with logical areas in a given spread-
sheet. We used the same Project Accounting spreadsheet given in Fig. 4.1 for our
experiments. The formula view of the spreadsheet is also given in Fig. 4.2. The
The corresponding treemap and cluster tree are depicted in Fig. 4.8. The resulting
clusters are summarized in Table 4.1. Clearly, we do not get any meaningful MCL
clusters (compare tabulated clusters with the formula view of the spreadsheet in
Fig. 4.2).
Table 4.1: MCL clusters for the Project Accounting spreadsheet with Γ = 1.1
The corresponding treemap and cluster tree are given in Fig. 4.9. Refer to Table 4.2
clusters with the formula view of the spreadsheet in Fig. 4.2 shows a mismatch with
Figure 4.9: Treemap and cluster tree with Γ = 1.5
Table 4.2: MCL clusters for the Project Accounting spreadsheet with Γ = 1.5
The corresponding treemap and cluster tree are depicted in Fig. 4.10. Identified MCL
clusters with Γ = 2.0 are listed in Table 4.3. A comparison of the identified clusters
with the formula view of the spreadsheet shows matches with most logical areas in
the spreadsheet.
Consider the treemap (left window) in Fig. 4.11. It is clear from the treemap that
we have many MCL clusters which have either one member, two members or three
members. For example, one single-membered cluster contains cell F7. An example
Figure 4.10: Treemap and cluster tree with Γ = 2.0
Table 4.3: MCL clusters for the Project Accounting spreadsheet with Γ = 2.0
of an identified two-membered cluster contains cells E10 and I10. An example of a
three-membered cluster is a cluster containing cells B12, B13 and B14. Clearly we
We get the treemap and cluster tree as in Fig. 4.12. Again, we have many
smaller-sized clusters.
We get the treemap and cluster tree as in Fig. 4.13. Again, we have many
smaller-sized clusters. The largest cluster in this case has only three member
Figure 4.12: Treemap and cluster tree with Γ = 3.0
We get the treemap and cluster tree as in Fig. 4.14. Again, we have many
smaller-sized clusters. The largest cluster in this case has only two member cells
Figure 4.14: Treemap and cluster tree with Γ = 7.0
Using the same analysis technique as with the Project Accounting spreadsheet, we
extended our experiment with more spreadsheets. Our results show that the in-
flation operator, Γ = 2, gives clusters that better match with logical areas in the
spreadsheet. Values less than 2 (Γ < 2) give us bigger-sized clusters which do not
match with logical areas in the spreadsheet. On the other hand, values greater than
2 (Γ > 2) give us many smaller (tighter) clusters which are not useful either.
indicated in Table 4.3 are highlighted with different cell background colours and cell
Figure 4.15: The Project Accounting spreadsheet showing highlighted MCL clusters
(when Γ = 2)
Figure 4.16: The formula view of the Project Accounting spreadsheet with high-
lighted MCL clusters (when Γ = 2)
more spreadsheets
To test the efficacy of the MCL algorithm, we ran the algorithm on one more spreadsheet
while maintaining the inflation operator, Γ = 2. The sample spreadsheet used
is the Consolidated Balance Sheet depicted in Fig. 4.17. The formula view of the
spreadsheet is given in Fig. 4.18. A treemap and cluster tree for the spreadsheet
depicting a cluster with cell members F34, F35, F36, F37, F38, F39 and F40 is
Table 4.4 is a summary of identified MCL clusters for the spreadsheet. For each
cluster, we also determine the degree of conformance. We define
the degree of conformance in terms of the number of cells in an MCL cluster and
the number of cells which are supposed to be in the corresponding logical area in a
spreadsheet. For example, the degree of conformance for cluster 2 is 6/8. This is
interpreted as follows: The corresponding logical area for this cluster is supposed
to have 8 cells, but cluster 2 contains only 6 of the 8 cells. A similar interpretation
goes for all the other clusters indicated in Table 4.4. The identified MCL clusters
for the Consolidated Balance Sheet spreadsheet are then highlighted as in Fig. 4.20
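The degree of conformance can be computed mechanically once a cluster and its intended logical area are both available as sets of cells. The sketch below is one possible reading of the definition given above, namely the number of logical-area cells found in the cluster over the number of cells in the logical area. The function name degree_of_conformance and the cell addresses are hypothetical and are not taken from Table 4.4.

```python
def degree_of_conformance(cluster, logical_area):
    """Return (cells of the logical area found in the cluster, size of the logical area)."""
    cluster, logical_area = set(cluster), set(logical_area)
    return len(cluster & logical_area), len(logical_area)


# Hypothetical logical area of 8 cells, of which the cluster captures 6.
logical_area = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]
cluster = ["A1", "A2", "A3", "A4", "A5", "A6"]

found, expected = degree_of_conformance(cluster, logical_area)
print(f"{found}/{expected}")
```

Returning the two counts separately, instead of a reduced fraction, keeps the result in the same form as the 6/8 example given in the text.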
The MCL algorithm was able to identify clusters that match with logical areas in the
in clusters 2, 5, 8 and 12 occur because a cell can only belong to one MCL cluster
where the cell has higher probability of being visited in a random walk. Based on
Figure 4.17: The Consolidated Balance Sheet spreadsheet from the EUSES spread-
sheet corpus [25]
Figure 4.18: The formula view of the Consolidated Balance Sheet spreadsheet
Figure 4.19: A treemap and cluster tree for the Consolidated Balance Sheet depicting
a cluster with cell members, F34, F35, F36, F37, F38, F39 and F40
Table 4.4: MCL clusters for the Consolidated Balance Sheet spreadsheet
Figure 4.20: The Consolidated Balance Sheet with highlighted (shaded) MCL clus-
ters
Figure 4.21: Formula view of the Consolidated Balance Sheet with highlighted
(shaded) MCL clusters
Chapter 5
5.1 Introduction
One of the goals of our spreadsheet visualization tool is to aid in the comprehension
clusters can be used to serve that purpose through a process of cluster member verifi-
cation. Cluster member verification involves verifying whether the identified clusters
belong to their respective logical areas. The aim of this process is to
understand a spreadsheet as well as to identify errors (if any) in it.
sheet
We again consider the Project Accounting spreadsheet in Fig. 5.1, whose corresponding
formula view is given in Fig. 5.2. Identified MCL clusters for the
spreadsheet are given in Table 5.1. The Project Accounting spreadsheet with highlighted
MCL clusters is also given in Fig. 5.3 with a captured Microsoft Excel error
message.
ing spreadsheet
Referring to Table 5.1 in conjunction with the formula view of the Project Accounting
spreadsheet in Fig. 5.2, clusters 1 to 9 have members which indeed, from
the user’s point of view, fall in their respective logical areas. For example, for cluster
1, cells D6, F6, G6 and H6 are in a logical area relating to “Ted” (see Fig. 5.2). We
Cluster No. Member Cells
1. D6, F6, G6, H6
2. D7, F7, G7, H7
3. D8, F8, G8, H8
4. D9, F9, G9, H9
5. F10, G10, H10
6. D11, F11, G11, H11
7. D12, F12, G12, H12
8. D13, F13, G13, H13
9. D14, F14, G14, H14
10. E5:E15, I14
11. F5, F15
12. G5, G15
13. H5, H15
14. I5, I6
15. I7
16. I8
17. I9
18. I10
19. I11
20. I12
21. I13
22. B5, B6, B7, B8
23. B9, B10
24. B11, B12, B13, B14
Figure 5.3: Microsoft Excel displays an error message for a cell in MCL cluster
number 5 in the Project Accounting spreadsheet.
Cluster 5 may draw some interest since its members do not follow the pattern of its
neighbouring clusters. This is because, as the formula view in Fig.
4.16 shows, the formulas of these cells are structurally different from those of cells in
neighbouring clusters. This is not an error despite the fact that Microsoft Excel produces
Cluster 10 has cell range E5:E15 and cell I14. Cell I14 seems to be the odd one
out in this cluster. But this should not be the case. Cell I14 belongs to this cluster
through the cell dependency as defined by the formula I14=I13+E14-F14 (see for-
mula view in Fig. 5.2). This is an example of a case where it may not be obvious to
the user that cluster members belong to the same logical area. We have this phe-
nomenon because a cell may belong to more than one logical area e.g. column-wise
and row-wise. However, it will only belong to one and only one MCL cluster at a
time. The cell will belong to the cluster where there is higher probability of being
visited in a random walk as defined by the MCL algorithm. In this case, cell I14
Cluster 11, cluster 12 and cluster 13 are also in their respective logical areas. However,
one would expect, for example, that cluster 11 would have cell range
F5:F14 as part of this cluster, yet cluster 11 has cells F5 and F15 only. Cells F6 to
F14 are members of other clusters. The reason for this phenomenon is that a cell
can only belong to one cluster at a time and it will belong to the cluster where it
has higher probability of being visited in a random walk as defined by the MCL
algorithm. The same explanation goes for clusters 12 and 13.
Cluster 14 has two cells, cell I5 and cell I6, which are connected by the formula
view suggests that they should belong to one logical area at least from the user’s
perspective.
From the user’s perspective, clusters 22, 23 and 24 should belong to one logical
area containing the cell range B5:B14. However, the MCL algorithm has unnecessarily
split the logical area into three different clusters. This is an example of a
case where an MCL cluster does not necessarily match the user’s
From Table 5.1, we could say that clusters 1 to 21 match the user’s perspective
of logical area while cluster 22, cluster 23 and cluster 24 provide a mismatch. This
5.3 Analysis of the IPO spreadsheet
We also consider the IPO spreadsheet given in Fig. 5.4. The spreadsheet is used
to calculate income after tax for a company. This spreadsheet has been seeded
with errors. Table 5.2 lists the clusters defined by the MCL algorithm for the
IPO spreadsheet. Identified MCL clusters for the IPO spreadsheet are highlighted
in Fig. 5.5. The formula view of the spreadsheet with highlighted clusters is given
in Fig. 5.6.
Figure 5.4: A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet research
website [43]
Table 5.2: MCL clusters for the IPO spreadsheet given in Fig. 5.4
Figure 5.5: The IPO spreadsheet with highlighted MCL clusters.
Figure 5.7: IPO spreadsheet with a Microsoft Excel warning message
Identified MCL clusters presented in Table 5.2 are highlighted in Fig. 5.5. Cluster 1
has two cells, B6 and B20. A look at the formula view of the IPO spreadsheet in Fig.
5.6 indicates that the two cells are connected by the formula B20=B6*B17. According
to the MCL algorithm, these two cells belong to the same cluster. But it is up to
the user to ask whether the cells should really belong
In this case, a user should notice that what was probably intended was that “Taxes”
(B20) should be calculated from “Corporate income tax rate” (B5) and “Sales Revenues”
(B17) and not from “Depreciation rate” (B6) and “Sales Revenues”. Thus
the end-user would notice that this is an error and hence corrections can be made
that cell B20 should have the following formula B20=B5*B17. Hence the MCL clus-
tering technique can help the end-user in debugging a spreadsheet. All the end-user
the same cluster. A similar analysis goes for cluster 2 which has cell members C6
and C20.
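The cluster member verification described above can be sketched in Python as follows. The sketch lists the cells referenced by a formula together with their row labels, so that a reader can judge whether the members belong together. The regular expression is a simplification that only handles plain relative references such as B6; the function names are hypothetical and are not part of our tool.

```python
import re

# Simplified pattern for relative cell references such as B6 or B17
# (absolute references like $B$6 and sheet names are not handled).
CELL_REF = re.compile(r"\b([A-Z]{1,3}[0-9]+)\b")

def referenced_cells(formula):
    """Return the cell addresses referenced in a formula, in order."""
    return CELL_REF.findall(formula)

def describe(target, formula, labels):
    """Spell out a formula using the labels of the cells involved."""
    parts = [f"{labels[ref]} ({ref})" for ref in referenced_cells(formula)]
    return f"{labels[target]} ({target}) is computed from " + " and ".join(parts)

# Labels taken from the IPO spreadsheet discussion above.
labels = {
    "B5": "Corporate income tax rate",
    "B6": "Depreciation rate",
    "B17": "Sales Revenues",
    "B20": "Taxes",
}

seeded = describe("B20", "B6*B17", labels)     # formula as found in the spreadsheet
corrected = describe("B20", "B5*B17", labels)  # formula after the correction
```

Reading the first description aloud ("Taxes is computed from Depreciation rate and Sales Revenues") makes the seeded error easy to spot.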
Notice that in Fig. 5.7, Microsoft Excel produces a warning message
about cell B20. We feel that the message is not very helpful, as the clue it gives
towards resolving the error is not appropriate. On the other hand, it is easy to deduce the
source of the error from MCL clusters by simply verifying whether the members of
It is easy to see that cluster 3 and cluster 4 belong to logical areas that match
the user’s perspective. However, it is interesting to note that cells B5, C5,
B7 and C7 do not belong to any cluster. Take a closer look at Fig. 5.5 and note
that these cells have not been highlighted. A spreadsheet developer would therefore
ask why this is the case. In trying to answer this
question, they would realize that these cells have not been used in any calculation
in the spreadsheet. This is a potential error in the spreadsheet and thus the MCL
clustering technique can help the end-user in finding out about the error and therefore
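The detection of unused cells discussed above can be sketched in Python as follows: a value cell that no formula references has no edge in the data-flow graph, so it cannot appear in any cluster. Only the formula B20=B6*B17 comes from the IPO spreadsheet; the numeric values and the treatment of B17 as a plain value cell are assumptions made for illustration.

```python
import re

CELL_REF = re.compile(r"[A-Z]{1,3}[0-9]+")  # simplified relative references only

def unused_value_cells(values, formulas):
    """Return value cells that are not referenced by any formula."""
    referenced = set()
    for formula in formulas.values():
        referenced.update(CELL_REF.findall(formula))
    return sorted(cell for cell in values if cell not in referenced)

# Hypothetical values; only the formula for B20 is taken from the text.
values = {"B5": 0.35, "B6": 0.10, "B7": 0.05, "B17": 1000.0}
formulas = {"B20": "B6*B17"}

unused = unused_value_cells(values, formulas)
```

Under these assumptions the sketch reports B5 and B7, the column B cells that remain unhighlighted in Fig. 5.5.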
5.4 Summary of experiment results
The MCL algorithm will most often produce clusters that match with the user’s
perspective of logical areas in spreadsheets. The identified MCL clusters have also
been shown to aid in the spreadsheet debugging process through the process of
cells.
Chapter 6
Implementation
6.1 Introduction
We implemented the visualization tool using the Microsoft Excel spreadsheet system
in conjunction with the Graphael graph drawing software. Microsoft Excel was chosen
choice for clustering. We did our programming on the Microsoft Excel side using
the programming language Visual Basic for Applications (VBA). On the other hand,
we had to modify the open-source Java code of Graphael to suit our needs.
the prototype of the visualization tool are given in Figure 6.2 and Figure 6.3.
In the architecture, whenever a user initiates the cluster generation process by in-
interface, the spreadsheet parser module will be run (the “Visualize” menu provides
process of row-wise iteration through all used spreadsheet cells. The cell dependency
information is written to a text file in the Graph Modelling Language (GML) [33]
file format. The algorithm for the spreadsheet parser module is given in Algorithm
program then parses the GML text file for syntactical conformance to the Graph
Modelling Language (GML). This is done using the GML graph parser which is
Figure 6.2: A screenshot of the prototype for the visualization with a “Balance
Sheet” spreadsheet, a cluster window (top-right window) and a treemap window
(bottom-right window).
Figure 6.3: A screenshot of the prototype showing the formula view of the “Balance
Sheet” spreadsheet.
that Mij = 1 if and only if (i, j) is an edge in graph G and Mij = 0 otherwise. An
or using any appropriate data structure. From the generated graph, the Markov
After the MCL graph clusters have been produced, the clusters are displayed by
organizing them in a cluster tree. The leaves of the cluster tree are nodes (cells) of
a particular cluster. The top-most level view of the cluster tree is a single node. A
cluster tree is displayed at different levels in the cluster display window as in Fig. 6.2.
A right-click or left-click of the mouse over a node helps to navigate up and down
the cluster tree. Nodes which have already been visited are distinguished using dif-
ferent colourings. Green colour is used to indicate an already visited node (see Fig.
6.4). Cell members of a currently selected cluster are labelled with cell information
they represent. Nodes representing clusters which are not currently under selection
are left unlabelled. Linkages between currently selected cluster members with other
clusters are indicated through edge connections amongst them. This is because a
cell may belong to more than one logical area in the corresponding spreadsheet.
This technique is known as compound fisheye views. Compound fisheye views help
us to view members of a particular cluster while showing their linkages with other
clusters. To help in the navigation of the cluster tree, we also use a treemap window
in coordination with the cluster window as in Fig. 6.2. While we are navigating the
cluster tree in the cluster window, we use the treemap to know the depth we are at
in the cluster tree. The treemap also helps to know the number of nodes which are
Figure 6.4: A screenshot of the prototype showing the “Balance Sheet” spreadsheet
with highlighted logical areas.
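The cluster tree and the two quantities read off the treemap (the current depth and the number of member nodes) can be sketched in Python as follows. The two-level tree shape and the class and function names are assumptions made for illustration; the cluster contents are clusters 1 and 2 of the IPO spreadsheet.

```python
class ClusterNode:
    """A node of the cluster tree; leaves represent spreadsheet cells."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def leaf_count(self):
        """Number of cells (leaves) below this node, as shown by the treemap."""
        if not self.children:
            return 1
        return sum(child.leaf_count() for child in self.children)

def depth_of(node, name, depth=0):
    """Depth of the named node below `node`, or None if it is absent."""
    if node.name == name:
        return depth
    for child in node.children:
        found = depth_of(child, name, depth + 1)
        if found is not None:
            return found
    return None

# Hypothetical tree: a single top-level node, two clusters, four cells.
root = ClusterNode("root", [
    ClusterNode("cluster 1", [ClusterNode("B6"), ClusterNode("B20")]),
    ClusterNode("cluster 2", [ClusterNode("C6"), ClusterNode("C20")]),
])
```

Navigating down from the root to cell C20 passes through depth 1 (the cluster) to depth 2 (the cell), which is the information the treemap window conveys during navigation.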
As the user selects members of a particular cluster, the cluster members are
written into a text file. Upon demand, members of a currently selected cluster in
the cluster window can be highlighted in the spreadsheet. This is done by invoking
Behind the scenes, the spreadsheet cluster/logical area highlighter module, written
in VBA code, is run; it uses the cluster member text file to highlight the currently
selected cluster cells as in Fig. 6.2. The algorithm for the spreadsheet highlighter
clusters in the cluster window, the user may repeat the spreadsheet highlighting
process, which will lead to all logical areas being highlighted with different colours
in the spreadsheet as in Fig. 6.4. The user is also aided in navigating clusters in
the cluster window by clearly labelling visited nodes with different colourings. This
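The text-file hand-off between the cluster window and the highlighter can be sketched in Python as follows. The actual highlighter is written in VBA; the file layout (one cell address per line), the colour palette and the function names below are assumptions made for illustration only.

```python
import os
import tempfile

PALETTE = ["yellow", "cyan", "orange"]  # hypothetical background colours

def write_members(path, members):
    """Cluster window side: write the selected cluster, one cell per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(members) + "\n")

def read_members(path):
    """Spreadsheet side: read the cell addresses back, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

def highlight(sheet_colours, members, cluster_index):
    """Record one background colour for all cells of a cluster."""
    colour = PALETTE[cluster_index % len(PALETTE)]
    for cell in members:
        sheet_colours[cell] = colour
    return sheet_colours

# Repeating the hand-off for two clusters colours two logical areas.
sheet_colours = {}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "cluster_members.txt")
    for index, members in enumerate([["B6", "B20"], ["C6", "C20"]]):
        write_members(path, members)
        highlight(sheet_colours, read_members(path), index)
```

Each repetition overwrites the same file, which is why the cells of only one cluster are transferred at a time in this sketch.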
6.3 Summary
Chapter 7
Discussion
7.1 Introduction
tion tool that would not only aid in the understanding of a spreadsheet but also aid
trying to achieve the goals of the visualization tool. In the sequel, we present how
when one just uses the superficial numerical (value) view of a spreadsheet. Under-
standing and comprehending a spreadsheet might be necessary when one tries to un-
visualization tool was to use information from the underlying spreadsheet data-flow
We achieved this through the use of a graph clustering algorithm that produces
graph clusters that match with logical areas in the original spreadsheet. The iden-
tified clusters are then highlighted using different cell background colours on the
original spreadsheet.
Hence, instead of looking at the spreadsheet as a whole at once, the user focusses
his/her attention on each highlighted logical area at a time. The spreadsheet un-
derstanding process is therefore properly guided since the focus area matches with
what the user might perceive to be a logical area. In addition, the user also has the
option to analyze cell members of a particular cluster on the graph cluster window
Debugging a spreadsheet is also a daunting task since the numerical (value) view
access details of how spreadsheet computations are done through the formula view
this problem. We, therefore, used information from a spreadsheet data-flow graph in
the spreadsheet debugging process by first generating graph clusters using the MCL
algorithm and then highlighting the identified clusters in the original spreadsheet.
how through a process of cluster member verification, one can identify some types of
cell belongs to a particular logical area or not. Cell formulas in a particular logical
area are also analysed through the same process. Unused numerical cells in the
spreadsheet are also easily identified since they are not highlighted in the spread-
sheet. This is because they are not part of the spreadsheet data-flow graph since
However, we note that there are other types of spreadsheet errors which we
cannot identify using our visualization tool. For example, if a user enters a wrong
kind of an error. We therefore propose that we use the tool with other existing
Assertions help in making sure that numerical cells have expected values.
However, spreadsheets get larger in size as they undergo such maintenance routines.
We therefore also need a spreadsheet visualization tool that should be able to handle
large spreadsheets. Large spreadsheets result in large data-flow graphs which lead to
algorithm that is designed to scale to large graphs. In particular, we used the MCL
algorithm. Our experiments showed that it did indeed scale well to large graphs.
In addition, the spreadsheet tool might also be used in creating spreadsheet doc-
umentation artefacts since one can capture and store cell dependency information
at a particular time through the use of external graph definition text files. The
and then display the graph separately from the spreadsheet window. We separated
the data-flow graph from the spreadsheet because we believe that superimposing
the graph over the spreadsheet clutters the view of the spreadsheet. However, we
tried to maintain the mapping between the spreadsheet and the graph by labelling
graph nodes with corresponding familiar spreadsheet cell addresses such as “A1”.
All spreadsheet cells with formulas also had their corresponding graph nodes labelled
The graphs can also be regenerated anytime the user wishes to do so by just clicking
the same time accessing the corresponding graph on the other side thus achieving
real-time spreadsheet-graph interactivity.
To deal with the problem of visualizing large spreadsheet graphs (which come from
large spreadsheets), we successfully employed the MCL algorithm, which was satisfactorily
able to find “natural” clusters in the graphs which are then highlighted in
finds clusters that match with logical areas in the corresponding spreadsheet. This
is because we did not want a clustering algorithm that produces “meaningless” clus-
ters since that will lead to the incomprehensibility of the spreadsheet thus defeating
mentary windows for the visualization: the cluster window and the treemap window.
The generated MCL clusters are arranged in a cluster tree which is displayed in the
cluster window. The cluster window displays the generated MCL clusters as nodes
with the root of the cluster tree being represented as a node which is used as the
starting point for graph navigation. On the other hand, a treemap is a visualization
our case, the cluster tree is a hierarchical decomposition of the data-flow graph and
as a result the treemap is a complementary navigation aid as one accesses the graph
in the cluster window. Treemaps not only help to visualize the depth we are at
while navigating a cluster tree but also indicate the number of member nodes in a
selected cluster. However, we note that the inclusion of the treemap window on the
display introduces three different windows (i.e. the spreadsheet window, the cluster
window and the treemap window). This might lead to the problem of information
7.6 Summary
In this chapter, we discussed some of the issues being addressed in our graph-based
Chapter 8
Conclusion
the original spreadsheet. The main purpose of separating the graph from the
of the spreadsheet view. However, we note that an issue that can be raised
is the difficulty in the mapping between the spreadsheet and the graph. We
addition, we showed the link between spreadsheet cells and the corresponding
sheets (which lead to large data-flow graphs). This has been handled with the
the visualization of large graphs. We tried to improve navigability of graph
(iii) Provision of a clustering algorithm which identifies graph clusters that match
with logical areas in the original spreadsheet. We achieved this by using the
satisfactorily identifies graph clusters that match with logical areas in spread-
sheets. This is a novel way of finding logical areas in spreadsheets since the
cell formulas.
data-flow graphs which are separated from the spreadsheets. This is in contrast
(ii) We have used a novel way of visualizing large spreadsheet data-flow graphs
graphs that match with logical areas in spreadsheets (Logical areas are not
(iii) We have also demonstrated how the graph-based visualization using the MCL
algorithm can assist in understanding and debugging spreadsheets.
8.3 Limitations
We have used two different software applications hand in hand to produce the vi-
use non-compatible programming languages i.e. VBA for Microsoft Excel and Java
for the Graphael program. This meant that we had to find a VBA-Java application
that we had to use text files as a means of communication between Microsoft Excel
and the Graphael program. We also had to use a text file to submit the input
graph to the graph drawing software because currently the graph drawing software accepts
only graph definitions in files, which are parsed before a graph
is generated. Similarly, for the cluster highlighting process, cluster member names
are written to a text file by the Graphael program, after which the file is accessed
by Microsoft Excel and the cluster members are highlighted in the spreadsheet.
Writing and reading text files introduces a computational overhead which affects the
interaction response time would have been improved if the graph drawing procedure
not a programming language powerful enough to handle advanced algorithms like the MCL
algorithm, which need complex data structures.
(i) We plan to import the MCL algorithm and other graph drawing procedures into
guages such as Microsoft C#. This will eliminate the need for separate graph
response time.
(iii) We also plan to conduct trials of the visualization tool with spreadsheet users
Bibliography
[1] Abello, J., Kobourov, S. G., and Yusufov, R. Visualizing Large Graphs
with Compound-Fisheye Views and Treemaps. In Proceedings of the 12th In-
ternational Symposium on Graph Drawing (2004), pp. 431–441.
[2] Abraham, R., and Erwig, M. Header and Unit Inference for Spreadsheets through
Spatial Analyses. In IEEE International Symposium on Visual Languages and
Human-Centric Computing (2004), IEEE, pp. 165–172.
[4] Abraham, R., and Erwig, M. How to Communicate Unit Error Messages
in Spreadsheets. In Proceedings of the First Workshop on End-User Software
Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.
[5] Abraham, R., and Erwig, M. Type Inference for Spreadsheets. In Pro-
ceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of
Declarative Programming (New York, NY, USA, 2006), ACM, pp. 73–84.
[9] Ballinger, D., Biddle, R., and Noble, J. Spreadsheet Structure Inspec-
tion Using Low Level Access and Visualisation. In Proceedings of the Fourth
Australasian Conference on User Interfaces (Darlinghurst, Australia, 2003),
Australian Computer Society, Inc., pp. 91–94.
[10] Blackwell, A. F. What is Programming? In Proceedings of the 14th Work-
shop of the Psychology of Programming Interest Group, J. Kuljis, L. Baldwin,
and R. Scoble, Eds. PPIG, London, UK, 2002, pp. 204–218.
[11] Burnett, M., Atwood, J., Djang, R. W., Gottfried, H., Reichwein,
J., and Yang, S. Forms/3: A First Order Visual Language to Explore the
Boundaries of the Spreadsheet Paradigm. Journal of Functional Programming
11, 2 (2001), 155–206.
[12] Burnett, M., Cook, C., Pendse, O., Rothermel, G., Summet, J.,
and Wallace, C. End-User Software Engineering with Assertions in the
Spreadsheet Paradigm. In ICSE ’03: Proceedings of the 25th International Con-
ference on Software Engineering (Washington, DC, USA, 2003), IEEE Com-
puter Society, pp. 93–103.
[13] Burnett, M., Cook, C., and Rothermel, G. End-User Software Engi-
neering. Commun. ACM 47, 9 (2004), 53–58.
[14] Chen, T. Y., Kuo, F.-C., and Zhou, Z. Q. An Effective Testing Method
for End-User programmers. In Proceedings of the First Workshop on End-user
Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.
[15] Clermont, M. Heuristics for the Automatic Identification of Irregularities
in Spreadsheets. In Proceedings of the First Workshop on End-user Software
Engineering (New York, NY, USA, 2005), ACM, pp. 1–6.
[16] Clermont, M., Hanin, C., and Mittermeir, R. T. A Spreadsheet Tool
Evaluated in an Industrial Context. In Proceedings of the 3rd European Spread-
sheet Risks Interest Group Symposium (Cardiff, Wales, 2002).
[17] Davis, J. S. Tools for Spreadsheet Auditing. International Journal of Human-
Computer Studies 45, 4 (1996), 429–442.
[18] Deligiannidis, L., Kochut, K. J., and Sheth, A. P. User-centered
Incremental Data Exploration and Visualization. Tech. rep., LSDIS Lab and
Computer Science, University of Georgia, Athens, USA, 2006.
[19] Di-Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. Graph
Drawing: Algorithms for the Visualization of Graphs. Prentice–Hall, Upper
Saddle River, New Jersey, USA, 1999.
[20] Ellson, J., Gansner, E., Koutsofios, L., North, S. C., and Wood-
hull, G. Graphviz Open Source Graph Drawing Tools. In Graph Draw-
ing, Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2002,
pp. 594–597.
[21] Engels, G., and Erwig, M. ClassSheets: Automatic Generation of Spread-
sheet Applications from Object-Oriented Specifications. In ASE ’05: Proceed-
ings of the 20th IEEE/ACM international Conference on Automated Software
Engineering (New York, NY, USA, 2005), ACM, pp. 124–133.
[22] Enright, A. J., Van Dongen, S., and Ouzounis, C. A. An Efficient Al-
gorithm for Large-Scale Detection of Protein Families. Nucleic Acids Research
30, 7 (2002), 1575–1584.
[24] Fisher, M., Cao, M., Rothermel, G., Cook, C. R., and Burnett,
M. Automated Test Case Generation for Spreadsheets. In Proceedings of the
24th International Conference on Software Engineering (New York, NY, USA,
2002), ACM, pp. 141–153.
[25] Fisher, M., and Rothermel, G. The EUSES Spreadsheet Corpus: a shared
resource for supporting experimentation with spreadsheet dependability mech-
anisms. SIGSOFT Software Engineering Notes 30, 4 (2005), 1–5.
[26] Forrester, D., Kobourov, S. G., Navabi, A., Wampler, K., and
Yee, G. V. Graphael: A System for Generalized Force-Directed Layouts. In
Graph Drawing, J. Pach, Ed., vol. 3383 of Lecture Notes in Computer Science.
Springer, 2004, pp. 454–464.
[27] Galletta, D. F., Hartzel, K. S., Johnson, S., Joseph, J., and
Rustagi, S. An Experimental Study of Spreadsheet Presentation and Error
Detection. In HICSS ’96: Proceedings of the 29th Hawaii International Confer-
ence on System Sciences (HICSS) Volume 2: Decision Support and Knowledge-
Based Systems (Washington, DC, USA, 1996), IEEE Computer Society, p. 336.
[33] Himsolt, M. GML: a portable Graph File Format. Tech. rep., Universität
Passau, 94030 Passau, Germany, 1996. URL:
http://www.infosun.fim.uni-passau.de/Graphlet/GML/gml-tr.html, Access date: 10th August,
2007.
[39] Myers, B. A., Burnett, M. M., Wiedenbeck, S., and Ko, A. J. End
user software engineering: CHI 2007 special interest group meeting. In CHI ’07
Extended Abstracts on Human Factors in Computing Systems (New York, NY,
USA, 2007), ACM, pp. 2125–2128.
[41] Nardi, B., and Miller, J. The Spreadsheet Interface: A Basis for End User
Programming. Hewlett-Packard, 1990.
[45] Panko, R. R. Spreadsheet Errors: What We Know. What We Think We Can
Do. In Proceedings of the European Spreadsheet Risk Interest Group Symposium
(2000).
[46] Panko, R. R., and Sprague, R. H. Hitting the Wall: Errors in Developing
and Code Testing a Simple Spreadsheet Model. Decision Support Systems 22,
4 (1998).
[47] Pietriga, E. A Toolkit for Addressing HCI Issues in Visual Language Environ-
ments. IEEE Symposium on Visual Languages and Human-Centric Computing
(VL/HCC) 00 (2005), 145–152.
[48] Randolph, N., Morris, J., and Lee, G. A Generalised Spreadsheet Verification
Methodology. In ACSC ’02: Proceedings of the Twenty-Fifth Australasian
Conference on Computer Science (Darlinghurst, Australia,
2002), Australian Computer Society, Inc., pp. 215–222.
[49] Ronen, B., Palley, M. A., and Henry C. Lucas, J. Spreadsheet Analysis
and Design. Commun. ACM 32, 1 (1989), 84–93.
[50] Rothermel, G., Burnett, M., Li, L., Dupuis, C., and Sheretov,
A. A Methodology for Testing Spreadsheets. ACM Transactions on Software
Engineering and Methodology 10, 1 (2001), 110–147.
[55] Scaffidi, C., Shaw, M., and Myers, B. An Approach for Categorizing
End User Programmers to Guide Software Engineering Research. In WEUSE
I: Proceedings of the First Workshop on End-user Software Engineering (New
York, NY, USA, 2005), ACM, pp. 1–5.
[57] Seta, K., Ikeda, M., Kakusho, O., and Mizoguchi, R. Capturing a
Conceptual Model for End-User Programming: Task Ontology as a Static User
Model. In User Modeling: Proceedings of the Sixth International Conference,
UM97, A. Jameson, C. Paris, and C. Tasso, Eds. Springer Wien New York,
Vienna, New York, 1997, pp. 203–214.
[58] Sjoberg, D. I. K., Dyba, T., and Jorgensen, M. The Future of Empirical
Methods in Software Engineering Research. In FOSE ’07: 2007 Future of
Software Engineering (Washington, DC, USA, 2007), IEEE Computer Society,
pp. 358–378.
[59] Tichy, W. F. Should Computer Scientists Experiment More? IEEE Computer
31, 5 (1998), 32–40.
[60] Tollis, I. G. Graph Drawing and Information Visualization. ACM Computing
Surveys (1996), 19.
[61] van Dongen, S. MCL - an algorithm for clustering graphs. URL:
http://micans.org/mcl/, Access date: 1st August, 2007.
[62] van Dongen, S. Graph Clustering by Flow Simulation. PhD thesis, Centre for
Mathematics and Computer Science, University of Utrecht, The Netherlands,
2000.
[63] Vemula, V. R., Ball, D., and Thorne, S. Towards a Spreadsheet
Engineering. In Proceedings of the 2006 European Spreadsheet
Risks Interest Group (2006). URL:
http://www.eusprig.org/2006/vemula-towards-spreadsheet-engineering.pdf, Accessed on 14th August,
2007.
[64] Vemuri, S., Sengupta, S., and Davis, J. S. Data Dependency Diagrams
for Spreadsheet Applications. In Proceedings of the 30th Annual Southeast
Regional Conference (New York, NY, USA, 1992), ACM, pp. 467–470.
[65] Wang, Y., Carzaniga, A., and Wolf, A. L. Four Enhancements to Auto-
mated Distributed System Experimentation Methods. In ICSE ’08: Proceedings
of the 30th International Conference on Software Engineering (New York, NY,
USA, 2008), ACM, pp. 491–500.
[66] Wiggerts, T. A. Using Clustering Algorithms in Legacy Systems Remodu-
larization. In Proceedings of the Fourth Working Conference on Reverse Engi-
neering (WCRE ’97) (Washington, DC, USA, 1997), IEEE Computer Society,
pp. 33–43.
[67] Wilson, A., Burnett, M., Beckwith, L., Granatir, O., Casburn,
L., Cook, C., Durham, M., and Rothermel, G. Harnessing Curiosity to
Increase Correctness in End-User Programming. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems (New York, NY, USA,
2003), ACM, pp. 305–312.
Glossary
Spreadsheet-related Terminology
fined by cells and formulas which reference cells. A spreadsheet system usually
spreadsheet system is Microsoft Excel which comes with the programming language
where data is entered and calculations are specified through formulas. In Microsoft
Cell: the intersection of a column and a row. One can enter text, a number or
a formula in a cell.
specified through cells and formulas.
formatting that one can use as a basis for developing other spreadsheets.
erence to one or more cells. The results of a formula change if one changes the
Graph-related Terminology
called a vertex or a node. The elements of set E
are called edges and these are unordered pairs of the vertices. For example, the set
V might be {1, 4, 7, 8, 9} and set E might be {{1, 4}, {4, 9}, {1, 8}, {4, 7}}. Together,
by a path.
trees in nature in the sense that graph theoretic trees do not have cycles just as the
branches of trees in nature do not split and rejoin.
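As a small worked example of these definitions, the following Python sketch checks that the example graph given above, with V = {1, 4, 7, 8, 9} and E = {{1, 4}, {4, 9}, {1, 8}, {4, 7}}, is connected and is a tree. The function names are hypothetical, and the tree test relies on the standard fact that a connected graph on n vertices is a tree exactly when it has n - 1 edges.

```python
from collections import deque

def is_connected(vertices, edges):
    """True if every vertex can be reached from every other vertex by a path."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary start vertex
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return seen == set(vertices)

def is_tree(vertices, edges):
    """A connected graph with n vertices is a tree iff it has n - 1 edges."""
    return is_connected(vertices, edges) and len(edges) == len(vertices) - 1

V = {1, 4, 7, 8, 9}
E = [(1, 4), (4, 9), (1, 8), (4, 7)]

example_is_tree = is_tree(V, E)                  # the glossary example
with_cycle_is_tree = is_tree(V, E + [(7, 9)])    # adding {7, 9} closes a cycle
```

Adding the edge {7, 9} creates the cycle 4, 7, 9, so the second graph is connected but no longer a tree.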
a whole at once while at the same time providing the ability to the viewer to see de-
tailed parts of the picture without losing the overall context of the picture. Fisheye
views are important in graph drawing because they enable the display of a complex
within nested rectangles, with each level of nesting corresponding to a level of hi-
using treemaps.
Appendix A
Appendix B
A sample GML graph definition file:
graph [ directed 0
node [
id 1
label "F34 " ]
node [
id 2
label "F35 " ]
node [
id 3
label "F36 " ]
node [
id 4
label "F37 " ]
node [
id 5
label "F38 " ]
node [
id 6
label "F39 " ]
node [
id 7
label "F40=SUM(F34:F39)" ]
edge [
source 1
target 7 ]
edge [
source 2
target 7 ]
edge [
source 3
target 7 ]
edge [
source 4
target 7 ]
edge [
source 5
target 7 ]
edge [
source 6
target 7 ]
]
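A graph definition of this shape could be generated along the following lines. This is a Python sketch for illustration only (the parser module in our tool is written in VBA); the function names are hypothetical, the range expansion handles a single-column range only, and value cells are labelled with the bare cell address.

```python
import re

def expand_range(cell_range):
    """Expand a single-column range such as F34:F39 into its cell addresses."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", cell_range)
    column, first, _, last = match.groups()  # assumes both ends share one column
    return [f"{column}{row}" for row in range(int(first), int(last) + 1)]

def to_gml(target, formula, inputs):
    """Emit one node per cell and one edge from every input cell to the target."""
    lines = ["graph [ directed 0"]
    ids = {}
    for number, cell in enumerate(inputs + [target], start=1):
        ids[cell] = number
        # Formula cells are labelled with address and formula, others with address.
        label = f"{target}={formula}" if cell == target else cell
        lines += ["node [", f"id {number}", f'label "{label}" ]']
    for cell in inputs:
        lines += ["edge [", f"source {ids[cell]}", f"target {ids[target]} ]"]
    lines.append("]")
    return "\n".join(lines)

gml = to_gml("F40", "SUM(F34:F39)", expand_range("F34:F39"))
```

For the formula F40=SUM(F34:F39) the sketch yields seven nodes and six edges, the same structure as the sample file.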