
Spreadsheet Visualization Simplified


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/234808991

A dynamic graph-based visualization for spreadsheets

Article · January 2008


2 authors:

Bennett Kankuzi, University of Eastern Finland
Yirsaw Ayalew, University of Botswana

All content following this page was uploaded by Yirsaw Ayalew on 26 May 2014.



University of Botswana

Faculty of Science
Department of Computer Science

A Dynamic Graph-based Visualization for Spreadsheets

By

Bennett Freinderson Kankuzi


Student ID: 200509238

A dissertation submitted in partial fulfillment of the requirements for the Degree
of Master of Science in Computer Science

Supervised By

Dr. Yirsaw Ayalew

June 2008
Dedication
I would like to dedicate this work to my parents: my late father, Mr. Freinderson
Ishmael Kankuzi and my mother, Mrs Ireen NyaMayuni Kankuzi.

Approval
This dissertation has been examined as meeting the requirements for the partial
fulfillment of Master of Science Degree in Computer Science.

—————————– ———————–
Supervisor Date

—————————– ———————–
Internal Examiner Date

—————————– ———————–
External Examiner Date

—————————– ———————–
Head of Department Date

—————————– ———————–
Dean, School of Graduate Studies Date

Acknowledgements
Firstly, I would like to thank God for giving me strength and courage in the course
of carrying out this work! A big thank you also goes to my supervisor, Dr Yirsaw
Ayalew, who tirelessly guided me in the course of this work. I also thank Dr Ayalew
for introducing me to academic research as well as an exciting world of spreadsheet
research. My other vote of thanks goes to Dr Stephen Kobourov of the University of
Arizona, who co-supervised me in the initial stages of this work and also
provided me with the open-source code of the Graphael graph drawing software.

My heartfelt thanks also go to Mr. Y. Alide and Dr. P.C. Chamdimba, both from
the University of Malawi, for all the encouragement and support. May God richly
bless you. I would also like to thank all friends and relatives who gave me support
in the course of the work.

Finally, I also thank God for the ‘insights’ in the course of this work such that
a number of research papers have been published out of this research work.

This document has been produced with TeXnicCenter, free and open-source
software for the LaTeX typesetting system. I am also grateful to its developers.

Declaration
I hereby declare that this is my original work, except where due reference is made,
and that this dissertation has not been submitted for any degree award in any other
university.

Signed: ———————————–
Bennett Freinderson Kankuzi (STUDENT)

Abstract
Spreadsheet systems are widely used and highly popular end-user systems. They are
highly popular because of the simplicity with which one can create spreadsheets. How-
ever, despite this simplicity in creating spreadsheets, they are generally difficult to
understand and comprehend. The need for understanding spreadsheets arises when
one wants to debug a spreadsheet as well as when one wants to maintain or even just
to understand a spreadsheet created by others. One contributing factor to the
difficulty in understanding spreadsheets is the invisibility of the data dependencies
which are associated with cell formulas.

This research work aims to provide a graph-based visualization approach, based on
the MCL (Markov Clustering) algorithm, that can simplify understanding and debugging
of spreadsheets. The MCL algorithm helps in visualizing spreadsheet data-flow
graphs by generating clusters of cells. Navigation through graph clusters is
provided through complementary techniques of compound fisheye views and treemaps.
More importantly, our experiments show that graph-based visualization using the
MCL algorithm generates clusters which match the corresponding logical areas of a
spreadsheet. Identified MCL clusters are then dynamically highlighted in the original
spreadsheet using different cell background colours. Hence, instead of looking at the
whole spreadsheet at once, the user focuses his/her attention on one highlighted
logical area at a time. The spreadsheet comprehension process is therefore properly
guided since the focus area matches what the user might perceive to be a logical
unit.

Contents

List of Figures
List of Tables
List of Algorithms

1 Introduction
1.1 Background
1.1.1 The end-user programming paradigm
1.1.2 Challenges in end-user programming
1.1.3 Popularity of spreadsheet systems
1.1.4 Importance of spreadsheets
1.1.5 Impact of errors in spreadsheets
1.1.6 Classification of errors in spreadsheets
1.2 Statement of the Problem
1.3 Objectives of our research
1.4 Research Methodology
1.5 Overview of the rest of the Dissertation

2 Related Work
2.1 Introduction
2.2 Spreadsheet error prevention techniques
2.3 Spreadsheet error detection techniques
2.4 Spreadsheet visualization techniques
2.5 Summary

3 Graph-based Visualization
3.1 Introduction
3.2 The need for graph clustering
3.3 An overview of clustering algorithms
3.3.1 Optimization algorithms
3.3.2 Construction algorithms
3.3.3 Hierarchical algorithms
3.3.4 Graph theoretical algorithms
3.4 Choice of clustering algorithm
3.4.1 An overview of the MCL algorithm
3.5 Choice of graph drawing software
3.5.1 Experiments with the ZGRViewer graph drawing software
3.5.2 Experiments with the Graphael graph drawing software

4 The MCL Algorithm and Logical Areas in Spreadsheets
4.1 Introduction
4.2 Generating spreadsheet data-flow graphs using Graphael
4.3 Determining the inflation operator for the MCL algorithm
4.3.1 Discussion of experiment results
4.4 Testing the efficacy of the MCL algorithm on more spreadsheets
4.4.1 Discussion of experiment results

5 Comprehending and Debugging Spreadsheets Using MCL Clusters
5.1 Introduction
5.2 Analysis of the Project Accounting spreadsheet
5.2.1 Verification of MCL clusters for the Project Accounting spreadsheet
5.3 Analysis of the IPO spreadsheet
5.3.1 Verification of MCL clusters for the IPO spreadsheet
5.4 Summary of experiment results

6 Implementation
6.1 Introduction
6.2 Software architecture of the visualization tool
6.3 Summary

7 Discussion
7.1 Introduction
7.2 Spreadsheet understanding and comprehension
7.3 The spreadsheet debugging process
7.4 Spreadsheet maintenance
7.5 Addressing HCI aspects
7.6 Summary

8 Conclusion
8.1 A summary of the research work
8.2 Our contribution
8.3 Limitations
8.4 Future work

Bibliography

Glossary

Appendix A

Appendix B

List of Figures

1.1 An illustration of different views of a spreadsheet by Igarashi et al. [34]

2.1 A Microsoft Excel spreadsheet with data-flow graph arrows. Sourced from [53].
2.2 Formula view of the Microsoft Excel spreadsheet depicted in Fig. 2.1.
2.3 A spreadsheet with its corresponding online data dependency diagram
2.4 An animated presentation of fluid-like flow of data in a spreadsheet by Igarashi et al.
2.5 A screenshot of the S2 visualization by Sajaniemi.
2.6 A formula view of the spreadsheet given in Fig. 2.5.
2.7 A spreadsheet with highlighted logical areas (equivalence classes) by Clermont et al.
2.8 A data-flow graph of semantic classes as proposed by Clermont et al.
2.9 A sample spreadsheet data-flow graph by Ballinger et al.
2.10 Hyperbolic view of a spreadsheet data-flow graph by Ballinger et al.

3.1 A sample Project Accounting spreadsheet. Adapted from [38].
3.2 The formula view of the Project Accounting spreadsheet
3.3 A data flow graph of the given Project Accounting spreadsheet generated by the Graphael graph drawing software.
3.4 An example MCL cluster separation process from van Dongen [61].
3.5 A screenshot of the ZGRViewer graph drawing software displaying an unzoomed data-flow graph of a spreadsheet.
3.6 A screenshot of a zoomed-in spreadsheet data-flow graph in ZGRViewer.

4.1 The sample Project Accounting spreadsheet
4.2 Formula view of the Project Accounting spreadsheet.
4.3 An illustration of a cluster tree
4.4 A top-most level view of the cluster tree of the Project Accounting spreadsheet data-flow graph as displayed using Graphael.
4.5 Second level view of the cluster tree.
4.6 An MCL cluster containing cells D6, F6, G6, H6
4.7 An MCL cluster containing cells F10, G10 and H10
4.8 Treemap and cluster tree with Γ = 1.1
4.9 Treemap and cluster tree with Γ = 1.5
4.10 Treemap and cluster tree with Γ = 2.0
4.11 Treemap and cluster tree with Γ = 2.5
4.12 Treemap and cluster tree with Γ = 3.0
4.13 Treemap and cluster tree with Γ = 5.0
4.14 Treemap and cluster tree with Γ = 7.0
4.15 The Project Accounting spreadsheet showing highlighted MCL clusters (when Γ = 2)
4.16 The formula view of the Project Accounting spreadsheet with highlighted MCL clusters (when Γ = 2)
4.17 The Consolidated Balance Sheet spreadsheet from the EUSES spreadsheet corpus [25]
4.18 The formula view of the Consolidated Balance Sheet spreadsheet
4.19 A treemap and cluster tree for the Consolidated Balance Sheet depicting a cluster with cell members F34, F35, F36, F37, F38, F39 and F40
4.20 The Consolidated Balance Sheet with highlighted (shaded) MCL clusters
4.21 Formula view of the Consolidated Balance Sheet with highlighted (shaded) MCL clusters

5.1 The Project Accounting spreadsheet
5.2 The formula view of the Project Accounting spreadsheet.
5.3 Microsoft Excel displays an error message for a cell in MCL cluster number 5 in the Project Accounting spreadsheet.
5.4 A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet research website [43]
5.5 The IPO spreadsheet with highlighted MCL clusters.
5.6 The formula view of the IPO spreadsheet
5.7 IPO spreadsheet with a Microsoft Excel warning message

6.1 Conceptual architecture of the spreadsheet visualization tool
6.2 A screenshot of the prototype for the visualization with a “Balance Sheet” spreadsheet, a cluster window (top-right window) and a treemap window (bottom-right window).
6.3 A screenshot of the prototype showing the formula view of the “Balance Sheet” spreadsheet.
6.4 A screenshot of the prototype showing the “Balance Sheet” spreadsheet with highlighted logical areas.

List of Tables

4.1 MCL clusters for the Project Accounting spreadsheet with Γ = 1.1
4.2 MCL clusters for the Project Accounting spreadsheet with Γ = 1.5
4.3 MCL clusters for the Project Accounting spreadsheet with Γ = 2.0
4.4 MCL clusters for the Consolidated Balance Sheet spreadsheet

5.1 MCL clusters for the Project Accounting spreadsheet
5.2 MCL clusters for the IPO spreadsheet given in Fig. 5.4

List of Algorithms

1 The basic MCL algorithm
2 The algorithm for the spreadsheet parser module
3 The algorithm for the spreadsheet highlighter module

Chapter 1

Introduction

1.1 Background
1.1.1 The end-user programming paradigm

Computer end-users may be defined as people for whom conventional computer pro-

gramming is not their main job although they use computers as part of their daily

lives [10]. However, it is now commonplace to see computer end-users (hereafter

referred to as end-users) being involved in some form of “programming”. End-users

are being involved in “programming” applications such as spreadsheets, databases,

animations, web applications and simulations, to mention but a few. Although

end-users are not professional programmers, they might be experts in their professional

domains. Some of these end-users are educators, scientists, engineers and business

professionals, and many more belong to other professions.

End-users may be motivated to do some “programming” because they want to use

a computer to accomplish a particular goal. For example, a teacher may create a

spreadsheet for recording student grades for a particular course. End-users may also

do some “programming” because this might be an efficient way of solving a

problem in comparison to manually solving the problem. For example, a mathematician

may write some program code using a mathematical software application to find a

solution to a complex differential equation. In all these cases, their main goal would

be accomplishing a task at hand rather than producing high-quality, dependable

program code [37]. Pre-packaged software applications may not be suitable in these

situations because these software applications cannot do every task required by an

individual and worse still, they cannot be customized to every individual’s needs

[40]. This need has led to the birth of the end-user programming paradigm.

The rising growth in the popularity of the end-user programming paradigm can be

attributed to the tools that have been developed to empower this kind of computer

user. For example, the development of the spreadsheet paradigm has led to many

users developing their own spreadsheets, hence doing some “programming”. An end-

user programming environment provides the tools for an end-user to accomplish a

task at hand. Examples of end-user programming environments include spread-

sheet systems, web authoring applications and animation environments. Ideally, an

end-user programming environment should possess the following characteristics [57]:

• It should provide rapid feedback to the user as he/she works in the environment

• It should provide a framework for end-users to easily and smoothly externalize

their problem solving knowledge in their mind into a computer-readable form

• It should have the capability to interpret a problem as described by the user

(conceptual model) and then generate an equivalent runnable problem solving

model without violating the intended computational semantics

Statistically, it was estimated that, in the year 2005, there were 55 million end-user

programmers in the United States alone. This was about 20 times greater than the

estimated number of professional programmers [14, 55, 67]. These estimates clearly

indicate that a sizable amount of software produced in the whole world is developed

by non-professional programmers. These end-user programmers write programs not

as their primary job function but rather to support their quest for achieving their

main goal such as accounting, doing office work, developing a web page, etc. [40].

1.1.2 Challenges in end-user programming

Despite the huge popularity of end-user programming, programs developed by

end-users are very prone to errors. This is because the programs are not developed

according to software engineering principles as is the case with software developed

by professional software developers. Many end-user developers would not want to

get involved in the nitty-gritty of coding in a particular programming language,

let alone try to learn the formal syntax and semantics of a particular programming

language. In fact, learning programming language syntax has been identified as

one of the significant learning barriers in end-user programming environments [37].

Other learning barriers in end-user programming environments include [37]:

• Design barriers: the end-user programmer might not know what he/she wants

the computer to do in order to solve a problem

• Selection barriers: the end-user programmer might know what he/she wants

the computer to do but does not know how to choose an appropriate tool for

the task

• Coordination barriers: the end-user programmer might know the appropriate

tools for a particular task but he/she does not know how to make the tools

work together in order to solve the problem at hand

• Use barriers: the end-user programmer might know what tools to use for a

particular task but does not know how to use those tools

• Understanding barriers: the end-user programmer might think that he/she

knows how to use a particular tool but unfortunately the tool does not do

what he/she expects

• Information barriers: the end-user programmer might think that he/she knows

why a tool behaved in an unexpected or problematic manner but might

not have the knowledge to check the problem

Another major challenge in end-user programming is the reality that end-user needs

vary so widely that one cannot come up with general design tools and languages

that can fit every end-user programmer’s needs [37]. It is also a major challenge

to make users understand the importance of the programs they develop [37]. This

is particularly true for non-trivial programs that have long life spans and

might therefore need long-term maintenance. Cases in point are some

spreadsheets that are not simple throw-away calculations but are continuously evolving as

part of a business enterprise reporting function. The question would therefore be

how to develop tools that can capture the evolving program’s history and design [37].

In trying to address some of the challenges outlined above, some researchers have

advocated that much research on end-user programming should focus on develop-

ment environments which can help end-users achieve their goals through the use of

metaphors such as forms and spreadsheets [56]. Coupled with the statistics on the

number of end-user programmers, it is easy to see the need for more research in

end-user programming.

1.1.3 Popularity of spreadsheet systems

Spreadsheet systems have become a common end-user programming environment

used for trivial as well as non-trivial applications in private and public enterprises

[9, 45, 67]. They are used for a variety of important tasks such as mathematical

modelling, scientific computations, tabular and graphical data representation, data

analysis and decision making. The provision of computational techniques that match

users’ tasks makes spreadsheet programming easier. There is also a trend towards using the

spreadsheet model as a general model for end-user programming [41]. Spreadsheet

systems are widely used by end-user programmers not only due to their simplicity

but also due to their features which facilitate programming. The suppression of

low-level details of programming, the immediate visual feedback and the availability of

high-level task-specific functions are among the most commonly cited features

[35].

Spreadsheet systems allow computations to be defined by cells and their formu-

las. A cell’s value is defined solely by the formula explicitly given to it by the user

[11]. A cell value is recalculated automatically whenever a value on which it depends

(a reference) changes, thus providing immediate feedback. Spreadsheet systems also

provide for copying of contiguous regions of cells from one physical area to another.

References between the cells may be either absolute or relative in either their hor-

izontal or vertical index. All copies of an absolute reference will refer to the same

row, column or cell whereas a relative reference refers to a cell with a given offset

from the current cell.
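
The effect of copying on absolute and relative references can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the A1-style notation with "$" anchors found in systems such as Microsoft Excel, the function names (copy_formula, col_to_num, num_to_col) are hypothetical, and the simple pattern would also rewrite function names that end in digits (for example LOG10), which a real formula parser would have to exclude.

```python
import re

# Matches A1-style references with optional "$" anchors: B2, $B2, B$2, $B$2.
# Assumption: the formula contains no function names ending in digits.
REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")

def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def num_to_col(n):
    """Convert a 1-based column number back to its label."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

def copy_formula(formula, d_rows, d_cols):
    """Return the formula as it would read after being copied by the given offset."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:                       # relative column: follows the copy
            col = num_to_col(col_to_num(col) + d_cols)
        if not row_abs:                       # relative row: follows the copy
            row = str(int(row) + d_rows)
        return f"{col_abs}{col}{row_abs}{row}"
    return REF.sub(shift, formula)

print(copy_formula("=B2*$D$1", 1, 0))   # copied one row down
print(copy_formula("=B2+$A2", 0, 1))    # copied one column to the right
```

In the first call the relative reference B2 moves with the copy while the absolute reference $D$1 stays fixed; in the second, the mixed reference $A2 keeps its column because only the column is anchored.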

Spreadsheet systems are an example of a functional programming language en-

vironment. In functional programming, computations are specified by providing

arguments to functions and/or operators [11]. However, spreadsheet systems differ

from traditional functional programming language environments in two main ways

[11]:

• Spreadsheets are usually associated with first-order functions only, whereas

traditional functional programming languages also support higher-order functions. In

a first-order function, the arguments are objects such as numbers and class objects,

but not functions themselves.

• There is continuous program evaluation in spreadsheet systems which is nec-

essary to provide immediate feedback to the user.
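
The continuous evaluation described above can be illustrated with a toy model in Python. This is a minimal sketch and not how any particular spreadsheet system is implemented: the Sheet class and its methods are hypothetical, formulas are plain Python functions of the referenced cell values, empty cells are assumed to evaluate to 0, and circular references are assumed not to occur.

```python
class Sheet:
    """A toy spreadsheet in which each cell holds a constant or a formula."""

    def __init__(self):
        self.formulas = {}   # cell name -> (function, list of referenced cells)
        self.values = {}     # cell name -> current value

    def set_value(self, cell, value):
        """Store a constant and update every cell that depends on it."""
        self.formulas.pop(cell, None)
        self.values[cell] = value
        self._recalculate_dependents(cell)

    def set_formula(self, cell, func, refs):
        """Store a formula, evaluate it, and update its dependents."""
        self.formulas[cell] = (func, list(refs))
        self._evaluate(cell)
        self._recalculate_dependents(cell)

    def _evaluate(self, cell):
        func, refs = self.formulas[cell]
        # Assumption: a referenced cell that is still empty counts as 0.
        self.values[cell] = func(*(self.values.get(r, 0) for r in refs))

    def _recalculate_dependents(self, changed):
        # Re-evaluate each formula that references the changed cell, then
        # propagate further. Assumes the dependencies contain no cycles.
        for cell, (_, refs) in self.formulas.items():
            if changed in refs:
                self._evaluate(cell)
                self._recalculate_dependents(cell)


sheet = Sheet()
sheet.set_value("A1", 10)
sheet.set_value("A2", 20)
sheet.set_formula("A3", lambda a, b: a + b, ["A1", "A2"])   # A3 = A1 + A2
sheet.set_formula("B1", lambda total: total * 2, ["A3"])    # B1 = A3 * 2
sheet.set_value("A1", 15)                                   # triggers recalculation
print(sheet.values["A3"], sheet.values["B1"])
```

The lists of referenced cells kept with each formula are the data dependencies of the sheet. The user never sees them directly, yet they determine which cells change when an input changes.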

The spreadsheet paradigm also differs from the procedural programming paradigm

in several ways:

• Spreadsheet programs are modeless in the sense that they do not require the

user to separately code, compile, link, and execute the spreadsheet program

as is the case with procedural programs [52].

• Spreadsheet programs provide immediate feedback to the user. For exam-

ple, when a formula for a particular cell changes, the results are immediately

reflected [52].

• The structure of a spreadsheet program is usually represented in a two-dimensional

tabular layout while the code for procedural programs is represented in a linear

fashion [6].

• From the point of view of a user, a spreadsheet program does not have clear

separation between input, computational code and output. This is not the

case with procedural programs [6].

1.1.4 Importance of spreadsheets

Despite the fact that some spreadsheet programs (hereafter spreadsheet program

shall be used synonymously with spreadsheet) are simple throw-away scratch-pad

calculations, many spreadsheets have been quite useful for business as well as

personal endeavours [67]. There are some large periodically used spreadsheets that are

submitted to regular update-cycles like any conventionally evolving application soft-

ware [16]. This shows that end-user programming, with spreadsheet programming

as an example, cannot be regarded as a trivial subject.

Panko [45] observed that in a previous study, 46 percent of non-trivial spreadsheets

examined were rated as important or very important to the surveyed organization.

Panko also noted that another study found that information generated from

spreadsheets is also used in high-level decision making offices in business enterprises.

This shows how critical non-trivial spreadsheets can be to the running of a business

enterprise. Errors in spreadsheets may therefore lead to erroneous decision making.

Spreadsheets have also been used in science and engineering disciplines such as

physics and chemistry, just to mention a few, because they are more usable than

procedural programs [16]. Another reason for spreadsheet usage in science and engi-

neering is the fact that spreadsheets already incorporate a way of displaying graphs

and this can be very useful in displaying results of scientific experiments.

1.1.5 Impact of errors in spreadsheets

Errors in spreadsheet programs are non-trivial and costly [27, 45]. Despite this ob-

servation, there has not been quantitative data on the impact of spreadsheet errors.

However, the European Spreadsheet Risks Interest Group (EuSpRIG) publishes on

its web page, http://www.eusprig.org/stories.htm, verified stories on how er-

rors in spreadsheet programs have impacted on public as well as private enterprises.

For example, it is documented on the website that in 2004, some city officials, in one

of the cities in the United States, miscalculated the amount of sales taxes generated

at one of the city’s parks during the first couple of months of its operation. The mistake

inflated the figures by tens of thousands of dollars, which in turn meant the total

sales estimates were overblown by millions of dollars. The mistake was attributed

to an error in a spreadsheet formula which amplified a subtotal amount.

It is also documented that some candidates for police officer jobs were told that

they had passed an admission test when in fact they had failed. The reason for

this mishap was that the spreadsheet which the examiners had used to record the

scores was sorted improperly.

It is also documented that mis-stated earnings of a company caused the stock price of

an online retailer to fall by 25 percent in a day and the Chief Executive Officer had

to resign. Again a spreadsheet error was to blame. A single erroneous numerical

input in a spreadsheet was the cause of the mis-statement. These are just some

of the stories that underscore the fact that spreadsheet errors are non-trivial and

costly.

1.1.6 Classification of errors in spreadsheets

Data from spreadsheet field studies and laboratory experiments indicate that errors

in spreadsheets are inevitable. Panko [45] has tabulated data indicating error

rates in spreadsheets as produced by the authors of the various field audits and lab-

oratory experiments. The most important result of these studies is that spreadsheet

error rates are high enough to tell us that most non-trivial spreadsheets will contain

errors.

Several classification schemes have been identified to categorize these errors depend-

ing on the context in which a researcher is doing the analysis [7]. Panko [45] identified

three categories of spreadsheet errors namely mechanical, logical and omission er-

rors. Mechanical errors are simple slips such as mistyping a number or pointing to a

wrong cell when entering a formula. Logical errors are defined as errors that occur

when a spreadsheet developer has a wrong algorithm for a particular formula cell.

On the other hand, omission errors are defined as errors that occur when a spread-

sheet developer does not have complete understanding of the problem at hand and

therefore produces an incomplete spreadsheet model of the problem. Hence omission

errors are introduced due to faulty reasoning.

Another general classification scheme used by Panko [45], categorizes spreadsheet

errors as quantitative errors and qualitative errors. A quantitative error is defined

as an error that produces an incorrect value in at least one bottom-line variable in

a spreadsheet. On the other hand, qualitative errors emanate from factors such as

poor spreadsheet design which may later cause problems in data entry or even lead

to incorrect data modifications and hence generate quantitative errors. This scheme

further categorizes quantitative errors into mechanical, omission and logical errors

which have already been defined in the preceding paragraphs.

Another spreadsheet error classification scheme has been proposed by Ayalew et al.

[7]. Unlike the classification schemes given above, it does not categorize

the errors by their cause, but rather by the spreadsheet concept the errors seem

to be associated with. Thus, they have three categories of errors namely: physical

area related errors, logical area related errors and general errors.

Physical area related errors are defined as those errors that normally deal with miss-

ing values in a physical area or values of the wrong type somewhere in the physical

area. This kind of error leads to several side-effects such as impacting on the results

if new values are added to the area. According to this classification scheme, physical

area related errors include what are termed as “reference to a blank cell/reference to

a cell with value of wrong type” errors, “incorrect physical area specification” errors,

“accidental deletion/addition of a cell within a physical area” errors and “physical

area mix up” errors.

A logical area is defined as an area that represents some kind of cohesion between

cells. It usually originates from copying from the same source multiple times. Ex-

amples of logical area errors include overwriting a formula with a constant value and

having a formula copy misreference.
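
To make the two logical area errors named above concrete, the following Python sketch flags cells in a copied region whose content does not follow the pattern of the other cells. It is an illustration only and not the method of Ayalew et al. [7]: the function names and the sample data are hypothetical, references are assumed to be purely relative A1-style references without "$" anchors, and the most common pattern in the region is assumed to be the intended one.

```python
import re
from collections import Counter

# Matches plain relative references such as B2 (assumption: no "$" anchors).
REF = re.compile(r"([A-Z]+)(\d+)")

def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def relative_form(cell, content):
    """Rewrite each reference as a row/column offset from the host cell."""
    host = REF.fullmatch(cell)
    host_col, host_row = col_to_num(host.group(1)), int(host.group(2))

    def to_offset(ref):
        d_row = int(ref.group(2)) - host_row
        d_col = col_to_num(ref.group(1)) - host_col
        return f"R[{d_row}]C[{d_col}]"

    return REF.sub(to_offset, content)

def suspicious_cells(area):
    """Return the cells whose relative form differs from the most common one."""
    forms = {cell: relative_form(cell, content) for cell, content in area.items()}
    expected, _ = Counter(forms.values()).most_common(1)[0]
    return sorted(cell for cell, form in forms.items() if form != expected)

# Hypothetical column of totals that was filled by copying "=B2*C2" downwards.
totals = {
    "D2": "=B2*C2",
    "D3": "=B3*C3",
    "D4": "120",        # formula overwritten with a constant value
    "D5": "=B5*C4",     # copy misreference: C4 instead of C5
    "D6": "=B6*C6",
}
print(suspicious_cells(totals))
```

Cells copied from the same source share one relative form, so the overwritten constant in D4 and the misreference in D5 stand out from the rest of the region.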

General errors have been defined as those errors that are not explicitly associated

with a physical or logical area and are usually made during formula definition. An

error might occur due to typographical errors or inability to formulate the necessary

mathematical expression for a formula. An error might also occur due to incorrect

use of formats which might affect the way a value is displayed.

1.2 Statement of the Problem

Despite the simplicity in creating spreadsheets, they are generally difficult to under-

stand and comprehend [17]. The need to understand a spreadsheet may arise if one

wants to debug a spreadsheet. It may also be necessary to understand a spreadsheet

when one wants to maintain or even just to comprehend a spreadsheet created by others.

Most spreadsheets are created by end-users and they contain errors which the de-

velopers themselves may not easily notice [45]. Unfortunately, most spreadsheet

errors are not trivial considering the fact that key decisions, for example in business

firms, are based on information extracted from spreadsheets [27, 45]. Therefore, it is

important to help spreadsheet developers expose these errors or even prevent them

from occurring in spreadsheets.

Furthermore, non-trivial spreadsheets may need to be modified by people other

than the spreadsheet developer himself/herself. Moreover, changes to the structure

of the spreadsheet may be necessary since spreadsheets may need to be maintained just

as any conventionally evolving application software [16]. However, for one to make

meaningful changes to the structure of a spreadsheet, he/she needs to understand

the spreadsheet first. Spreadsheets normally come in the two-dimensional tabular

arrangement of numeric values with some accompanying explanatory text. Usually

this does not suffice for a third party to clearly comprehend and understand what

the spreadsheet is all about.

A spreadsheet is usually perceived only as a two-dimensional grid of cells popu-

lated mainly with numerical values although every spreadsheet has a formula view

as well as an underlying data-flow graph [34] (see illustration in Fig. 1.1). A data-

flow graph represents the network-structure of cell dependencies expressed by the

references in the individual formulas. However, the data-flow graph is normally

“hidden” from the spreadsheet developer. It is therefore not surprising that most

Figure 1.1: An illustration of different views of a spreadsheet by Igarashi et al. [34]

spreadsheet developers view a spreadsheet as a word processor for numbers and not

necessarily as a complex data-flow graph that spreadsheets really are [15]. Despite

this view from spreadsheet developers, the key to understanding spreadsheets is to

clarify the data dependencies among the cells [17]. In other words, visualizing and

clarifying the inherent data-flow graph can help users understand a spreadsheet as

well as aid in the spreadsheet debugging process. This is so, because human beings

process and understand visual representations of data much faster and in a more

effective way than doing so by reading the numerical or textual representations of

the same data [18].

Like the numerical view of a spreadsheet, the formula view of a spreadsheet also

has some disadvantages. For example, the formulas which compute the values of

cells are hidden. It is possible to see either the formulas or the values but not both

at the same time. For a single cell, it is possible to see both at the same time but

this does not give much information about the overall structure of the spreadsheet.

In some cases, this locality to a single cell may help by narrowing the point of focus

instead of dealing with the spreadsheet as a whole, but it is also difficult to get a sense

of the general structure of the whole spreadsheet [32, 42]. As a result, it is difficult

to identify where data comes from and where it goes unless one makes a detailed

examination of the cell dependencies.

Therefore, it is against this background that this research work was embarked on

with the aim of developing a tool for visualizing spreadsheet data-flow graphs that

could help in solving problems of spreadsheet comprehension as well as spreadsheet

debugging.

1.3 Objectives of our research

Our research work has four main objectives:

(i) We want to generate the data-flow graph of a given spreadsheet with nodes rep-

resenting cells in the spreadsheet and edges representing dependencies between

cells, so that the visualization (the generated data-flow graph) is

useful for spreadsheet understanding, debugging and maintenance. However,

generating data-flow graphs leads to the problem of visualizing large graphs

since normally the number of nodes and edges in the generated graph becomes

large hence introducing problems of graph navigability and comprehension.

(ii) We would like to deal with the problem of visualizing large graphs through

graph clustering. Clustering allows us to view a manageable subset of the

data-flow graph at a time. Provision of proper navigation techniques through

the generated clusters shall also be an important aspect of this work. More

importantly, we would like to produce “meaningful” clusters i.e. clusters that

match with logical areas of the given spreadsheet. A logical area in a spread-

sheet may be defined as a group of cells in a spreadsheet that from the spread-

sheet creator/user perspective form a logical unit due to the semantics of the

spreadsheet.

(iii) We would like to separate the graph-based visualization from the spreadsheet so

as to avoid the problem of cluttering on spreadsheet display as this introduces

information overload. At the same time we would like to maintain mapping

between spreadsheet cells and graph nodes.

(iv) We would like to generate our visualization dynamically so that we are able to

achieve real-time spreadsheet-visualization interactivity.

1.4 Research Methodology

We conduct our research work using a combination of research methodologies namely

experimentation, case study and prototyping.

Experimentation is a term that is not universal [65]. Therefore we define it in

the context of this research work. An experiment shall involve the running of a

computer program multiple times while varying either program inputs or program

parameters and observing the program outcomes. Based on the observation of the

program outcomes, we infer some system properties and characteristics. Observa-

tions in experiments are very important because they can lead to new useful and

unexpected insights that can also open new areas of investigation [59]. We use

experimentation in this research work in different tasks such as:

• choice of suitable graph drawing software

• determination of performance of our chosen graph clustering algorithm on

different spreadsheets

• determination of suitable clustering parameters of our chosen graph clustering

algorithm.

A case study is an empirical enquiry that allows one to investigate a contemporary

phenomenon within its real-life context [58]. In software engineering, case studies

are useful for the industrial evaluation of different software engineering tools and

methods [58]. For example, different software tools may be evaluated on how their

features may be suitable in accomplishing a particular task. Hence to avoid bias and

to ensure internal validity, a valid basis is identified to assess the results of the case

study [58]. However, case studies have the disadvantage in that results may not be

generalized easily [59]. In this work, we use spreadsheets sourced from the EUSES

Spreadsheet Corpus [25] and the Spreadsheet Research website [43]. This is because

we want to conduct our experiments on real-life spreadsheets. The referred sources

are repositories of spreadsheets collected from different organizations and business

firms.

Prototyping involves the assembly of a model of an unfinished software system.

The features of a prototype portray the capabilities of a finished software system at

a glance. Prototyping may also offer a demonstration that theoretical ideas can be

put into a “real-life” software tool or product. In other words, prototypes provide

proof-of-concepts and they may also provide incentives to study a research question

further [59]. However, it is important to note that prototypes do not provide solid

evidence supporting a theory or ideas [59].

In this research work, we assemble a prototype of the spreadsheet visualization tool

using the Microsoft Excel spreadsheet system in conjunction with an open-source

Java based graph drawing software. Programming in the Microsoft Excel spread-

sheet system is done in Visual Basic for Applications (VBA). We also modify the

source code of the graph drawing software to suit the requirements of our applica-

tion.

1.5 Overview of the rest of the Dissertation

The rest of the dissertation is organized as follows: Chapter 2 provides a review of

related research works by other researchers in this research area. Our graph-based

approach to the research problem is introduced in Chapter 3. Our experiments with

the MCL algorithm on spreadsheets using the Graphael graph drawing software are

given in Chapter 4. We demonstrate how clusters identified using the MCL algo-

rithm can be used to comprehend and debug spreadsheets in Chapter 5.

A conceptual architecture of an implementation of the prototype of the visualization

tool is presented in Chapter 6. A discussion of the results from this research work

as well as a discussion of some issues that emerged from this research work is given

in Chapter 7. We conclude this dissertation in Chapter 8 with a summary of our

contribution in this research area, a presentation of limitations of our spreadsheet

visualization technique as well as proposed future works.

Chapter 2

Related Work

2.1 Introduction

Considering the importance of spreadsheets, several research works have been un-

dertaken to address the problem of quality in spreadsheets. Some research works

focussed on error prevention techniques in spreadsheets while others focussed on er-

ror detection techniques in spreadsheets. Furthermore, other research works focussed

on spreadsheet visualization techniques with the aim of improving error detection,

debugging and general comprehension of spreadsheets. Other researchers focus on

the application of principles of software engineering to spreadsheet development.

This growing research direction is being embodied in a new and growing discipline

known as end-user software engineering [13, 39, 51, 56]. Some of the research ques-

tions being tackled in this research direction include:

• How can software engineering life cycle models be used in spreadsheet devel-

opment?

• How can improved programming practices such as teamwork and code inspec-

tion help in creating error-free spreadsheets? Some work in this area includes

that of Panko and Sprague [44, 46], which explored the benefits of code

inspection in spreadsheets. Vemula et al. [63] also researched groupwork

in spreadsheet development and testing.

• Development of tools and techniques that can help in testing, debugging and

verification of spreadsheets to minimize risk from errors in spreadsheets. Some

work in this research direction includes: using assertions in helping end-user

programmers to correct spreadsheet errors [12], fault tracing in spreadsheets

using “interval testing” and slicing [8], using type inference to identify pro-

gramming errors in spreadsheets [5], just to mention but a few.

2.2 Spreadsheet error prevention techniques

Several research endeavours have already reported on techniques that can be used

to prevent errors from happening in spreadsheets. The rationale for this research

path is that it is easier to prevent errors than to correct them in spreadsheets.

Ronen et al. [49] proposed a structured approach to spreadsheet design as a way

of preventing errors in spreadsheets. The basis of this proposal was that a lack

of design methodology in spreadsheets brings in problems of reliability, auditabil-

ity and modifiability of spreadsheets. They introduced spreadsheet flow diagrams

(SFDs) as a way of structuring spreadsheets. Spreadsheet flow diagrams are similar

to flow-chart diagrams for structured programming. They argued that spreadsheet

flow diagrams would help the designer structure the spreadsheet solution model to

a problem. Spreadsheet flow diagrams could also assist in communicating the struc-

ture of a spreadsheet model to others and they could also serve as a documentation

tool when it is necessary to audit or modify the spreadsheet.

Some researchers have also proposed data control techniques as one way of pre-

venting errors from occurring in spreadsheets (e.g. Panko [45]). Some proposed data

control techniques include:

• protection of cells and worksheets from unauthorized use. For example, cell

protection can allow users to change only pre-specified input cells so that if a

user attempts to “hardwire” a formula cell, they will be prevented from doing

so. In hardwiring a formula cell, a user cursors to a formula cell and enters

a number in the cell. This usually happens when a user does not realize that

the cell was a formula cell and they think that they should just enter a value

in the cell.

• provision of data entry validation through the re-keying of input data. This

method is also used in traditional data processing and it is called data veri-

fication. This method easily prevents errors from occurring since it is easy to

check if two input areas are the same and if not, it is also easy to determine

where the error lies.
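
The re-keying check described in the last item can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `find_rekey_mismatches` and the representation of an input area as a dictionary from cell address to value are hypothetical and are not taken from any spreadsheet system.

```python
_MISSING = object()  # sentinel for a cell that was keyed in only once


def find_rekey_mismatches(first_entry, second_entry):
    """Return the cell addresses whose two keyed-in values differ.

    Both arguments map a cell address (e.g. "B2") to the value typed
    for that cell; a cell present in only one entry counts as a mismatch.
    """
    addresses = sorted(set(first_entry) | set(second_entry))
    return [a for a in addresses
            if first_entry.get(a, _MISSING) != second_entry.get(a, _MISSING)]


# Hypothetical input area keyed in twice; B3 has transposed digits.
first = {"B2": 120, "B3": 75, "B4": 310}
second = {"B2": 120, "B3": 57, "B4": 310}
print(find_rekey_mismatches(first, second))  # ['B3']
```

The returned addresses point directly at the cells to re-check, which is the property that makes this kind of verification easy to act upon.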

Erwig et al. [23] developed a system called Gencel in which spreadsheet templates

using the Visual Template Specification Language (ViTSL) are used to generate

spreadsheets which are free from reference, range or type errors. With this technique,

spreadsheet templates are created and verified by domain experts and later on can

be used by less experienced users to generate spreadsheets that always conform to

the template. This concept was extended to include the automatic generation of

spreadsheet templates from object-oriented specifications that have been specified

using Unified Modeling Language (UML) diagrams [21].

2.3 Spreadsheet error detection techniques

It may be inevitable to introduce errors in spreadsheets. Therefore, some researchers

have focussed on techniques that help in the detection of errors in spreadsheets as

well as in testing techniques for spreadsheets.

Ayalew et al. [7, 8] developed a spreadsheet debugging technique based on “in-

terval testing” and slicing. In this technique, each formula cell has a user-specified

value interval and a system-generated value interval. When the user-specified in-

terval and the system-generated interval for a cell do not agree with the actual

spreadsheet computation, the cell is marked as displaying a symptom of a fault. A

fault tracing strategy is then used to identify the most influential faulty cell from

the cells perceived by the system to contain faults. This is based on the number of

precedents and dependents of the influential faulty cells.
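
The comparison of intervals can be made concrete with a small Python sketch. It is not the implementation of the cited technique: the `Interval` class, the restriction to addition, and the rule used in `shows_symptom` (flag a cell when its computed value or its system-generated interval falls outside the user-specified interval) are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    low: float
    high: float

    def __add__(self, other):
        # Interval addition: add the lower bounds and the upper bounds.
        return Interval(self.low + other.low, self.high + other.high)

    def contains_value(self, value):
        return self.low <= value <= self.high

    def contains_interval(self, other):
        return self.low <= other.low and other.high <= self.high


def shows_symptom(value, user_interval, system_interval):
    """Assumed rule: flag the cell if the computed value or the
    system-generated interval falls outside the user's expectation."""
    return (not user_interval.contains_value(value)
            or not user_interval.contains_interval(system_interval))


# Hypothetical cell C1 = A1 + B1, with intervals given for the inputs.
a1, b1 = Interval(0, 50), Interval(10, 20)
system_c1 = a1 + b1  # propagated interval: Interval(low=10, high=70)
print(shows_symptom(45, Interval(0, 100), system_c1))  # False
print(shows_symptom(45, Interval(0, 60), system_c1))   # True
```

In the second call the computed value 45 looks plausible on its own, but the propagated interval shows that the formula could produce values up to 70, beyond what the user expects, so the cell is flagged.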

Rothermel et al. [50] also developed a spreadsheet testing methodology which they

termed “What You See Is What You Test” (WYSIWYT) to help users test spread-

sheets. Since testing and debugging are closely interrelated, we find it worthwhile

to make mention of this methodology. The methodology uses data-flow adequacy

and coverage criteria to give the user feedback on how well tested a spreadsheet

is. The WYSIWYT testing methodology has been integrated with another spread-

sheet testing technique known as the “Help Me Test” (HMT) [24] technique into the

Forms/3 [11] spreadsheet language. The HMT technique automatically generates

test cases for the user as he/she actively works on the spreadsheet. Forms/3 is a

form-based research spreadsheet language developed at the Oregon State University.

The Forms/3 spreadsheet language also allows users to define assertions on the ex-

pected cell values [12]. To promote the usage of assertions by end-user programmers,

Wilson et al. [67] devised a curiosity-centred approach to eliciting assertions from

end-users through a “surprise-explain-reward” strategy.

Randolph et al. [48] developed a spreadsheet verification tool based on the WYSI-

WYT methodology. Their main emphasis was to use the WYSIWYT methodology

algorithms in implementing a spreadsheet independent tool. They placed much em-

phasis on issues of portability and the automatic generation of test cases.

Abraham and Erwig [2] developed an automated reasoning system for spreadsheets

called UCheck. UCheck infers header unit information for cells in a spreadsheet.

Based on the header unit information, the system identifies cells in the spreadsheet

that contain erroneous formulas. They extended the UCheck system to produce a

system known as UFix [4] in order to improve on the way error messages are re-

ported to users hence improving the spreadsheet debugging process. Abraham and

Erwig also developed a type system and a type inference algorithm for spreadsheets

which can be used in identifying some kind of errors in spreadsheets [2].

Abraham and Erwig [3] also developed a spreadsheet debugger known as GoalDebug

based on a technique known as “goal-directed debugging”. GoalDebug allows users

to mark cells with incorrect outputs and specify the expected output. The GoalDe-

bug system then generates a list of change suggestions, any one of which when

applied would result in the expected output being computed in the marked cell. The

generated change suggestions are ranked based on a set of heuristics before being

presented to the user. The generated change suggestions can be applied automatically,

thereby eliminating errors that can be introduced by end users through

editing of cell formulas.

Metamorphic testing is also proposed as a potential way which can be used to

test spreadsheets [14]. This technique has also been used to test other end-user de-

veloped software such as web applications, simulation and scientific computations.

Metamorphic testing utilizes information carried out in successful test cases. An

essential part of metamorphic testing is to identify effective metamorphic relations.

A metamorphic relation is any relation among program inputs and the outcomes of

multiple executions of the target program. The outcomes of multiple executions of

the target program using isomorphic test cases are supposed to match, otherwise

the tested program is at fault. A good metamorphic relation can be identified eas-

ily by a program tester who has black-box knowledge of the problem domain and

white-box knowledge of the program structure.
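
As an illustration of a metamorphic relation, consider a spreadsheet total that sums a column of amounts and applies a tax rate. Reordering the rows should not change the total. The Python sketch below checks this relation; the functions `invoice_total` and `faulty_total` are hypothetical stand-ins for spreadsheet formulas, and the seeded fault (a range that misses its last row) is an assumption chosen for the example.

```python
def invoice_total(amounts, tax_rate):
    """Stand-in for a formula such as =SUM(B2:B5)*(1+D1)."""
    return sum(amounts) * (1 + tax_rate)


def faulty_total(amounts, tax_rate):
    # Seeded fault: the range misses the last row, as in =SUM(B2:B4)*(1+D1).
    return sum(amounts[:-1]) * (1 + tax_rate)


def follow_up_inputs(amounts):
    """Reorderings of the source test case: the reverse and every rotation."""
    yield amounts[::-1]
    for k in range(1, len(amounts)):
        yield amounts[k:] + amounts[:k]


def satisfies_permutation_relation(program, amounts, tax_rate, tol=1e-9):
    """Metamorphic relation: reordering the rows must not change the total."""
    expected = program(amounts, tax_rate)
    return all(abs(program(reordered, tax_rate) - expected) <= tol
               for reordered in follow_up_inputs(amounts))


data = [100.0, 250.0, 75.0, 40.0]
print(satisfies_permutation_relation(invoice_total, data, 0.1))  # True
print(satisfies_permutation_relation(faulty_total, data, 0.1))   # False
```

Note that no expected total is ever supplied: the fault is exposed purely because two executions that should agree do not. A different fault might need a different relation to be revealed.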

2.4 Spreadsheet visualization techniques

Various spreadsheet visualization tools have also been proposed for different pur-

poses such as spreadsheet comprehension, debugging, documentation, etc. Most of

these spreadsheet visualization tools are based on the data-flow graph behind the

spreadsheets [53]. Spreadsheet visualization is part of a discipline known as Infor-

mation Visualization. Information visualization through automatic graph drawing

involves construction of geometric representations of conceptual structures that are

modelled as objects and connections between those objects [60]. In a graph, the ob-

jects are represented by nodes and edges are used to represent connections (relation-

ships) between those objects. Automatic generation of graph drawings has been

carried out for a wide variety of information visualization applications in science as

well as in engineering [19]. Some example application areas include:

• the World Wide Web: visualization of site maps and construction of browsing

history diagrams, etc.

• Software Engineering: construction of data flow diagrams, program call graphs,

object-oriented class hierarchies, entity-relationship diagrams, etc

• Artificial Intelligence: construction of knowledge representation diagrams

• Management Science: construction of organization charts, PERT diagrams

Our research work focusses on the visualization of spreadsheet structures through

the automatic generation of corresponding data dependency (data-flow) graphs.

Microsoft Excel, a popular commercial spreadsheet system, provides a built-in prece-

dents/dependents tracer tool which upon request allows a spreadsheet developer to

either get precedents or dependents of a particular cell. Arrows are then drawn

linking the precedents or dependents to the selected cell. Fig. 2.1 shows a Microsoft

Excel spreadsheet with arrows depicting the data-flow graph as generated by the

tracer tool. The formula view of the spreadsheet is given in Fig. 2.2. One prob-

Figure 2.1: A Microsoft Excel spreadsheet with data-flow graph arrows. Sourced
from [53].

lem with this tool is that one cannot get the overall data-flow graph for the whole

spreadsheet at a single request. Therefore one cannot have a global view of the

overall data-flow graph in a single step. Another major drawback with this kind of

visualization is that the visualization is superimposed on the spreadsheet display.

This clutters the spreadsheet view and as a result reduces readability and compre-

Figure 2.2: Formula view of the Microsoft Excel spreadsheet depicted in Fig. 2.1.

hension of the spreadsheet.
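
The difference between tracing one cell at a time and obtaining the full upstream or downstream structure in a single step can be sketched in Python. The sketch assumes the direct precedents of each formula cell are already known and stored in a dictionary; the names `invert` and `trace` and the sample sheet are hypothetical.

```python
from collections import defaultdict


def invert(precedents):
    """Turn a cell -> direct precedents map into a cell -> direct dependents map."""
    dependents = defaultdict(set)
    for cell, sources in precedents.items():
        for source in sources:
            dependents[source].add(cell)
    return dict(dependents)


def trace(start, links):
    """Collect every cell reachable from start by following links repeatedly."""
    seen, stack = set(), [start]
    while stack:
        for nxt in links.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical sheet: C1 = A1 + B1, D1 = C1 * 2, E1 = D1 + A1
precedents = {"C1": {"A1", "B1"}, "D1": {"C1"}, "E1": {"D1", "A1"}}
dependents = invert(precedents)

print(sorted(trace("E1", precedents)))  # ['A1', 'B1', 'C1', 'D1']
print(sorted(trace("A1", dependents)))  # ['C1', 'D1', 'E1']
```

A single call to `trace` returns all direct and indirect precedents (or dependents) of a cell, whereas a per-cell tracer has to be invoked repeatedly to reveal the same information.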

Davis [17] produced two spreadsheet visualization tools: the arrow tool and the

online data dependency diagram. The arrow tool is similar to the earlier versions

of the Microsoft Excel (MS Excel 97) precedents/dependents tracer tool with the

exception that the arrow tool coloured precedent and dependent cells in addition

to grouping logically related cells. Again the visualization is superimposed on the

spreadsheet display hence bringing in problems associated with the Microsoft Ex-

cel’s precedents/dependents tracer tool.

Online data dependency diagrams, as spreadsheet visualization tools, are based on

flow-chart like diagrams (see Fig. 2.3). Distinctive symbols are used to represent

cells according to whether they function as inputs, outputs, decision variables or

parameters of formulas. Arrows are used to show data dependencies amongst the

cells by connecting the symbols. The visualization produced is not superimposed on

the spreadsheet display as in the other tools explained in the preceding paragraphs.

Instead, the tool displays the spreadsheet in a window on one side of the screen and

the diagram in a separate window on the other side as in Fig. 2.3. However, it has to

be noted that the visualization is statically generated. As a result, the tool’s author

suggested that if this visualization could be produced automatically, it could serve

as a practical spreadsheet auditing tool because one could produce it when needed.

Davis continues to state that the visualization was statically generated because at

Figure 2.3: A spreadsheet with its corresponding online data dependency diagram

the time the visualization was proposed, sufficiently good graph drawing algorithms

were not available. This is no longer the case, and therefore we would like to exploit

the availability of such robust graph drawing algorithms for automatic (dynamic)

generation of such kind of visualizations.

On a related note, Vemuri et al. [64] conducted an experimental study on the use-

fulness of online data-dependency diagrams for visualizing spreadsheets. Although

their study did not conclude that online data-dependency diagrams were useful, their

studies indicated optimism by users that online data-dependency diagrams would

be useful for maintaining larger spreadsheets.

Figure 2.4: An animated presentation of fluid-like flow of data in a spreadsheet by
Igarashi et al.

Igarashi et al. [34] also developed a visualization tool that depicts a fluid-like flow

of data in a spreadsheet as illustrated in Fig. 2.4. The main emphasis in this

visualization tool is the visualization of the hidden data-flow structure behind the

tabular layout of a spreadsheet. Transient local views are used to visualize data-flow

structures associated with individual cells while it is possible to view the data-flow

structure of the entire spreadsheet at once. A user is also able to navigate through

the data flow structure interactively and it is possible to construct formulas using

graphical editing techniques hence the provision of visual editing. However the main

drawback with the tool is that it fails to scale on spreadsheets containing more than

400 used cells because there is noticeable degradation in performance with more

than 400 used cells. This limits the application of the technique to larger spread-

sheets.

Sajaniemi [53] developed the S2 and S3 spreadsheet visualization tools in which

logical areas or semantic units in a spreadsheet are highlighted and data-flow be-

tween logical areas is indicated through arrows. A screenshot of the S2 visualization

tool is given in Fig. 2.5 and the corresponding formula view of the spreadsheet is given

in Fig. 2.6. The S3 visualization is a slight improvement to the S2 visualization.

Highlighted areas in the visualization describe the plan structure of the spreadsheets

and deviations from this structure show clearly in the visualization hence helping in

the spreadsheet debugging process. Both tools have the disadvantage that they are

also superimposed on the spreadsheet display hence introducing cluttering of the

display. In addition to this, the overall data-flow graph cannot be generated in a

single step.

Figure 2.5: A screenshot of the S2 visualization by Sajaniemi.

On the same line, Sajaniemi further suggested that spreadsheet visualization tools

Figure 2.6: A formula view of the spreadsheet given in Fig. 2.5.

should satisfy the following salient features [53]:

• The visualization should be superimposed on the spreadsheet so as to reduce

cognitive overhead in mapping between two display windows. However, we

note that the superimposition of the visualization graph on the spreadsheet

display clutters the view of the spreadsheet. Therefore our work shall avoid

this problem by providing a separate window for the visualization.

• The visualization can be constructed dynamically since users would probably

like to use tools that require less user intervention. Our visualization tool shall

satisfy this requirement since the visualization shall be dynamically generated

upon the click of an appropriate button on the spreadsheet interface.

Ayalew et al. [7] proposed a graphical spreadsheet visualization model that is not

only based on a data-flow graph but also on visualizing logical and physical areas

in spreadsheets. They proposed that such a visualization should be generated auto-

matically with little or no intervention of the spreadsheet programmer. They also

proposed that the visualization should allow zooming into specific areas of the gen-

erated graph without losing the global view of the graph using fisheye views. Their

proposed visualization model was to serve three purposes:

• shortening the trial and error process to develop solutions for real-world prob-

lems through support for problem understanding since problem understanding

can be supported by the graphical representation of the spreadsheet model.

• help in the maintenance of existing spreadsheets since a visualization can help

in the understanding of spreadsheet programs developed by others.

• enabling comparison of spreadsheets at the level of the spreadsheet model

based on model properties such as data-flow, physical and logical areas and

not just cell values.

Clermont et al. [16] developed a spreadsheet visualization toolkit that partitions a

spreadsheet into logical areas known as equivalence classes. The equivalence classes

are mainly based on structural similarity of formulas. Identified equivalence classes

are then highlighted in the original spreadsheet as in Fig. 2.7. The toolkit has three

components:

• A structure browser which displays the generated equivalence classes.

• A dependency viewer that displays the data-flow graph between the dependen-

cies of the cells that are in the equivalence class that is currently highlighted

in the structure browser.

• The spreadsheet itself which gives feedback to the user/programmer by high-

lighting the cells that are in the equivalence class (logical area) that is currently

selected in the structure browser.
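
One simple way to approximate grouping by formula similarity is to rewrite every cell reference as an offset from the cell that hosts the formula, and then group cells whose rewritten formulas are identical. The Python sketch below does this. It is an assumption-laden simplification and not the toolkit's algorithm: it handles only plain A1-style references (no `$` anchors or sheet names), it would mistake function names that end in digits for references, and it implements only one notion of similarity.

```python
import re
from collections import defaultdict

CELL_REF = re.compile(r"\b([A-Z]+)([0-9]+)\b")


def col_to_index(letters):
    """Convert a column label such as 'A' or 'AB' to a 1-based index."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def relative_form(address, formula):
    """Rewrite each reference as a row/column offset from the host cell."""
    host = CELL_REF.fullmatch(address)
    host_col, host_row = col_to_index(host.group(1)), int(host.group(2))

    def to_offset(match):
        d_col = col_to_index(match.group(1)) - host_col
        d_row = int(match.group(2)) - host_row
        return f"R[{d_row}]C[{d_col}]"

    return CELL_REF.sub(to_offset, formula)


def equivalence_classes(formulas):
    """Group cell addresses whose formulas are identical in relative form."""
    classes = defaultdict(list)
    for address, formula in formulas.items():
        classes[relative_form(address, formula)].append(address)
    return dict(classes)


# Hypothetical column of copied formulas in which C3 deviates.
formulas = {"C1": "=A1+B1", "C2": "=A2+B2", "C3": "=A3*B3", "D4": "=SUM(C1:C3)"}
for pattern, cells in equivalence_classes(formulas).items():
    print(pattern, cells)
```

In this example C1 and C2 fall into one class while C3 ends up alone, which is the kind of deviation from a regular pattern that highlighting of classes is meant to expose.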

With large spreadsheets (e.g. having more than 5000 used cells), the number of

equivalence classes becomes too large and hence they devised a further abstraction

mechanism called semantic classes. Semantic classes are represented as nodes in a

generated graph and data-flow between cells in different semantic classes is repre-

sented by directed edges as in Fig. 2.8. These graphs are not dynamically generated

Figure 2.7: A spreadsheet with highlighted logical areas (equivalence classes) by Clermont et al.

since information about a spreadsheet (e.g. cell dependencies) is extracted and pro-

cessed separately from the spreadsheet.

Ballinger et al. [9] developed a spreadsheet visualization tool that would first stat-

ically extract artefacts from spreadsheets and then convert this information into

visualizations such as spreadsheet data-flow diagrams. A sample spreadsheet data-

flow diagram is depicted in Fig. 2.9. Ballinger et al. also used a hyperbolic viewer

to view the generated spreadsheet data-flow graphs in an attempt to deal with the

Figure 2.8: A data-flow graph of semantic classes as proposed by Clermont et al.

problem of cluttering in graphs with a large number of nodes and edges (see Fig.

2.10). Unfortunately, hyperbolic viewing does not provide for views in which the

current view displays nodes which match with logical areas in the corresponding

spreadsheet. We want to produce graph views which match with logical areas in the

spreadsheet.

2.5 Summary

The aforementioned spreadsheet visualization tools and techniques indeed offer very

useful insights about data-flow as well as data patterns in spreadsheets which would

not have been possible by just analysing the “data value” view of a given spreadsheet.

However, as already pointed out, there are some drawbacks with the aforemen-

Figure 2.9: A sample spreadsheet data-flow graph by Ballinger et al.

Figure 2.10: Hyperbolic view of a spreadsheet data-flow graph by Ballinger et al.

tioned approaches. For example, in some of the approaches (e.g. the Microsoft

precedents/dependents tracer tool, the arrow tool [17] and the S2 and S3 [54] vi-

sualization), the generated arrows with highlighted areas are superimposed on the

spreadsheet display which introduces cluttering on the display. In the other ap-

proaches which produce computational data-flow graphs, the problem of visualizing

large spreadsheets (hence large graphs) is not adequately handled. For instance,

the fluid visualization tool of Igarashi et al. [34] can only handle spreadsheets with

not more than 400 used cells. Hyperbolic viewing of spreadsheet data-flow graphs

as in the work of Ballinger et al. [9] has the problem that the viewing context

generated by the fisheye views employed does not necessarily match with logical

areas in the corresponding spreadsheet. Another drawback with some of the ap-

proaches is that the visualizations produced are statically generated hence cannot

be useful for real-time spreadsheet-visualization interactivity. A case in point is the

work of Clermont et al. [16, 38] and that of Ballinger [9] where information about a

spreadsheet (e.g. cell dependencies) is extracted and processed separately from the

spreadsheet. Online data dependency diagrams as proposed by Davis [17] are also

processed statically from the spreadsheet.

To sum up, we can identify the following problems with the aforementioned
spreadsheet visualization approaches:

• Some of the approaches clutter the spreadsheet display, which makes the
spreadsheet harder to understand

• Some of the approaches do not adequately handle the visualization of large
spreadsheets

• In some of the approaches which produce data-flow graphs, the generated
graphs do not match logical areas in the corresponding spreadsheet

• Some of the visualizations are statically generated and hence cannot support
real-time interaction between the spreadsheet and its visualization

Our research work is an attempt to address these problems.

Chapter 3

Graph-based Visualization

3.1 Introduction

A graph consists of two finite sets, V and E. Each element of V is called a
vertex or a node. The elements of set E are called edges and these are unordered
pairs of the vertices in set V. A graph may be used to abstractly represent
properties of a system by modelling and simulation if the vertices can be
identified as objects and the edges can be identified as relations between the
objects [29]. In our approach, we model spreadsheet data-flow graphs where
vertices (nodes) represent spreadsheet cells and the set of edges represents the
dependencies between spreadsheet cells as defined by cell references through
formulas.
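
The model above can be illustrated with a minimal Python sketch. It assumes a hypothetical formula view held in a dictionary and a simplified regular expression for cell references; the names `formulas`, `CELL_REF` and `build_dataflow_graph` are illustrative and are not part of the tool described in this work. The two formulas are quoted from the Project Accounting spreadsheet discussed later.

```python
import re

# Hypothetical formula view: cell address -> formula text.
# A real tool would read these from the spreadsheet application.
formulas = {
    "I6": "=I5+E6-F6",
    "I14": "=I13+E14-F14",
}

# Simplified pattern for single-cell references such as E6 or I13.
# Assumption: no ranges (E5:E15), sheet names or absolute ($) references.
CELL_REF = re.compile(r"\b([A-Z]{1,3}[0-9]+)\b")


def build_dataflow_graph(formulas):
    """Return (vertices, edges); an edge (src, dst) means dst's formula reads src."""
    vertices, edges = set(), set()
    for target, formula in formulas.items():
        vertices.add(target)
        for source in CELL_REF.findall(formula):
            vertices.add(source)
            edges.add((source, target))
    return vertices, edges


vertices, edges = build_dataflow_graph(formulas)
print(sorted(vertices))
print(sorted(edges))
```

The edges are stored here as (source, target) pairs so that the direction of data flow is kept; a clustering step that needs unordered pairs can simply ignore the order.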

Using different graph drawing techniques, one can generate the data-flow graph
of any given spreadsheet. However, when the number of nodes is large, the graph
becomes cluttered and therefore difficult to comprehend and navigate. Grouping
the nodes into clusters is a viable solution to this problem. This technique is
known as graph clustering.

3.2 The need for graph clustering

Consider the spreadsheet given in Fig. 3.1, whose formula view and corresponding
data-flow graph are given in Fig. 3.2 and Fig. 3.3 respectively. The spreadsheet
is used to track income and expenditure on several projects being run by some
company. It is very difficult to comprehend the data-flow graph of this
spreadsheet: readability and navigation are two of the problems. This is a
general problem of visualizing large graphs [1, 28], since large graphs contain
a large number of nodes. As already stated above, graph clustering offers a
possible solution to this problem. Graph clustering is the process of separating
the nodes of a graph into components/groups based on some classification
criteria. The separated components are then interpreted as clusters. Clustering
is important because it helps us to view a manageable subset of the generated
graph at a time.

The process of coming up with clusters is known as cluster analysis and it can
be broken down into a series of steps [29]. However, when applying a cluster
analysis procedure, a number of questions need to be answered [29, 66]:

• What are the entities to be clustered?

• When are two entities said to be similar? This is the classification criterion
that determines whether two entities fall under one cluster.

• What is the basis for evaluating the classification criterion? This is
important because it is normally desired to have a classification criterion
that produces clusters which are as “natural” as possible.

• What clustering algorithm should be applied? This is important because a
clustering algorithm is required in order to perform the actual cluster
analysis. In addition, clustering algorithms vary in their effectiveness
depending on the application area for which they are used.

In our case, we need a clustering algorithm that would find clusters that corre-

spond to the logical areas in the given spreadsheet. In other words, the clustering

mechanism has to find “natural” clusters in the spreadsheet data-flow graph.

Figure 3.1: A sample Project Accounting spreadsheet. Adapted from [38].

3.3 An overview of clustering algorithms

There are many clustering algorithms. They can roughly be divided into the
following categories [29, 66]:

• optimization algorithms

• construction algorithms

• hierarchical algorithms

• graph theoretical algorithms

Figure 3.2: The formula view of the Project Accounting spreadsheet

Figure 3.3: A data flow graph of the given Project Accounting spreadsheet generated
by the Graphael graph drawing software.

This categorization is not exhaustive, as there are some algorithms which might
not fall in the listed categories. There are also algorithms which are hybrids
of other algorithms. It is important to note that an algorithm may produce
either disjoint or overlapping clusters. With disjoint clusters, a sample of
entities is split into non-overlapping subsets, so that every element of the
sample is attached to exactly one cluster. With overlapping clusters, on the
other hand, a sample of entities may be split into overlapping subsets, so that
an element may be attached to more than one cluster.

Clustering algorithms can also be either supervised or unsupervised [66].
Supervised algorithms are provided with a priori knowledge while this is not the
case with unsupervised algorithms. An example of a priori knowledge could be the
number of clusters that need to be generated.

3.3.1 Optimization algorithms

These algorithms are based on a clustering criterion that needs to be optimized.

The clustering criterion is expressed as a quality function. Clusters are produced

at the optimal value of the quality function. The main drawback with optimiza-

tion algorithms is the computation time needed to find an optimum of the quality

function.

3.3.2 Construction algorithms

In construction algorithms, the cluster construction process starts with an arbitrary

choice of some elements which are believed to be “typical” representatives of some

clusters. These representative elements are called kernels. Using an iterative process,

all elements which are geometrically nearest to each kernel are attached to the

kernel’s group. The process stops if the clusters become too heterogeneous. During

the iterative process, elements which get closer to the geometric centre of a new

group than to the centre of the group they previously have been attached to are

reclassified. This reclassification implies heavy computations.

3.3.3 Hierarchical algorithms

Hierarchical algorithms build a hierarchy of clusterings as in a genetic tree
(dendrogram), whereby each level in the hierarchy contains the same clusters as
the level immediately below it, except for two clusters which are joined to form
one cluster. Hierarchical algorithms may be categorized into two types:

• Agglomerative or bottom-up algorithms

• Divisive or top-down algorithms

In agglomerative algorithms, clusters at a higher level are formed by the fusion
of clusters which are at a lower level in the hierarchy. The starting point is
the set of single-membered clusters at the lowest level of the hierarchy.
Divisive algorithms, on the other hand, are the complete opposite of
agglomerative algorithms. They start the clustering process with all entities
contained in one cluster. Thereafter, in each iterative step, a cluster is split
into two clusters until the lowest level of the hierarchy contains
single-membered clusters. Agglomerative algorithms offer an advantage over
divisive algorithms because it is computationally cheaper to perform a bottom-up
clustering process than a top-down clustering process.
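
The bottom-up process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entities are one-dimensional numbers, the similarity measure is the gap between neighbouring clusters (single linkage), and the function name `agglomerate` is hypothetical. None of this is taken from the clustering software used in this work.

```python
def agglomerate(points):
    """Bottom-up clustering of 1-D points; returns every level, lowest first."""
    # Lowest level of the hierarchy: one single-membered cluster per point.
    clusters = [[p] for p in sorted(points)]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # On sorted 1-D data the two closest clusters are always neighbours,
        # so it is enough to look for the smallest gap between adjacent ones.
        best = min(range(len(clusters) - 1),
                   key=lambda i: clusters[i + 1][0] - clusters[i][-1])
        merged = clusters[best] + clusters[best + 1]
        clusters = clusters[:best] + [merged] + clusters[best + 2:]
        levels.append([list(c) for c in clusters])
    return levels


levels = agglomerate([1, 2, 6, 7, 20])
for level in levels:
    print(level)
```

Each printed level differs from the one before it by exactly one fusion of two clusters, which is the dendrogram property described above.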

3.3.4 Graph theoretical algorithms

Graph theoretical algorithms work on graphs whereby nodes represent entities and

edges represent entity relations. These algorithms do not start from the individual

nodes but they try to find subgraphs which will form clusters. Examples of sub-

graphs include connected components and spanning trees. The algorithms used to

find these subgraphs are based on graph theory. Some graph theoretical algorithms

reduce the number of nodes in a graph by merging them into aggregate nodes which

can be interpreted as nodes or can be used as input for a new iteration resulting in

higher level aggregates.
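
As a hedged illustration of the simplest graph theoretical approach, the sketch below treats each connected component of a graph as one cluster, using a breadth-first search. The function name `connected_components` and the small example graph (built from the two formulas I6=I5+E6-F6 and I14=I13+E14-F14) are assumptions made for the example only.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


vertices = {"I5", "E6", "F6", "I6", "I13", "E14", "F14", "I14"}
edges = {("I5", "I6"), ("E6", "I6"), ("F6", "I6"),
         ("I13", "I14"), ("E14", "I14"), ("F14", "I14")}
components = connected_components(vertices, edges)
print(components)
```

In a real spreadsheet most cells are connected to each other through chains of references, so plain connected components would usually yield one very large cluster; this is why a finer-grained algorithm is needed.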

3.4 Choice of clustering algorithm

In our case, we need a clustering algorithm that would find clusters that
correspond to logical areas in spreadsheets. A logical area in a spreadsheet may
be defined as a group of cells that, from the spreadsheet creator's perspective,
form a logical unit due to the semantics of the spreadsheet [35, 36]. The
semantics of a spreadsheet define what the spreadsheet is all about (the meaning
of the spreadsheet). Therefore the clustering algorithm has to find “natural”
clusters in the spreadsheet data-flow graph.

Based on our experiments, we found out that the Markov Clustering (MCL) algo-

rithm [61, 62] finds “natural” clusters in spreadsheet data-flow graphs. We present

a detailed description of these experiments in Chapter 4. Generally, natural clusters

in a graph are characterised by the presence of many edges between the members

of that cluster and one expects that random walks on the graph will infrequently

go from one natural cluster to another [61, 62]. Due to its ability to find natural

clusters, the MCL algorithm has also been used in many advanced applications. For

example, the algorithm has been reliably used for the assignment of proteins into

families based on precomputed sequence similarity information [22].

3.4.1 An overview of the MCL algorithm

The Markov Clustering (MCL) algorithm is a graph clustering algorithm that uses
column stochastic (Markov) matrices to simulate random walks through a graph. A
column stochastic matrix is a matrix whose entries are non-negative and whose
columns are probability vectors, i.e. the sum of the matrix entries in each
column is 1.
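
A minimal sketch of how a column stochastic matrix can be obtained from an adjacency matrix is given below. It assumes a small, hypothetical three-node path graph and adds a self-loop to every node, which is a common preprocessing step for MCL but is an assumption here rather than something stated in this section. The function name `column_stochastic` is illustrative.

```python
def column_stochastic(adjacency):
    """Scale each column of a square non-negative matrix so that it sums to 1."""
    n = len(adjacency)
    result = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(adjacency[i][j] for i in range(n))
        if total == 0:
            raise ValueError(f"column {j} has no weight to normalize")
        for i in range(n):
            result[i][j] = adjacency[i][j] / total
    return result


# Hypothetical path graph 0 - 1 - 2, with a self-loop (the 1s on the diagonal)
# added to every node.
adjacency = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
M = column_stochastic(adjacency)
for row in M:
    print([round(x, 3) for x in row])
```

Entry M[i][j] is then the probability of moving from node j to node i in one step, which is the column-wise convention used in the rest of this section.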

The first step of the algorithm is to associate a given input graph with a
column stochastic matrix, M, such that entry Mij gives the probability of moving
from node j to node i in the input graph (note that the column index denotes the
starting node). Two operations known as expansion and inflation are then
performed iteratively, starting with the associated stochastic matrix, thus
simulating random walks through the input graph.

An expansion operation takes a power of the stochastic matrix using the normal
matrix product; in the basic algorithm (Algorithm 1) the matrix is squared. An
inflation operation takes the Hadamard power of the matrix resulting from the
expansion operation, that is, it raises each matrix entry to a given power. The
exponent is set through what is known as the inflation operator. The resulting
matrix is then normalized (each column is rescaled so that it sums to 1) so that
we have a stochastic matrix again. Expansion and inflation are repeated
alternately until we get a doubly-idempotent matrix, i.e. a matrix that does not
change under further expansion and inflation operations.

Expansion computes the probabilities of random walks along longer paths. That
is, for any pair of nodes we obtain the probability that a longer random walk
starting at one node ends at the other. Since there are more longer paths within
clusters than between different clusters, node pairs in the same cluster receive
larger probabilities, because there are so many ways of going from one node to
the other. The inflation operation then boosts the larger probabilities further
and demotes the smaller ones. Thus inflation boosts the probabilities of
intra-cluster walks and demotes inter-cluster walks. Intra-cluster walks are
therefore more favoured than inter-cluster walks.
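
The boosting effect of inflation can be seen on a single column of probabilities. The following sketch is illustrative only: the column values are made up, and `inflate_column` is a hypothetical helper that raises each entry to the power r and rescales the column so that it sums to 1 again.

```python
def inflate_column(column, r):
    """Raise each probability to the power r, then rescale to sum to 1."""
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]


# Made-up probabilities of ending a walk at three different nodes.
column = [0.6, 0.3, 0.1]
for r in (1.0, 2.0, 4.0):
    print(r, [round(p, 3) for p in inflate_column(column, r)])
```

With r = 1 the column is unchanged; as r grows, the largest entry takes an ever larger share while the small entries shrink towards zero, which is how weak inter-cluster links are suppressed.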

The process of jointly iterating expansion and inflation results in a very
sparse stochastic matrix, which is interpreted as the separation of the input
graph into different connected components which are in turn interpreted as
clusters. An example graphical representation of the MCL cluster separation
process is given in Fig. 3.4.

Figure 3.4: An example MCL cluster separation process from van Dongen [61].

An MCL cluster would therefore be characterized by the following attributes:

• the presence of many edges between members of a cluster

• the number of higher-length paths between two arbitrary nodes in the cluster
is larger than between two arbitrary nodes from different clusters

• if one takes a random walk through a dense cluster then the random walker
will likely not leave the cluster until many of its nodes have been visited.

The basic MCL algorithm is given in Algorithm 1 below.

Algorithm 1 The basic MCL algorithm

    G is the input graph
    set M1 to be the associated matrix of random walks on graph G
    set the inflation operator Γ to some value
    repeat
        M2 = M1 ∗ M1                  // this is expansion
        M1 = Γ(M2)                    // this is inflation
        change = difference(M1, M2)
    until (change = 0)                // zero matrix
    set clusters as the components of M1

It is important to note that the inflation operator can be altered using the
parameter Γ. Increasing this parameter has the effect of making the inflation
operator stronger, and this increases the granularity or tightness of clusters.
In addition to this, it is also important to note that the MCL algorithm has
been proven to converge quadratically. In practice, the algorithm starts to
converge noticeably after 3 to 10 iterations [62].
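
Algorithm 1 can be turned into a small runnable sketch in pure Python. This is not the Graphael implementation: the self-loops, the tolerance `tol`, the iteration cap `max_iter`, the threshold used to read clusters off the final matrix, and the example graph (two triangles joined by a single edge) are all assumptions made for illustration.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix with the ordinary matrix product."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, r):
    """Inflation: entry-wise power r followed by column normalization."""
    return normalize_columns([[x ** r for x in row] for row in matrix])


def mcl(adjacency, r=2.0, tol=1e-9, max_iter=100):
    n = len(adjacency)
    # Assumption: add a self-loop to every node before normalizing.
    with_loops = [[1.0 if i == j else float(adjacency[i][j]) for j in range(n)]
                  for i in range(n)]
    m1 = normalize_columns(with_loops)
    for _ in range(max_iter):
        m2 = expand(m1)
        m1 = inflate(m2, r)
        change = max(abs(m1[i][j] - m2[i][j])
                     for i in range(n) for j in range(n))
        if change < tol:  # numerically doubly idempotent
            break
    # Rows with a positive diagonal entry act as attractors; the columns with
    # positive entries in such a row form one cluster.
    clusters = set()
    for i in range(n):
        if m1[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m1[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2 - 3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
print(clusters)
```

A floating-point tolerance replaces the exact test `change = 0` of Algorithm 1, since an exact zero difference is rarely reached in floating-point arithmetic. Dense list-of-lists matrices keep the sketch dependency-free; a practical implementation would use sparse matrices.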

3.5 Choice of graph drawing software

Graph-based visualization is a way of representing structural information as
diagrams of abstract graphs and networks. It is tedious to draw such graphs by
hand. Therefore, these graphs are drawn automatically using graph drawing
software. Graph drawing software usually provides a variety of graph layout
algorithms. Different graph drawing software has been used in a wide variety of
important applications in software engineering, database and web design,
networking, and in visual interfaces for many other domains.

We investigated two open-source Java-based graph drawing programs in this work.
These are the ZGRViewer [47] and the Graphael [26, 30] programs. Each of the
programs was used in collaboration with the Microsoft Excel spreadsheet
application program. We used open-source programs not only because they are free
in terms of monetary cost but, most importantly, because we were able to modify
the source code of the programs to suit our needs.

3.5.1 Experiments with the ZGRViewer graph drawing software

ZGRViewer is a 2.5D graph visualization program implemented in Java. It is
specifically aimed at displaying graphs expressed in the DOT graph modelling
language using the GraphViz [20, 31] graph drawing library. ZGRViewer is
designed to handle large graphs and offers a zoomable user interface (ZUI),
which enables smooth zooming and easy navigation in the visualized structure. A
screenshot of ZGRViewer displaying a spreadsheet data-flow graph is given in
Fig. 3.5. A zoomed-in screenshot of the same spreadsheet data-flow graph is
given in Fig. 3.6. Although ZGRViewer is able to handle large graphs efficiently
through smooth and continuous geometric zooming (zooming in/out), as illustrated
in Fig. 3.6, it has some shortcomings:

• The whole context of the graph is lost as one zooms in to get a detailed view
of a part of the graph (see Fig. 3.6).

• Graph clustering is a potential solution to the problem of visualizing large
graphs, because it allows us to view a subset of the whole graph at a particular
time. Unfortunately, the graph drawing libraries which ZGRViewer uses do not
have any graph clustering algorithm implemented in them.

Figure 3.5: A screenshot of the ZGRViewer graph drawing software displaying an
unzoomed data-flow graph of a spreadsheet.

Figure 3.6: A screenshot of a zoomed-in spreadsheet data-flow graph in ZGRViewer.

We therefore experimented with another graph drawing program, Graphael.

3.5.2 Experiments with the Graphael graph drawing software

The Graphael program has a number of graph clustering algorithms implemented,
including a geometric graph clustering algorithm as well as the MCL algorithm. A
geometric clustering algorithm clusters nodes according to their spatial
locality given an initial layout of the entire graph.

Because it relies on the positions of nodes in the drawing rather than on the
dependencies between cells, the geometric clustering algorithm does not produce
clusters that match logical areas in the corresponding spreadsheet. On the other
hand, our experiments with the MCL algorithm showed that the clusters produced
would in most cases match logical areas in the corresponding spreadsheet. A
detailed discussion of our experiments with the MCL algorithm is given in
Chapter 4.

Chapter 4

The MCL Algorithm and Logical Areas in Spreadsheets

4.1 Introduction

We conducted experiments on several spreadsheets to determine the performance of

the MCL algorithm in finding “natural clusters” in spreadsheet data-flow graphs.

We also determined whether the “natural clusters” match with logical areas in the

corresponding spreadsheet. In this chapter, we present details of the experiments

and our findings.

4.2 Generating spreadsheet data-flow graphs using Graphael

For the Project Accounting spreadsheet given in Fig. 4.1, whose formula view is
given in Fig. 4.2, we generated the corresponding data-flow graph using the
Graphael program. The spreadsheet is used to track income and expenditure for
some projects being run by some company. To avoid the problem of graph
cluttering, Graphael provides the MCL algorithm to generate clusters.
Conceptually, the generated clusters are hierarchically arranged in a cluster
tree. An illustration of a cluster tree is given in Fig. 4.3. The leaves of a
cluster tree are the actual nodes of the generated graph while the higher-level
nodes of the cluster tree represent clusters. The root of the tree is the
highest-level cluster of the graph.

Figure 4.1: The sample Project Accounting spreadsheet

Figure 4.2: Formula view of the Project Accounting spreadsheet.

In the Graphael program, navigation through the cluster tree is achieved by
using compound fisheye views and treemaps [1]. Fisheye views are a graph
visualization technique which allows one to view a graph as a whole while at the
same time letting the viewer see detailed parts of the graph without losing the
overall context of the graph. Compound fisheye views are a fisheye view
technique provided by Graphael that enables one to view the members of a
particular cluster while at the same time showing any relationships between the
cluster members and the rest of the clusters in the cluster tree. Treemaps, on
the other hand, are a visualization technique in which hierarchical information
is displayed within nested rectangles, with each level of nesting corresponding
to a level of hierarchical decomposition. In our case, the cluster tree is also
displayed using nested rectangles.

Figure 4.3: An illustration of a cluster tree

Figure 4.4: A top-most level view of the cluster tree of the Project Accounting
spreadsheet data-flow graph as displayed using Graphael.

Using the Graphael program, the cluster tree of the spreadsheet data-flow graph
is visualized using two windows which are displayed side by side as in Fig. 4.4.
The right-side window is the cluster window while the left-side window is a
treemap window. In Fig. 4.4, the cluster window displays the root node of the
cluster tree, which is represented by a dot. The treemap window is an important
complementary aid for navigating the cluster tree: it not only helps in
determining the level we are at while navigating the cluster tree in the cluster
window but also indicates the number of nodes which are in a selected cluster.
We determine the level we are at in the treemap window by counting the number of
thickened rectangular borders from the outermost border to the currently
highlighted thickened border.
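
The two pieces of information read off the treemap, namely the level of a node and the number of graph nodes in a selected cluster, can be expressed on a simple tree data structure. The sketch below is an assumption-laden illustration and not Graphael's internal representation; `ClusterNode`, `leaf_count` and `level_of` are hypothetical names, and the example tree contains only two of the clusters later listed in Table 4.3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterNode:
    label: str
    children: List["ClusterNode"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children

    def leaf_count(self):
        """Number of graph nodes (spreadsheet cells) below this node."""
        if self.is_leaf():
            return 1
        return sum(child.leaf_count() for child in self.children)


def level_of(root, label, depth=0):
    """Level of the node with the given label (root is level 0), or None."""
    if root.label == label:
        return depth
    for child in root.children:
        found = level_of(child, label, depth + 1)
        if found is not None:
            return found
    return None


root = ClusterNode("root", [
    ClusterNode("cluster 1", [ClusterNode(c) for c in ("D6", "F6", "G6", "H6")]),
    ClusterNode("cluster 5", [ClusterNode(c) for c in ("F10", "G10", "H10")]),
])
print(root.leaf_count())
print(level_of(root, "cluster 5"), level_of(root, "G10"))
```

Counting nested borders in the treemap corresponds to the depth returned by `level_of`, and the number of cells shown inside a rectangle corresponds to `leaf_count`.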

Clicking on the root node of the cluster tree as depicted in the cluster window

in Fig. 4.4 leads to the display of the nodes (clusters) at the next lower level of the

cluster tree as depicted in Fig. 4.5. On the other hand, right-clicking on any node

in the currently displayed cluster leads to viewing of nodes which are at the next

higher level in the cluster tree (going up the cluster tree).

In our case, a look at the corresponding treemap in Fig. 4.5 shows that the next
lower-level nodes are leaf nodes. Therefore, clicking on any node (cluster) in
Fig. 4.5 should lead to the leaf nodes in that particular cluster. For example,
in Fig. 4.6, we have a cluster containing cells D6, F6, G6 and H6, which are
depicted by labelled nodes. The unlabelled nodes indicate clusters which are not
currently selected. Fig. 4.7 depicts a cluster with cells F10, G10 and H10.

Figure 4.5: Second level view of the cluster tree.

Figure 4.6: An MCL cluster containing cells D6, F6, G6, H6

Compound fisheye views help us to see how the cluster members currently being
viewed relate to one another and to the unselected clusters. For example, in
Fig. 4.6, cluster member G6 is linked to three nodes: F6, D6 and an unlabelled
cluster. We can view these details without losing the overall context of the
clusters which are at a particular level in the cluster tree.

Figure 4.7: An MCL cluster containing cells F10, G10 and H10

4.3 Determining the inflation operator for the MCL algorithm

The size of MCL clusters depends on the value of the inflation operator [62].
According to the MCL algorithm, the inflation operator, Γ, has to be greater
than 1 (Γ > 1). As Γ gets larger, we expect tighter (smaller-sized) clusters; as
Γ gets smaller, we expect bigger-sized clusters.

We then set up an experiment to find the value of the inflation operator, Γ,
that gives MCL clusters that most closely match logical areas in a given
spreadsheet. We used the same Project Accounting spreadsheet given in Fig. 4.1,
whose formula view is given in Fig. 4.2. The results of the experiment are given
below:

With Γ = 1.1 (Γ > 1):

The corresponding treemap and cluster tree are depicted in Fig. 4.8. The
resulting clusters are summarized in Table 4.1. Clearly, we do not get any
meaningful MCL clusters (compare the tabulated clusters with the formula view of
the spreadsheet in Fig. 4.2).

Figure 4.8: Treemap and cluster tree with Γ = 1.1

Cluster No.   Member Cells
1.            B5:B14
2.            the rest of the spreadsheet

Table 4.1: MCL clusters for the Project Accounting spreadsheet with Γ = 1.1

With Γ = 1.5 (Γ > 1):

The corresponding treemap and cluster tree are given in Fig. 4.9. Refer to Table
4.2 for a summary of the identified clusters. A comparison of the tabulated MCL
clusters with the formula view of the spreadsheet in Fig. 4.2 shows a mismatch
with most logical areas.

Figure 4.9: Treemap and cluster tree with Γ = 1.5

Cluster No.   Member Cells
1.            B5:B10
2.            B11:B14
3.            F5, F15
4.            H5, H10, H15
5.            E5:E15, F6:F14, I5:I14
6.            D6:D9, D11:D14, G5:G15, H6:H9

Table 4.2: MCL clusters for the Project Accounting spreadsheet with Γ = 1.5

With Γ = 2.0 (Γ > 1):

The corresponding treemap and cluster tree are depicted in Fig. 4.10. The MCL
clusters identified with Γ = 2.0 are listed in Table 4.3. A comparison of the
identified clusters with the formula view of the spreadsheet shows matches with
most logical areas in the spreadsheet.

Figure 4.10: Treemap and cluster tree with Γ = 2.0

Cluster No.   Member Cells
1.            D6, F6, G6, H6
2.            D7, F7, G7, H7
3.            D8, F8, G8, H8
4.            D9, F9, G9, H9
5.            F10, G10, H10
6.            D11, F11, G11, H11
7.            D12, F12, G12, H12
8.            D13, F13, G13, H13
9.            D14, F14, G14, H14
10.           E5:E15, I14
11.           F5, F15
12.           G5, G15
13.           H5, H15
14.           I5, I6
15.           I7
16.           I8
17.           I9
18.           I10
19.           I11
20.           I12
21.           I13
22.           B5:B8
23.           B9, B10
24.           B11:B14

Table 4.3: MCL clusters for the Project Accounting spreadsheet with Γ = 2.0

With Γ = 2.5 (Γ > 1):

Consider the treemap (left window) in Fig. 4.11. It is clear from the treemap
that we have many MCL clusters with only one, two or three members. For example,
one single-membered cluster contains cell F7. An example of an identified
two-membered cluster is the cluster containing cells E10 and I10. An example of
a three-membered cluster is the cluster containing cells B12, B13 and B14.
Clearly, we cannot get any meaningful clusters.

Figure 4.11: Treemap and cluster tree with Γ = 2.5

With Γ = 3.0 (Γ > 1):

We get the treemap and cluster tree in Fig. 4.12. Again, we have many
smaller-sized clusters.

Figure 4.12: Treemap and cluster tree with Γ = 3.0

With Γ = 5.0 (Γ > 1):

We get the treemap and cluster tree in Fig. 4.13. Again, we have many
smaller-sized clusters. The largest cluster in this case has only three member
cells (see the treemap).

Figure 4.13: Treemap and cluster tree with Γ = 5.0

With Γ = 7.0 (Γ > 1):

We get the treemap and cluster tree in Fig. 4.14. Again, we have many
smaller-sized clusters. The largest cluster in this case has only two member
cells (see the treemap).

Figure 4.14: Treemap and cluster tree with Γ = 7.0

4.3.1 Discussion of experiment results

Using the same analysis technique as with the Project Accounting spreadsheet, we
extended our experiment to more spreadsheets. Our results show that the
inflation operator value Γ = 2 gives clusters that best match logical areas in
the spreadsheet. Values less than 2 (Γ < 2) give us bigger-sized clusters which
do not match logical areas in the spreadsheet. On the other hand, values greater
than 2 (Γ > 2) give us many smaller (tighter) clusters which are not useful
either.

For the Project Accounting spreadsheet, the MCL clusters identified when Γ = 2,
as indicated in Table 4.3, are highlighted with different cell background
colours and cell border styles in the spreadsheet as in Fig. 4.15 and Fig. 4.16.

Figure 4.15: The Project Accounting spreadsheet showing highlighted MCL clusters
(when Γ = 2)

Figure 4.16: The formula view of the Project Accounting spreadsheet with high-
lighted MCL clusters (when Γ = 2)

4.4 Testing the efficacy of the MCL algorithm on more spreadsheets

To test the efficacy of the MCL algorithm, we ran the algorithm on one more
spreadsheet while keeping the inflation operator at Γ = 2. The sample
spreadsheet used is the Consolidated Balance Sheet depicted in Fig. 4.17. The
formula view of the spreadsheet is given in Fig. 4.18. A treemap and cluster
tree for the spreadsheet, showing a cluster with cell members F34, F35, F36,
F37, F38, F39 and F40, are depicted in Fig. 4.19.

Table 4.4 is a summary of the identified MCL clusters for the spreadsheet. For
each cluster, we also determine its degree of conformance. We define the degree
of conformance in terms of the number of cells in an MCL cluster and the number
of cells which are supposed to be in the corresponding logical area of the
spreadsheet. For example, the degree of conformance for cluster 2 is 6/8. This
is interpreted as follows: the corresponding logical area for this cluster is
supposed to have 8 cells, but cluster 2 contains only 6 of the 8 cells. A
similar interpretation goes for all the other clusters indicated in Table 4.4.
The identified MCL clusters for the Consolidated Balance Sheet spreadsheet are
then highlighted as in Fig. 4.20 and Fig. 4.21.
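
The degree of conformance for cluster 2 can be recomputed with a short sketch. The helper names `expand_range`, `cells` and `degree_of_conformance` are hypothetical, the range parser only handles single-column ranges, and the logical area used here is an assumption reconstructed from the comment for cluster 2 in Table 4.4 (the cluster's own cells plus F40 and F60).

```python
import re


def expand_range(spec):
    """Expand 'F42:F46' or 'F61' into a list of cell addresses (one column)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)(?::([A-Z]+)(\d+))?", spec)
    if match is None:
        raise ValueError(f"unsupported cell specification: {spec}")
    col, first, end_col, last = match.groups()
    if end_col is None:
        return [f"{col}{first}"]
    if end_col != col:
        raise ValueError("only single-column ranges are handled in this sketch")
    return [f"{col}{row}" for row in range(int(first), int(last) + 1)]


def cells(*specs):
    """Set of all cell addresses named by the given specifications."""
    return {cell for spec in specs for cell in expand_range(spec)}


def degree_of_conformance(cluster, logical_area):
    """(cells of the logical area found in the cluster, size of the logical area)."""
    return len(cluster & logical_area), len(logical_area)


cluster_2 = cells("F42:F46", "F61")
# Assumed logical area: cluster 2 plus the two cells reported as left out.
area_2 = cells("F40", "F42:F46", "F60", "F61")
found, expected = degree_of_conformance(cluster_2, area_2)
print(f"{found}/{expected}")
```

The result is kept as a pair rather than a reduced fraction so that it can be printed in the same form as the entries of Table 4.4.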

4.4.1 Discussion of experiment results

The MCL algorithm was able to identify clusters that match logical areas in the
Consolidated Balance Sheet spreadsheet. Referring to Table 4.4, the minor
deviations in clusters 2, 5, 8 and 12 occur because a cell can belong to only
one MCL cluster, namely the cluster in which the cell has the higher probability
of being visited in a random walk. Using the degree of conformance as a measure,
we can conclude that the performance of the MCL algorithm is satisfactory.

Figure 4.17: The Consolidated Balance Sheet spreadsheet from the EUSES
spreadsheet corpus [25]

Figure 4.18: The formula view of the Consolidated Balance Sheet spreadsheet

Figure 4.19: A treemap and cluster tree for the Consolidated Balance Sheet depicting
a cluster with cell members F34, F35, F36, F37, F38, F39 and F40

Cluster No.   Member Cells     Degree of conformance   Comments
1.            F50, F54:F60     8/8
2.            F42:F46, F61     6/8                     F40 and F60 are left out because they have been put in other clusters
3.            F34:F40          7/7
4.            E21:E23          3/3
5.            E19, E25:E29     6/8                     E17 and E23 are left out because they have been put in other clusters
6.            E9:E17           9/9
7.            F21:F23          3/3
8.            F19, F25:F29     6/8                     F17 and F23 are left out because they have been put in other clusters
9.            F9:F17           9/9
10.           E50, E54:E60     8/8
11.           E34:E40          7/7
12.           E42:E46, E61     6/8                     E40 and E60 are left out because they have been put in other clusters

Table 4.4: MCL clusters for the Consolidated Balance Sheet spreadsheet

Figure 4.20: The Consolidated Balance Sheet with highlighted (shaded) MCL
clusters

Figure 4.21: Formula view of the Consolidated Balance Sheet with highlighted
(shaded) MCL clusters

Chapter 5

Comprehending and Debugging Spreadsheets Using MCL Clusters

5.1 Introduction

One of the goals of our spreadsheet visualization tool is to aid in the
comprehension and debugging of spreadsheets. In this chapter, we demonstrate how
identified MCL clusters can be used to serve that purpose through a process of
cluster member verification. Cluster member verification involves verifying
whether the members of the identified clusters belong to their respective
logical areas. The aim of this process is to understand a spreadsheet as well as
to identify errors (if any) in the spreadsheet. We use two different
spreadsheets in our experiments.

5.2 Analysis of the Project Accounting spreadsheet

We again consider the Project Accounting spreadsheet in Fig. 5.1, whose formula
view is given in Fig. 5.2 below. The identified MCL clusters for the spreadsheet
are given in Table 5.1. The Project Accounting spreadsheet with highlighted MCL
clusters is also given in Fig. 5.3, together with a captured Microsoft Excel
error message.

Figure 5.1: The Project Accounting spreadsheet

Figure 5.2: The formula view of the Project Accounting spreadsheet.

5.2.1 Verification of MCL clusters for the Project Accounting spreadsheet

Referring to Table 5.1 in conjunction with the formula view of the Project
Accounting spreadsheet in Fig. 5.2, clusters 1 to 9 have members which, from the
user’s point of view, indeed fall in their respective logical areas. For
example, for cluster 1, cells D6, F6, G6 and H6 are in a logical area relating
to “Ted” (see Fig. 5.2). We have similar cases for cluster 2 to cluster 9.

Cluster No.   Member Cells
1.            D6, F6, G6, H6
2.            D7, F7, G7, H7
3.            D8, F8, G8, H8
4.            D9, F9, G9, H9
5.            F10, G10, H10
6.            D11, F11, G11, H11
7.            D12, F12, G12, H12
8.            D13, F13, G13, H13
9.            D14, F14, G14, H14
10.           E5:E15, I14
11.           F5, F15
12.           G5, G15
13.           H5, H15
14.           I5, I6
15.           I7
16.           I8
17.           I9
18.           I10
19.           I11
20.           I12
21.           I13
22.           B5, B6, B7, B8
23.           B9, B10
24.           B11, B12, B13, B14

Table 5.1: MCL clusters for the Project Accounting spreadsheet

Figure 5.3: Microsoft Excel displays an error message for a cell in MCL cluster
number 5 in the Project Accounting spreadsheet.

Cluster 5 may draw some interest since its members do not follow the pattern of
its neighbouring clusters. A look at the formula view in Fig. 4.16 shows why:
the formulas of these cells are structurally different from those of the cells
in the neighbouring clusters. This is not an error, despite the fact that
Microsoft Excel produces an error-warning message (see Fig. 5.3).

Cluster 10 has cell range E5:E15 and cell I14. Cell I14 seems to be the odd one
out in this cluster, but it is not. Cell I14 belongs to this cluster through the
cell dependency defined by the formula I14=I13+E14-F14 (see the formula view in
Fig. 5.2). This is an example of a case where it may not be obvious to the user
that cluster members belong to the same logical area. The phenomenon arises
because a cell may belong to more than one logical area, e.g. column-wise and
row-wise, yet it belongs to exactly one MCL cluster. The cell is placed in the
cluster where it has the highest probability of being visited in a random walk
as defined by the MCL algorithm. In this case, cell I14 has to belong to cluster 10.
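
The random-walk criterion can be illustrated with a minimal sketch of the MCL iteration, which alternates expansion and inflation of a column-stochastic matrix. This is not the Graphael implementation: the toy graph (two triangles joined by one edge), the inflation value of 2, the fixed iteration count, the threshold and all function names are assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise each entry to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=20, threshold=1e-3):
    """Return clusters (sorted lists of node indices) for a small graph."""
    n = len(adjacency)
    # Assumption: self-loops are added before normalizing.
    m = normalize_columns(
        [[1.0 if (adjacency[i][j] or i == j) else 0.0 for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        m = inflate(expand(m), inflation)

    # Nodes i and j share a cluster when a walk starting at j still has
    # non-negligible probability of being at i; merge them with union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Toy graph: triangle 0-1-2 and triangle 3-4-5, joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adjacency[a][b] = adjacency[b][a] = 1

clusters = mcl(adjacency)
print(clusters)
```

Node 2 is linked to both triangles, much as cell I14 is linked to both a row and a column, yet the iteration places it in exactly one cluster: the one in which a random walk starting from it spends most of its time.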

Cluster 11, cluster 12 and cluster 13 are also in their respective logical areas.
However, one would expect cluster 11, for example, to include the cell range
F5:F14, yet it has cells F5 and F15 only. Cells F6 to F14 are members of other
clusters (clusters 1 to 9). The reason is that a cell can belong to only one
cluster at a time, namely the cluster where it has the highest probability of being
visited in a random walk as defined by the MCL algorithm. The same explanation
goes for clusters 12 and 13.

Cluster 14 has two cells, cell I5 and cell I6, which are connected by the formula

I6=I5+E6-F6. Clusters 15 to 21 are all single-membered. A look at the formula

view suggests that they should belong to one logical area at least from the user’s

perspective.

From the user’s perspective, clusters 22, 23 and 24 should belong to one logical
area containing the cell range B5:B14. However, the MCL algorithm has split this
logical area into three different clusters. This is an example of a case where
MCL clusters do not match the user’s perspective of a logical area.

From Table 5.1, we could say that clusters 1 to 21 match the user’s perspective
of a logical area while clusters 22, 23 and 24 provide a mismatch. This
represents a success rate of 87.5% (21 out of 24 clusters) for the MCL algorithm.
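
The 87.5% figure follows directly from the cluster counts in Table 5.1. As a small worked check (the variable names are ours):

```python
# Clusters judged to match the user's view of a logical area (1 to 21)
# and clusters judged to mismatch (22 to 24), as counted from Table 5.1.
matching = list(range(1, 22))
mismatching = [22, 23, 24]

total = len(matching) + len(mismatching)
success_rate = 100 * len(matching) / total
print(f"{len(matching)} of {total} clusters match: {success_rate:.1f}%")
```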

5.3 Analysis of the IPO spreadsheet

We also consider the IPO spreadsheet given in Fig. 5.4. The spreadsheet is used
to calculate income after tax for a company and has been seeded with errors.
Table 5.2 lists the clusters defined by the MCL algorithm for the IPO
spreadsheet. The identified MCL clusters are highlighted in Fig. 5.5. The formula
view of the spreadsheet with highlighted clusters is given in Fig. 5.6.

Figure 5.4: A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet
research website [43]

Cluster No. Member Cells


1. B6, B20
2. C6, C20
3. B4, B8:B15, B17, B18, B19, B21
4. C4, C8:C15, C17, C18, C19, C21

Table 5.2: MCL clusters for the IPO spreadsheet given in Fig. 5.4

Figure 5.5: The IPO spreadsheet with highlighted MCL clusters.

Figure 5.6: The formula view of the IPO spreadsheet

Figure 5.7: IPO spreadsheet with a Microsoft Excel warning message

5.3.1 Verification of MCL clusters for the IPO spreadsheet

The MCL clusters presented in Table 5.2 are highlighted in Fig. 5.5. Cluster 1
has two cells, B6 and B20. The formula view of the IPO spreadsheet in Fig. 5.6
shows that the two cells are connected by the formula B20=B6*B17. According
to the MCL algorithm, these two cells belong to the same cluster. It is then up
to the user to ask whether the cells should really belong to the same cluster.

In this case, a user should notice that “Taxes” (B20) was probably intended to be
calculated from “Corporate income tax rate” (B5) and “Sales Revenues” (B17), and
not from “Depreciation rate” (B6) and “Sales Revenues”. The end-user would thus
recognise this as an error and correct cell B20 to the formula B20=B5*B17.
Hence the MCL clustering technique can help the end-user in debugging a
spreadsheet. All the end-user needs to do is to verify whether the members of a
particular MCL cluster logically belong together. A similar analysis goes
for cluster 2, which has cell members C6 and C20.

Notice that in Fig. 5.7, Microsoft Excel produces an error warning message
about cell B20. We feel that the message is not very helpful because the clue it
gives towards fixing the error is not appropriate. On the other hand, it is easy
to deduce the source of the error from MCL clusters by simply verifying whether
the members of a cluster logically belong together.

It is easy to see that cluster 3 and cluster 4 belong to logical areas that match
the user’s perspective. However, it is interesting to note that cells B5, C5,
B7 and C7 do not belong to any cluster. A closer look at Fig. 5.5 shows that
these cells have not been highlighted. A spreadsheet developer would therefore
ask why this is the case and, in trying to answer the question, would realize
that these cells have not been used in any calculation in the spreadsheet. This
is a potential error in the spreadsheet, and thus the MCL clustering technique
can help the end-user find the error, aiding the spreadsheet debugging process.
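
The identification of unused cells can be sketched as follows. The sketch assumes a spreadsheet represented as a dictionary from cell address to either a number or a formula string; the cell values are hypothetical, only the formula B20=B6*B17 is taken from the example above, and cell ranges such as B8:B15 are deliberately not expanded.

```python
import re

# Matches simple cell addresses such as B6 or AA17 (ranges are not expanded).
CELL_REF = re.compile(r"[A-Z]+[0-9]+")


def unused_value_cells(sheet):
    """Return numeric cells that no formula in the sheet refers to."""
    referenced = set()
    for content in sheet.values():
        if isinstance(content, str) and content.startswith("="):
            referenced.update(CELL_REF.findall(content))
    return sorted(address for address, content in sheet.items()
                  if isinstance(content, (int, float))
                  and address not in referenced)


# Hypothetical values; B20 carries the seeded formula from the IPO example.
sheet = {"B5": 0.35, "B6": 0.10, "B17": 1000.0, "B20": "=B6*B17"}
print(unused_value_cells(sheet))  # ['B5']
```

A cell reported here has no edge in the data-flow graph, which is why it cannot appear in any MCL cluster and stays unhighlighted.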

5.4 Summary of experiment results

The MCL algorithm will most often produce clusters that match with the user’s

perspective of logical areas in spreadsheets. The identified MCL clusters have also

been shown to aid in the spreadsheet debugging process through the process of

cluster membership verification and identification of unused numeric spreadsheet

cells.

Chapter 6

Implementation

6.1 Introduction

We implemented the visualization tool using the Microsoft Excel spreadsheet system
in conjunction with the Graphael graph drawing software. Microsoft Excel was chosen
because it is a commonly used spreadsheet system. We chose the Graphael program
because it already has an implementation of the MCL algorithm, our algorithm of
choice for clustering. We did our programming on the Microsoft Excel side using
the programming language Visual Basic for Applications (VBA). On the Graphael side,
we had to modify the open-source Java code of Graphael to suit our needs.

6.2 Software architecture of the visualization tool

A conceptual architecture of the spreadsheet visualization tool is given in Figure

6.1. In this conceptual architecture, the system is viewed as an aggregation of co-

operating components which are represented by boxes and arrows. Screenshots of

the prototype of the visualization tool are given in Figure 6.2 and Figure 6.3.

In the architecture, whenever a user initiates the cluster generation process by in-

voking an appropriate command from a dropdown menu in the spreadsheet system

interface, the spreadsheet parser module will be run (the “Visualize” menu provides

a list of commands). The spreadsheet parser module is coded in Microsoft Visual

Basic for Applications (VBA). Spreadsheet dependencies are extracted through a

process of row-wise iteration through all used spreadsheet cells. The cell dependency

information is written to a text file in the Graph Modelling Language (GML) [33]

file format. The algorithm for the spreadsheet parser module is given in Algorithm

2 in Appendix A. A sample GML file is also given in Appendix B.
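
The parser step can be sketched as below. This is an illustrative Python sketch, not the VBA module of Algorithm 2: the function names, the regular expression, and the choice to direct each edge from a referenced cell to the cell whose formula uses it are assumptions. The two formulas are the ones quoted in Section 5.2.1.

```python
import re

# Matches simple cell addresses such as I5 or E14 (ranges are not expanded).
CELL_REF = re.compile(r"[A-Z]+[0-9]+")


def row_col(address):
    """Sort key giving row-wise order: row number first, then column."""
    column, row = re.fullmatch(r"([A-Z]+)([0-9]+)", address).groups()
    return int(row), len(column), column


def to_gml(formulas):
    """Build GML text for the dependency graph of {cell: formula}."""
    ids = {}      # cell address -> numeric node id
    edges = []    # (source id, target id): referenced cell -> formula cell

    def node_id(address):
        return ids.setdefault(address, len(ids))

    for target in sorted(formulas, key=row_col):
        target_id = node_id(target)
        for ref in CELL_REF.findall(formulas[target]):
            edges.append((node_id(ref), target_id))

    lines = ["graph [", "  directed 1"]
    for address, number in ids.items():
        lines += ["  node [", f"    id {number}",
                  f'    label "{address}"', "  ]"]
    for source, target in edges:
        lines += ["  edge [", f"    source {source}",
                  f"    target {target}", "  ]"]
    lines.append("]")
    return "\n".join(lines)


gml = to_gml({"I6": "=I5+E6-F6", "I14": "=I13+E14-F14"})
print(gml)
```

Writing the returned string to a text file would give the external graph definition that the graph drawing program then parses.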

Figure 6.1: Conceptual architecture of the spreadsheet visualization tool

Figure 6.2: A screenshot of the prototype for the visualization with a “Balance
Sheet” spreadsheet, a cluster window (top-right window) and a treemap window
(bottom-right window).

Figure 6.3: A screenshot of the prototype showing the formula view of the “Balance
Sheet” spreadsheet.

The graph drawing application, Graphael, is then invoked through a Windows
Application Programming Interface (API) command. The invoked Graphael
program then parses the GML text file for syntactical conformance to the Graph
Modelling Language (GML). This is done using the GML graph parser which is
part of the Graphael program. After successful parsing, a graph is generated in
the form of an adjacency matrix. An adjacency matrix is a square matrix, M, such
that M_ij = 1 if and only if (i, j) is an edge in graph G and M_ij = 0 otherwise. An
adjacency matrix can be implemented programmatically as an array [1...n, 1...n],
where n is the number of nodes, or using any other appropriate data structure.
The Markov clustering (MCL) algorithm is then run on the generated graph to
produce the required graph clusters.
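
The adjacency-matrix definition can be sketched with nested lists. The function name and the `directed` flag are assumptions for illustration; whether Graphael treats the spreadsheet graph as directed or undirected before clustering is not stated here, so both options are shown.

```python
def adjacency_matrix(node_count, edges, directed=True):
    """Return an n x n matrix M with M[i][j] = 1 iff (i, j) is an edge."""
    matrix = [[0] * node_count for _ in range(node_count)]
    for i, j in edges:
        matrix[i][j] = 1
        if not directed:
            matrix[j][i] = 1  # an undirected edge is stored in both directions
    return matrix


# Three nodes with edges 0 -> 1 and 1 -> 2.
edges = [(0, 1), (1, 2)]
directed_matrix = adjacency_matrix(3, edges)
undirected_matrix = adjacency_matrix(3, edges, directed=False)
for row in directed_matrix:
    print(row)
```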

After the MCL graph clusters have been produced, the clusters are displayed by

organizing them in a cluster tree. The leaves of the cluster tree are nodes (cells) of

a particular cluster. The top-most level view of the cluster tree is a single node. A

cluster tree is displayed at different levels in the cluster display window as in Fig. 6.2.

A right-click or left-click of the mouse over a node helps to navigate up and down

the cluster tree. Nodes which have already been visited are distinguished using dif-

ferent colourings. Green colour is used to indicate an already visited node (see Fig.

6.4). Cell members of a currently selected cluster are labelled with cell information

they represent. Nodes representing clusters which are not currently under selection

are left unlabelled. Linkages between currently selected cluster members with other

clusters are indicated through edge connections amongst them. This is because a

cell may belong to more than one logical area in the corresponding spreadsheet.

This technique is known as compound fisheye views. Compound fisheye views help

us to view members of a particular cluster while showing their linkages with other

clusters. To help in the navigation of the cluster tree, we also use a treemap window

in coordination with the cluster window as in Fig. 6.2. While we are navigating the

cluster tree in the cluster window, we use the treemap to know the depth we are at

in the cluster tree. The treemap also helps to know the number of nodes which are

in a particular cluster without actually navigating onto the cluster.
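
The two quantities the treemap conveys, depth in the cluster tree and the number of nodes in a cluster, can be sketched on a small tree. The dictionary representation, the two-level shape of the tree and the function names are assumptions; the cell members are those of clusters 1, 5 and 14 in Table 5.1.

```python
def leaf_count(node):
    """Number of leaf nodes (cells) below a tree node."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(leaf_count(child) for child in children)


def depth_of(node, name, depth=0):
    """Depth of the node called `name`, or None if it is not in the tree."""
    if node["name"] == name:
        return depth
    for child in node.get("children", []):
        found = depth_of(child, name, depth + 1)
        if found is not None:
            return found
    return None


def cluster(label, cells):
    """Helper: a cluster node whose leaves are its member cells."""
    return {"name": label, "children": [{"name": cell} for cell in cells]}


tree = {"name": "root", "children": [
    cluster("cluster 1", ["D6", "F6", "G6", "H6"]),
    cluster("cluster 5", ["F10", "G10", "H10"]),
    cluster("cluster 14", ["I5", "I6"]),
]}

print(leaf_count(tree["children"][1]))  # 3 cells in cluster 5
print(depth_of(tree, "F10"))            # 2: root -> cluster 5 -> F10
```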

Figure 6.4: A screenshot of the prototype showing the “Balance Sheet” spreadsheet
with highlighted logical areas.

As the user selects members of a particular cluster, the cluster members are
written into a text file. Upon demand, members of the currently selected cluster in
the cluster window can be highlighted in the spreadsheet. This is done by invoking
an appropriate command from a dropdown menu in the spreadsheet system interface.

Behind the scenes, the spreadsheet cluster/logical area highlighter module, written
in VBA, is run. It uses the cluster member text file to highlight the currently
selected cluster cells as in Fig. 6.2. The algorithm for the spreadsheet highlighter
module is given in Algorithm 3 in Appendix A. As the user navigates through the
clusters in the cluster window, the user may repeat the spreadsheet highlighting
process, which will lead to all logical areas being highlighted with different colours
in the spreadsheet as in Fig. 6.4. The user is also aided in navigating clusters in
the cluster window by visited nodes being marked with a different colour. This
helps the user to access only those clusters that still need to be visited.
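
The highlighting step can be sketched as a mapping from cells to colours. This Python sketch is not the VBA module of Algorithm 3: the file layout (one cluster per line, cell addresses separated by commas), the palette and the function name are all assumptions.

```python
import io

# Hypothetical palette; the colours used by the actual tool are not specified.
PALETTE = ["yellow", "light green", "light blue", "orange"]


def assign_colours(lines, palette=PALETTE):
    """Map each cell to a colour, one colour per cluster line."""
    colours = {}
    clusters = (line for line in lines if line.strip())  # skip blank lines
    for index, line in enumerate(clusters):
        colour = palette[index % len(palette)]  # reuse colours if needed
        for cell in line.split(","):
            colours[cell.strip()] = colour
    return colours


# An in-memory stand-in for the cluster member text file.
member_file = io.StringIO("D6, F6, G6, H6\nF10, G10, H10\n")
colours = assign_colours(member_file)
print(colours["D6"], colours["F10"])
```

In the tool itself the equivalent mapping would be applied to cell backgrounds in the spreadsheet.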

6.3 Summary

In this chapter, we presented how we implemented the prototype of the spreadsheet

visualization tool. We paid particular attention to demonstrate how the visualization

tool works with reference to its conceptual architecture.

Chapter 7

Discussion

7.1 Introduction

In this research work, we focussed on producing a graph-based spreadsheet visualiza-

tion tool that would not only aid in the understanding of a spreadsheet but also aid

in the debugging and maintenance of spreadsheets. However, as in any visualization

tool, we also needed to incorporate human-computer interaction (HCI) aspects in

trying to achieve the goals of the visualization tool. In the sequel, we present how

we attempted to address these issues in our work.

7.2 Spreadsheet understanding and comprehension

Understanding and comprehension of spreadsheets can be a daunting task especially

when one just uses the superficial numerical (value) view of a spreadsheet. Under-

standing and comprehending a spreadsheet might be necessary when one tries to un-

derstand a spreadsheet developed by others. One of the purposes of our spreadsheet

visualization tool was to use information from the underlying spreadsheet data-flow

graph to aid spreadsheet programmers/users in understanding their spreadsheets.

We achieved this through the use of a graph clustering algorithm that produces

graph clusters that match with logical areas in the original spreadsheet. The iden-

tified clusters are then highlighted using different cell background colours on the

original spreadsheet.

Hence, instead of looking at the whole spreadsheet at once, the user focusses
his/her attention on one highlighted logical area at a time. The spreadsheet
understanding process is therefore properly guided since the focus area matches
what the user might perceive to be a logical area. In addition, the user also has
the option of analyzing the cell members of a particular cluster in the graph
cluster window rather than in the spreadsheet.

7.3 The spreadsheet debugging process

Debugging a spreadsheet is also a daunting task since the numerical (value) view
of a spreadsheet hides the computational details in the spreadsheet. Although one can
access details of how spreadsheet computations are done through the formula view
of a spreadsheet, arbitrary explicit cell-by-cell inspection of cell formulas is also
challenging. Examining the corresponding data-flow graph provides a solution to

this problem. We, therefore, used information from a spreadsheet data-flow graph in

the spreadsheet debugging process by first generating graph clusters using the MCL

algorithm and then highlighting the identified clusters in the original spreadsheet.

The clusters correspond to logical areas in the spreadsheet. We then demonstrated

how through a process of cluster member verification, one can identify some types of

errors in the spreadsheet. Cluster member verification involves analysing whether a

cell belongs to a particular logical area or not. Cell formulas in a particular logical

area are also analysed through the same process. Unused numerical cells in the

spreadsheet are also easily identified since they are not highlighted in the spread-

sheet. This is because they are not part of the spreadsheet data-flow graph since

they are not part of any cell formula.

However, we take note that there are other types of spreadsheet errors which we
cannot identify using our visualization tool. For example, if a user enters a wrong
value in a cell due to a typographical error, it might be difficult to isolate such
an error. We therefore propose that the tool be used together with other existing
spreadsheet debugging techniques such as the use of assertions in spreadsheets [12].
Assertions help in making sure that numerical cells have expected values.
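
A value assertion of this kind can be sketched as a range check on cells. The sketch is in the spirit of, and not a reproduction of, the assertion mechanism of [12]; the function name, the cell values and the ranges are hypothetical.

```python
def violated_assertions(values, assertions):
    """Return cells whose value lies outside its asserted (low, high) range."""
    return sorted(cell for cell, (low, high) in assertions.items()
                  if not low <= values[cell] <= high)


# Hypothetical sheet: a tax rate mistyped as 35 instead of 0.35.
values = {"B5": 35, "B6": 0.10}
assertions = {"B5": (0, 1), "B6": (0, 1)}  # both rates must lie in [0, 1]
print(violated_assertions(values, assertions))  # ['B5']
```

A typographical error of this sort leaves the data-flow graph unchanged, so clustering cannot reveal it, whereas a range check on the value can.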

7.4 Spreadsheet maintenance

Non-trivial spreadsheets undergo maintenance cycles as in conventional software.

However, spreadsheets get larger in size as they undergo such maintenance routines.

We therefore also need a spreadsheet visualization tool that should be able to handle

large spreadsheets. Large spreadsheets result in large data-flow graphs which lead to

problems of graph navigability. We handled this problem through a graph clustering

algorithm that is designed to scale to large graphs. In particular, we used the MCL

algorithm. Our experiments showed that it was indeed scaling well to large graphs.

In addition, the spreadsheet tool might also be used in creating spreadsheet doc-

umentation artefacts since one can capture and store cell dependency information

at a particular time through the use of external graph definition text files. The

documentation artefacts could then be used in tracking changes to spreadsheets as

the spreadsheet evolves.

7.5 Addressing HCI aspects

We were able to generate spreadsheet data-flow graphs of any given spreadsheet

and then display the graph separately from the spreadsheet window. We separated

the data-flow graph from the spreadsheet because we believe that superimposing

the graph over the spreadsheet clutters the view of the spreadsheet. However, we

tried to maintain the mapping between the spreadsheet and the graph by labelling

graph nodes with corresponding familiar spreadsheet cell addresses such as “A1”.

All spreadsheet cells with formulas also had their corresponding graph nodes labelled

with their formula definitions.

The graphs can also be regenerated anytime the user wishes to do so by just clicking
on a command button on the spreadsheet system interface. This dynamism in graph
generation is important so that the user can work on a spreadsheet while at
the same time accessing the corresponding graph on the other side, thus achieving
real-time spreadsheet-graph interactivity.

To deal with the problem of visualizing large spreadsheet graphs (which comes from

large spreadsheets), we successfully employed the MCL algorithm which was satis-

factorily able to find “natural” clusters in the graphs which are then highlighted in

the corresponding spreadsheet. It was important to find a clustering algorithm that

finds clusters that match with logical areas in the corresponding spreadsheet. This

is because we did not want a clustering algorithm that produces “meaningless” clus-

ters since that will lead to the incomprehensibility of the spreadsheet thus defeating

one of the purposes of the spreadsheet visualization tool.

To help in the navigation of a generated data-flow graph we employed two
complementary windows for the visualization: the cluster window and the treemap
window. The generated MCL clusters are arranged in a cluster tree which is displayed
in the cluster window. The cluster window displays the generated MCL clusters as
nodes, with the root of the cluster tree being represented as a node which is used
as the starting point for graph navigation. On the other hand, a treemap is a
visualization technique in which hierarchical information is displayed within nested
rectangles, with each level of nesting corresponding to a level of hierarchical
decomposition. In our case, the cluster tree is a hierarchical decomposition of the
data-flow graph and as a result the treemap is a complementary navigation aid as one
accesses the graph in the cluster window. Treemaps not only help to visualize the
depth we are at while navigating a cluster tree but also indicate the number of
member nodes in a selected cluster. However, we note that the inclusion of the
treemap window means that the display has three different windows (i.e. the
spreadsheet window, the cluster window and the treemap window). This might lead to
the problem of information overload on the part of the viewer, which we have to
investigate further.

7.6 Summary

In this chapter, we discussed some of the issues being addressed in our graph-based

spreadsheet visualization. We have discussed how our visualization attempts to

address the problem of spreadsheet understanding and comprehension as well as

debugging. We also highlighted how several human-computer interaction (HCI)
concerns are addressed.

Chapter 8

Conclusion

8.1 A summary of the research work

In this research work, we presented a graph-based spreadsheet visualization that can

simplify the task of debugging and understanding (hence maintenance) of spread-

sheets. In particular, we tried to address the following three important aspects:

(i) Provision of a graph-based visualization that is on a separate window from

the original spreadsheet. The main purpose of separating the graph from the

spreadsheet is to avoid information overload on the user due to the cluttering

of the spreadsheet view. However, we note that an issue that can be raised

is the difficulty in the mapping between the spreadsheet and the graph. We

handled this by dynamically generating the graph from the spreadsheet. In

addition, we showed the link between spreadsheet cells and the corresponding

graph nodes by labelling graph nodes using cell addresses.

(ii) Application of a clustering algorithm to handle the visualization of large spread-

sheets (which lead to large data-flow graphs). This has been handled with the

MCL algorithm which is one of the algorithms specifically developed to handle

the visualization of large graphs. We tried to improve navigability of graph

clusters using compound fisheye views and treemaps.

(iii) Provision of a clustering algorithm which identifies graph clusters that match

with logical areas in the original spreadsheet. We achieved this by using the

MCL algorithm. We observed through our experiments that the algorithm

satisfactorily identifies graph clusters that match with logical areas in spread-

sheets. This is a novel way of finding logical areas in spreadsheets since the

logical areas are found without necessarily looking at structural similarity of

cell formulas.

8.2 Our contribution

The following are the main contributions of this research work:

(i) We have developed a prototype tool that dynamically generates spreadsheet

data-flow graphs which are separated from the spreadsheets. This is in contrast

to graph-based spreadsheet visualization techniques that process spreadsheet

graphs statically and separately from the original spreadsheet.

(ii) We have used a novel way of visualizing large spreadsheet data-flow graphs

by successfully employing the MCL algorithm to find clusters in the data-flow

graphs that match with logical areas in spreadsheets (Logical areas are not

necessarily derived from structural similarity of cell formulas).

(iii) We have also demonstrated how the graph-based visualization using the MCL

algorithm can assist in understanding and debugging spreadsheets.

8.3 Limitations

We have used two different software applications hand in hand to produce the
visualization tool. This in itself is a disadvantage because the software applications
use non-compatible programming languages, i.e. VBA for Microsoft Excel and Java
for the Graphael program. This meant that we had to find a VBA-Java application
programming interface (API) implementation. Unfortunately, we found none that
could provide for real-time spreadsheet-visualization interaction. We therefore
had to use text files as a means of communication between Microsoft Excel
and the Graphael program. We also had to use a text file to submit the input
graph to the graph drawing software because the graph drawing software currently
accepts graph definitions only in files, which are parsed before a graph
is generated. Similarly, for the cluster highlighting process, cluster member names
are written to a text file by the Graphael program, after which the file is accessed
by Microsoft Excel and the cluster members are highlighted in the spreadsheet.

Writing and reading text files brings in a computation overhead which affects the
spreadsheet-visualization interaction response time. The response time would have
been better if the graph drawing procedure had been implemented as part of
Microsoft Excel. Unfortunately, VBA is not a powerful enough programming language
to handle advanced algorithms like the MCL algorithm, which need complex data
structures.

8.4 Future work

In order to address some of the limitations as well as introduce new features in
our work, we plan to go in the following research directions:

(i) We plan to import the MCL algorithm and other graph drawing procedures into

the Microsoft Excel spreadsheet system by using compatible programming lan-

guages such as Microsoft C#. This will eliminate the need for separate graph

drawing software which we envisage might improve spreadsheet-visualization

response time.

(ii) We also plan to conduct experiments to investigate the computation overhead

of the visualization tool.

(iii) We also plan to conduct trials of the visualization tool with spreadsheet users

to gauge the usefulness and usability of the tool.

Bibliography

[1] Abello, J., Kobourov, S. G., and Yusufov, R. Visualizing Large Graphs
with Compound-Fisheye Views and Treemaps. In Proceedings of the 12th
International Symposium on Graph Drawing (2004), pp. 431–441.

[2] Abraham, R., and Erwig, M. Header and Unit Interference through Spatial
Analyses. In IEEE International Symposium on Visual Languages and
Human-Centric Computing (2004), IEEE, pp. 165–172.

[3] Abraham, R., and Erwig, M. Goal-Directed Debugging of Spreadsheets.
In Proceedings of the 2005 IEEE Symposium on Visual Languages and
Human-Centric Computing (VL/HCC’05) (Washington, DC, USA, 2005), IEEE
Computer Society, pp. 37–44.

[4] Abraham, R., and Erwig, M. How to Communicate Unit Error Messages
in Spreadsheets. In Proceedings of the First Workshop on End-User Software
Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.

[5] Abraham, R., and Erwig, M. Type Inference for Spreadsheets. In
Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of
Declarative Programming (New York, NY, USA, 2006), ACM, pp. 73–84.

[6] Ayalew, Y. A User-Centred Approach for Testing Spreadsheets. International
Journal of Computing and ICT Research 1, 1 (2007), 77–85.

[7] Ayalew, Y., Clermont, M., and Mittermeir, R. T. Detecting Errors
in Spreadsheets. In Proceedings of the 1st European Spreadsheet Risks Interest
Group Symposium: Spreadsheet Risks, Audit and Development Methods
(London, UK, 2000).

[8] Ayalew, Y., and Mittermeir, R. T. Spreadsheet Debugging. In
Proceedings of the 4th European Spreadsheet Risks Interest Group Symposium
(Dublin, Ireland, 2003).

[9] Ballinger, D., Biddle, R., and Noble, J. Spreadsheet Structure
Inspection Using Low Level Access and Visualisation. In Proceedings of the Fourth
Australasian Conference on User Interfaces (Darlinghurst, Australia, 2003),
Australian Computer Society, Inc., pp. 91–94.

[10] Blackwell, A. F. What is Programming? In Proceedings of the 14th
Workshop of the Psychology of Programming Interest Group, J. Kuljis, L. Baldwin,
and R. Scoble, Eds. PPIG, London, UK, 2002, pp. 204–218.

[11] Burnett, M., Atwood, J., Djang, R. W., Gottfried, H., Reichwein,
J., and Yang, S. Forms/3: A First Order Visual Language to Explore the
Boundaries of the Spreadsheet Paradigm. Journal of Functional Programming
11, 2 (2001), 155–206.

[12] Burnett, M., Cook, C., Pendse, O., Rothermel, G., Summet, J.,
and Wallace, C. End-User Software Engineering with Assertions in the
Spreadsheet Paradigm. In ICSE ’03: Proceedings of the 25th International
Conference on Software Engineering (Washington, DC, USA, 2003), IEEE
Computer Society, pp. 93–103.

[13] Burnett, M., Cook, C., and Rothermel, G. End-User Software
Engineering. Commun. ACM 47, 9 (2004), 53–58.

[14] Chen, T. Y., Kuo, F.-C., and Zhou, Z. Q. An Effective Testing Method
for End-User programmers. In Proceedings of the First Workshop on End-user
Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.

[15] Clermont, M. Heuristics for the Automatic Identification of Irregularities
in Spreadsheets. In Proceedings of the First Workshop on End-user Software
Engineering (New York, NY, USA, 2005), ACM, pp. 1–6.

[16] Clermont, M., Hanin, C., and Mittermeir, R. T. A Spreadsheet Tool
Evaluated in an Industrial Context. In Proceedings of the 3rd European
Spreadsheet Risks Interest Group Symposium (Cardiff, Wales, 2002).

[17] Davis, J. S. Tools for Spreadsheet Auditing. International Journal of
Human-Computer Studies 45, 4 (1996), 429–442.

[18] Deligiannidis, L., Kochut, K. J., and Sheth, A. P. User-centered
Incremental Data Exploration and Visualization. Tech. rep., LSDIS Lab and
Computer Science, University of Georgia, Athens, USA, 2006.

[19] Di-Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. Graph
Drawing: Algorithms for the Visualization of Graphs. Prentice–Hall, Upper
Saddle River, New Jersey, USA, 1999.

[20] Ellson, J., Gansner, E., Koutsofios, L., North, S. C., and
Woodhull, G. Graphviz Open Source Graph Drawing Tools. In Graph
Drawing, Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2002,
pp. 594–597.

[21] Engels, G., and Erwig, M. ClassSheets: Automatic Generation of
Spreadsheet Applications from Object-Oriented Specifications. In ASE ’05:
Proceedings of the 20th IEEE/ACM international Conference on Automated Software
Engineering (New York, NY, USA, 2005), ACM, pp. 124–133.

[22] Enright, A. J., Van Dongen, S., and Ouzounis, C. A. An Efficient
Algorithm for Large-Scale Detection of Protein Families. Nucleic Acids Research
30, 7 (2002), 1575–1584.

[23] Erwig, M., Abraham, R., Cooperstein, I., and Kollmansberger, S.
Automatic Generation and Maintenance of Correct Spreadsheets. In
Proceedings of the 27th IEEE/ACM International Conference on Software Engineering
(2005), pp. 136–145.

[24] Fisher, M., Cao, M., Rothermel, G., Cook, C. R., and Burnett,
M. Automated Test Case Generation for Spreadsheets. In Proceedings of the
24th International Conference on Software Engineering (New York, NY, USA,
2002), ACM, pp. 141–153.

[25] Fisher, M., and Rothermel, G. The EUSES Spreadsheet Corpus: a shared
resource for supporting experimentation with spreadsheet dependability
mechanisms. SIGSOFT Software Engineering Notes 30, 4 (2005), 1–5.

[26] Forrester, D., Kobourov, S. G., Navabi, A., Wampler, K., and
Yee, G. V. Graphael: A System for Generalized Force-Directed Layouts. In
Graph Drawing, J. Pach, Ed., vol. 3383 of Lecture Notes in Computer Science.
Springer, 2004, pp. 454–464.

[27] Galletta, D. F., Hartzel, K. S., Johnson, S., Joseph, J., and
Rustagi, S. An Experimental Study of Spreadsheet Presentation and Error
Detection. In HICSS ’96: Proceedings of the 29th Hawaii International
Conference on System Sciences (HICSS) Volume 2: Decision Support and
Knowledge-Based Systems (Washington, DC, USA, 1996), IEEE Computer Society, p. 336.

[28] Gansner, E. R., Koren, Y., and North, S. C. Topological Fisheye
Views for Visualizing Large Graphs. IEEE Transactions on Visualization and
Computer Graphics 11, 4 (2005), 457–468.

[29] Godehardt, E. Graphs as Structural Models – 2nd Edition. In Advances in
Systems Analysis, D. P. F. Moller, Ed. Vieweg, Germany, 1990.

[30] Graphael Home Page. URL: http://graphael.cs.arizona.edu/, Access
date: 1st August, 2007.

[31] GraphViz Home Page. URL: http://www.graphviz.org/, Access date: 1st
August, 2007.

[32] Hendry, D., and Green, T. Creating, Comprehending and Explaining
Spreadsheets: A Cognitive Interpretation of What Discretionary Users Think
of the Spreadsheet Model. International Journal of Human-Computer Studies
40, 6 (June 1994), 1033–1065.

[33] Himsolt, M. GML: a portable Graph File Format. Tech. rep.,
Universität Passau, 94030 Passau, Germany, 1996. URL:
http://www.infosun.fim.uni-passau.de/Graphlet/GML/gml-tr.html, Access date:
10th August, 2007.

[34] Igarashi, T., Mackinlay, J. D., Chang, B. W., and Zellweger, P. T.


Fluid Visualization of Spreadsheet Structures. In Proceedings of the IEEE Sym-
posium on Visual Languages (1998), pp. 118–125.

[35] Kankuzi, B., and Ayalew, Y. A Dynamic Graph-Based Visualization


for Spreadsheets. In Proceedings of the 3rd IASTED Conference on Human-
Computer Interaction (Innsbruck, Austria, 2008), pp. 198–203.

[36] Kankuzi, B., and Ayalew, Y. A User-Centered Graph-Based Visualization
for Spreadsheets. In Proceedings of the 4th International Workshop on End-User
Software Engineering (WEUSE ’08) (Leipzig, Germany, 2008), ACM Press,
pp. 86–90.

[37] Ko, A. J. Barriers to Successful End-User Programming. In End-User Software
Engineering, M. H. Burnett, G. Engels, B. A. Myers, and G. Rothermel, Eds.,
no. 07081 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und
Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.

[38] Mittermeir, R., and Clermont, M. Finding High-Level Structures in
Spreadsheet Programs. In WCRE ’02: Proceedings of the Ninth Working
Conference on Reverse Engineering (WCRE’02) (Washington, DC, USA, 2002),
IEEE Computer Society, p. 221.

[39] Myers, B. A., Burnett, M. M., Wiedenbeck, S., and Ko, A. J. End
user software engineering: CHI 2007 special interest group meeting. In CHI ’07
Extended Abstracts on Human Factors in Computing Systems (New York, NY,
USA, 2007), ACM, pp. 2125–2128.

[40] Myers, B. A., Ko, A. J., and Burnett, M. M. Invited Research
Overview: End-User programming. In CHI ’06 Extended Abstracts on Human
factors in Computing Systems (New York, NY, USA, 2006), ACM, pp. 75–80.

[41] Nardi, B., and Miller, J. The Spreadsheet Interface: A Basis for End User
Programming. Hewlett-Packard, 1990.

[42] Nardi, B. A. A small matter of programming: perspectives on end-user
computing. The MIT Press, 1993.

[43] Panko, R. R. Ray Panko’s Spreadsheet Research website. URL:
http://panko.shidler.hawaii.edu/SSR/index.htm, Accessed: 29th September,
2007.

[44] Panko, R. R. Applying Code Inspection to Spreadsheet Testing. Journal of
Management Systems 16, 2 (1999).

[45] Panko, R. R. Spreadsheet Errors: What We Know. What We Think We Can
Do. In Proceedings of the European Spreadsheet Risk Interest Group Symposium
(2000).

[46] Panko, R. R., and Sprague, R. H. Hitting the Wall: Errors in Developing
and Code Testing a Simple Spreadsheet Model. Decision Support Systems 22,
4 (1998).

[47] Pietriga, E. A Toolkit for Addressing HCI Issues in Visual Language Environ-
ments. IEEE Symposium on Visual Languages and Human-Centric Computing
(VL/HCC) 00 (2005), 145–152.

[48] Randolph, N., Morris, J., and Lee, G. A Generalised Spreadsheet Ver-
ification Methodology. In ACSC ’02: Proceedings of the Twenty-Fifth Aus-
tralasian Conference on Computer science (Darlinghurst, Australia, Australia,
2002), Australian Computer Society, Inc., pp. 215–222.

[49] Ronen, B., Palley, M. A., and Lucas, H. C., Jr. Spreadsheet Analysis
and Design. Commun. ACM 32, 1 (1989), 84–93.

[50] Rothermel, G., Burnett, M., Li, L., Dupuis, C., and Sheretov,
A. A Methodology for Testing Spreadsheets. ACM Transactions on Software
Engineering and Methodology 10, 1 (2001), 110–147.

[51] Ruthruff, J. R., and Burnett, M. Six Challenges in Supporting
End-User Debugging. In Proceedings of the First Workshop on End-user Software
Engineering (WEUSE I) (New York, NY, USA, 2005), ACM, pp. 1–6.

[52] Ruthruff, J. R., Prabhakararao, S., Reichwein, J., Cook, C.,
Creswick, E., and Burnett, M. Interactive, Visual Fault Localization
Support for End-User Programmers. Tech. rep., School of Electrical
Engineering and Computer Science, Oregon State University USA, 2004.

[53] Sajaniemi, J. Modeling Spreadsheet Audit: A Rigorous Approach to
Automatic Visualization. Journal of Visual Languages and Computing 11, 1
(2000), 49–82.

[54] Sajaniemi, J. Modeling Spreadsheet Audit: A Rigorous Approach to
Automatic Visualization. Journal of Visual Languages and Computing 11, 1
(2000), 49–82.

[55] Scaffidi, C., Shaw, M., and Myers, B. An Approach for Categorizing
End User Programmers to Guide Software Engineering Research. In WEUSE
I: Proceedings of the First Workshop on End-user Software Engineering (New
York, NY, USA, 2005), ACM, pp. 1–5.

[56] Segal, J. Two principles of end-user software engineering research. In
Proceedings of the First Workshop on End-user Software Engineering (WEUSE I)
(New York, NY, USA, 2005), ACM, pp. 1–5.

[57] Seta, K., Ikeda, M., Kakusho, O., and Mizoguchi, R. Capturing a
Conceptual Model for End-User Programming: Task Ontology as a Static User
Model. In User Modeling: Proceedings of the Sixth International Conference,
UM97, A. Jameson, C. Paris, and C. Tasso, Eds. Springer Wien New York,
Vienna, New York, 1997, pp. 203–214.

[58] Sjoberg, D. I. K., Dyba, T., and Jorgensen, M. The Future of Empirical
Methods in Software Engineering Research. In FOSE ’07: 2007 Future of
Software Engineering (Washington, DC, USA, 2007), IEEE Computer Society,
pp. 358–378.

[59] Tichy, W. F. Should Computer Scientists Experiment More? IEEE Computer
31, 5 (1998), 32–40.

[60] Tollis, I. G. Graph Drawing and Information Visualization. ACM Computing
Surveys (1996), 19.

[61] van Dongen, S. MCL - an algorithm for clustering graphs. URL:
http://micans.org/mcl/, Access date: 1st August, 2007.

[62] van Dongen, S. Graph Clustering by Flow Simulation. PhD thesis, Centre for
Mathematics and Computer Science, University of Utrecht, The Netherlands,
2000.

[63] Vemula, V. R., Ball, D., and Thorne, S. Towards a Spreadsheet
Engineering. In Proceedings of the 2006 European Spreadsheet Risks Interest
Group (2006). URL:
http://www.eusprig.org/2006/vemula-towards-spreadsheet-engineering.pdf,
Accessed on 14th August, 2007.

[64] Vemuri, S., Sengupta, S., and Davis, J. S. Data Dependency Diagrams
for Spreadsheet Applications. In Proceedings of the 30th Annual Southeast
Regional Conference (New York, NY, USA, 1992), ACM, pp. 467–470.

[65] Wang, Y., Carzaniga, A., and Wolf, A. L. Four Enhancements to
Automated Distributed System Experimentation Methods. In ICSE ’08: Proceedings
of the 30th International Conference on Software Engineering (New York, NY,
USA, 2008), ACM, pp. 491–500.

[66] Wiggerts, T. A. Using Clustering Algorithms in Legacy Systems
Remodularization. In Proceedings of the Fourth Working Conference on Reverse
Engineering (WCRE ’97) (Washington, DC, USA, 1997), IEEE Computer Society,
pp. 33–43.

[67] Wilson, A., Burnett, M., Beckwith, L., Granatir, O., Casburn,
L., Cook, C., Durham, M., and Rothermel, G. Harnessing Curiosity to
Increase Correctness in End-User Programming. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems (New York, NY, USA,
2003), ACM, pp. 305–312.

Glossary

Spreadsheet-related Terminology

Spreadsheet system: Application software which allows computations to be
defined by cells and formulas which reference cells. A spreadsheet system usually
contains a two-dimensional grid of cells known as a spreadsheet as well as an
accompanying programming language which allows the development of third party
applications to extend the generic functionality of a spreadsheet system. A popular
spreadsheet system is Microsoft Excel, which comes with the programming language
Visual Basic for Applications (VBA).

Spreadsheet: usually a two-dimensional grid of cells comprised of rows and columns
where data is entered and calculations are specified through formulas. In Microsoft
Excel, a spreadsheet is called a worksheet.

Cell: the intersection of a column and a row. One can enter text, a number or
a formula in a cell.

Spreadsheet program: We use spreadsheet program synonymously with
spreadsheet. See spreadsheet.

Spreadsheet model: a categorization of programming in which computations are
specified through cells and formulas.

Spreadsheet template: a spreadsheet containing standardized content and/or
formatting that one can use as a basis for developing other spreadsheets.

Cell formula: An entry that produces a calculated result, usually based on a
reference to one or more cells. The result of a formula changes if one changes the
contents of a cell referenced in the formula. An example formula in Microsoft Excel
would be cell A1 having the formula =B1+$C$2.

Graph-related Terminology

Graph: A graph G consists of two finite sets, V and E. Each element of V is
called a vertex or a node. The elements of set E are called edges and these are
unordered pairs of vertices. For example, the set V might be {1, 4, 7, 8, 9} and
set E might be {{1, 4}, {4, 9}, {1, 8}, {4, 7}}. Together, V and E form a graph G.

Connected graph: a graph is connected if every pair of nodes can be joined
by a path.

Tree: a connected graph that contains no cycles. Graph-theoretic trees resemble
trees in nature in the sense that graph-theoretic trees do not have cycles, just as
the branches of trees in nature do not split and rejoin.
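
The definitions of graph, connected graph and tree can be checked mechanically. The following is a minimal Python sketch, not part of the tool described in this dissertation; the function names is_connected and is_tree are hypothetical, and the graph is assumed to be simple (no self-loops). It uses the example sets V and E from the Graph entry and relies on the standard fact that a connected graph is a tree exactly when it has |V| - 1 edges.

```python
from collections import deque


def is_connected(vertices, edges):
    """Return True if every pair of vertices can be joined by a path."""
    vertices = set(vertices)
    if not vertices:
        return True
    # Build an adjacency map from the unordered vertex pairs.
    adjacency = {v: set() for v in vertices}
    for edge in edges:
        u, w = tuple(edge)
        adjacency[u].add(w)
        adjacency[w].add(u)
    # Breadth-first search from an arbitrary start vertex.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == vertices


def is_tree(vertices, edges):
    """A connected graph has no cycles exactly when |E| = |V| - 1."""
    return is_connected(vertices, edges) and len(set(edges)) == len(set(vertices)) - 1


# The example graph from the Graph entry of this glossary.
V = {1, 4, 7, 8, 9}
E = {frozenset(pair) for pair in [(1, 4), (4, 9), (1, 8), (4, 7)]}
print(is_connected(V, E), is_tree(V, E))  # True True
```

Adding the edge {7, 9} to E would create the cycle 4, 7, 9, so the graph would remain connected but would no longer be a tree.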

Cluster: a grouping of nodes (vertices) in a graph depending on some criteria
such as structural or geometric proximity.

Fisheye view: A technique which allows one to view a picture or a diagram as
a whole at once while at the same time allowing the viewer to see detailed parts
of the picture without losing the overall context of the picture. Fisheye views are
important in graph drawing because they enable the display of a complex graph on
the limited screen display of most computers.

Treemap: a visualization technique in which hierarchical information is displayed
within nested rectangles, with each level of nesting corresponding to a level of
hierarchical decomposition. Cluster trees of complex graphs may also be visualized
using treemaps.

Appendix A

Algorithm 2 The algorithm for the spreadsheet parser module


Require: a non-empty spreadsheet
1: open a GML graph definition text file
2: for all used cells in spreadsheet do
3:     if cell has formula then
4:         extract cell dependency information
5:         write extracted cell dependencies to GML graph definition file
6:     end if
7: end for
8: close GML graph definition file
9: invoke graph drawing software
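
Steps 2 to 7 of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the spreadsheet is assumed to be available as a dictionary mapping cell addresses to contents, the names referenced_cells and build_gml are hypothetical, and the regular expression recognises only plain A1-style references and ranges on a single worksheet (it would, for instance, misread a function name such as LOG10 as a cell reference). Opening the file and invoking the graph drawing software (steps 1, 8 and 9) are omitted.

```python
import re

# One A1-style reference, optionally absolute ($), optionally a range (A1:B2).
CELL = r"\$?([A-Z]{1,3})\$?(\d+)"
REF_PATTERN = re.compile(CELL + r"(?::" + CELL + r")?")


def col_to_num(col):
    """Convert a column name such as 'F' or 'AA' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n


def num_to_col(n):
    """Convert a 1-based column number back to a column name."""
    col = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        col = chr(ord("A") + rem) + col
    return col


def referenced_cells(formula):
    """Return the addresses a formula refers to, expanding ranges."""
    refs = []
    for c1, r1, c2, r2 in REF_PATTERN.findall(formula):
        if not c2:  # single cell, not a range
            c2, r2 = c1, r1
        cols = sorted((col_to_num(c1), col_to_num(c2)))
        rows = sorted((int(r1), int(r2)))
        for c in range(cols[0], cols[1] + 1):
            for r in range(rows[0], rows[1] + 1):
                address = f"{num_to_col(c)}{r}"
                if address not in refs:
                    refs.append(address)
    return refs


def build_gml(cells):
    """Build GML text with one edge from each referenced cell to the formula cell."""
    ids, labels, edges = {}, {}, []

    def node_id(address):
        if address not in ids:
            ids[address] = len(ids) + 1
            labels[address] = address
        return ids[address]

    for address, content in cells.items():
        if isinstance(content, str) and content.startswith("="):
            sources = [node_id(ref) for ref in referenced_cells(content)]
            target = node_id(address)
            labels[address] = address + content  # formula cells show their formula
            edges.extend((source, target) for source in sources)

    lines = ["graph [ directed 0"]
    for address, identifier in ids.items():
        lines.append(f'  node [ id {identifier} label "{labels[address]}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {source} target {target} ]")
    lines.append("]")
    return "\n".join(lines)


cells = {f"F{row}": 10 for row in range(34, 40)}
cells["F40"] = "=SUM(F34:F39)"
print(build_gml(cells))
```

For this input the sketch produces seven nodes and six edges, with the same structure as the sample file in Appendix B. Formulas containing double quotes would need escaping before being written as GML labels.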

Algorithm 3 The algorithm for the spreadsheet highlighter module


Require: a non-empty cluster member text file
1: open a cluster member text file
2: for all cluster members in text file do
3:     determine cell address of cluster member
4:     generate random colour for the cell
5:     highlight cell background with the generated random colour
6: end for
7: close cluster member text file
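
The loop in Algorithm 3 can be sketched in Python as follows. The layout of the cluster member text file is not specified here, so the sketch assumes one cluster member per line, each beginning with a cell address (for example "F34" or "F40=SUM(F34:F39)"). The function name plan_highlighting is hypothetical, and instead of colouring cells through a spreadsheet API the sketch only returns the colour chosen for each cell address.

```python
import random
import re

# A cluster member line is assumed to start with an A1-style cell address.
ADDRESS = re.compile(r"\s*\$?([A-Z]{1,3})\$?(\d+)")


def plan_highlighting(lines, seed=None):
    """Map each cell address found in `lines` to a random RGB colour."""
    rng = random.Random(seed)  # a seed makes the colours reproducible
    colours = {}
    for line in lines:
        match = ADDRESS.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        address = match.group(1) + match.group(2)
        colours[address] = tuple(rng.randrange(256) for _ in range(3))
    return colours


# Any iterable of lines works, including an open text file.
members = ["F34 \n", "F35 \n", "F40=SUM(F34:F39)\n", "\n"]
for address, rgb in plan_highlighting(members, seed=1).items():
    print(address, rgb)
```

In a spreadsheet system, the returned colour would then be applied to the background of the corresponding cell.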

Appendix B
A sample GML graph definition file:

graph [ directed 0
    node [
        id 1
        label "F34 " ]
    node [
        id 2
        label "F35 " ]
    node [
        id 3
        label "F36 " ]
    node [
        id 4
        label "F37 " ]
    node [
        id 5
        label "F38 " ]
    node [
        id 6
        label "F39 " ]
    node [
        id 7
        label "F40=SUM(F34:F39)" ]
    edge [
        source 1
        target 7 ]
    edge [
        source 2
        target 7 ]
    edge [
        source 3
        target 7 ]
    edge [
        source 4
        target 7 ]
    edge [
        source 5
        target 7 ]
    edge [
        source 6
        target 7 ]
]
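
A file of this shape can be read back with a few lines of code. The following Python sketch is an illustration only: the names tokenize, parse_list and read_graph are hypothetical, and it handles just the subset of GML used above (nested key/value lists, integers and double-quoted strings), not the full format.

```python
import re

# A token is a double-quoted string, a bracket, or a bare word/number.
TOKEN = re.compile(r'"([^"]*)"|(\[|\])|([^\s\[\]"]+)')


def tokenize(text):
    """Yield brackets and bare words as str, quoted strings as ('str', value)."""
    for string, bracket, word in TOKEN.findall(text):
        if bracket:
            yield bracket
        elif word:
            yield word
        else:
            yield ("str", string)


def parse_list(tokens):
    """Parse key/value pairs until the closing bracket or the end of input."""
    items = []
    for key in tokens:
        if key == "]":
            break
        value = next(tokens)
        if value == "[":
            value = parse_list(tokens)  # nested list
        elif isinstance(value, tuple):
            value = value[1]  # quoted string
        else:
            try:
                value = int(value)
            except ValueError:
                pass  # keep other bare values as text
        items.append((key, value))
    return items


def read_graph(text):
    """Return ({node id: label}, [(source, target), ...]) for a GML graph."""
    top = dict(parse_list(tokenize(text)))
    nodes, edges = {}, []
    for key, value in top["graph"]:
        if key == "node":
            fields = dict(value)
            nodes[fields["id"]] = fields.get("label", "")
        elif key == "edge":
            fields = dict(value)
            edges.append((fields["source"], fields["target"]))
    return nodes, edges


sample = '''graph [ directed 0
    node [ id 1 label "F34 " ]
    node [ id 2 label "F35 " ]
    node [ id 3 label "F36=SUM(F34:F35)" ]
    edge [ source 1 target 3 ]
    edge [ source 2 target 3 ]
]'''
nodes, edges = read_graph(sample)
print(nodes)
print(edges)
```

Each edge runs from a referenced cell (source) to the cell whose formula refers to it (target), as in the sample file above.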
