
Unleashing Robotic Process Automation

Through Process Mining

Master Thesis

Author: Maria Gkotsopoulou
Supervisor: Josep Carmona Vargas
Department: Computer Science

A thesis presented for the degree of

Master in Innovation and Research in Informatics


Data Science

Date of Defense: 01-07-2020


Unleashing Robotic Process Automation Through Process Mining

Maria Gkotsopoulou

Abstract
Robotic Process Automation (RPA) is an umbrella term for tools that run on an
end user’s computer, emulating tasks previously executed through a user interface by
means of a software robot. Many studies have highlighted the potential benefits of RPA:
by bridging artificial intelligence and business process management, it offers the
promise of robots as a virtual workforce that performs tasks, leading to efficiency
improvements in business processes. However, most commercial tools do not cover
the Analysis phase of the RPA lifecycle. The lack of automation in this phase
is mainly reflected by the absence of technological solutions that look for the best
candidate processes of an organization to be automated. Based on process mining
techniques, we seek to define an automatable indicator system to guide and direct
companies that want to better prioritize their RPA activities.
Keywords: Process mining, Robotic Process Automation

Acknowledgements

I would like to express my gratitude to my academic supervisor, Prof. Josep Carmona,
for his timely and constructive feedback and comments. I would also like
to reflect that this work was done, in solitude, during the Covid-19 imposed lockdown,
which has undoubtedly left a mark on us all. This has been a period of
personal introspection and contemplation, as well as, hopefully, growth. I would like to
envision that academic study, as well as research, will find themselves occupying a
most impactful role and place in the future.

Contents

1 Introduction 5
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Preliminaries 8
2.1 Mathematical Notations . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Algorithms for Event Data . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Trace clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Related Work 19

4 Indicators for the Automation Potential of Processes 24


4.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Execution Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Execution Frequency: case . . . . . . . . . . . . . . . . . . . . 27
4.3 Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Inverse Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Prior/follow variability . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Implementation 32
5.1 Code implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6 Evaluation 35
6.1 BPI Challenge 2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Artificial UI logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7 Conclusion 43

A Appendix 52
A.1 Code implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

1 | Introduction

Organizational knowledge management (KM) positions knowledge as an organizational
resource and emphasizes the importance of knowledge work and knowledge
worker productivity to achieve competitive advantages [82]. However, for the exe-
cution of a business process, workers currently spend substantial time dealing with
systemantic problems [56]. An example of a systemantic problem would be that the
typical automated operations system (including Enterprise Resource Planning,
Customer Relationship Management, e-commerce, and e-business solution systems)
is unable to complete a whole process, end-to-end. Despite a widespread use of
process-aware information systems (PAIS), companies still rely on employees to
trigger or terminate processes, adapt them to cope with different cases, or use their
outcomes for a variety of purposes [3].
Lightweight IT is a socio-technical knowledge regime, driven by competent users'
need for solutions, enabled by the consumerization of digital technology, and realized
through innovation processes [24]. An example of lightweight IT is Robotic Process
Automation (RPA), which is an emerging field of business process automation [77].
RPA is a software-based technology designed to automate processes by mimicking
and replicating the execution of tasks performed by humans in their application’s
user interface (UI). Since RPA operates at the UI level, rather than at the system
level, it allows one to apply automation without any changes in the underlying
information system [7].
Potential use cases for RPA are, among others, data transfers and the processing of
high volumes of data [40]. RPA is frequently implemented in standard back office
processes, with the idea that high volume and repetitive tasks can be automated
[31]. Enabling companies to free up resources and to reallocate them to activities
with a focus on creating business value and customer satisfaction, RPA can foster
the emergence of new work forms and drive organizational competitiveness in the
digital age [101]. In the research literature, a number of case studies have shown
that RPA technology can concretely lead to improvements in efficiency for business
processes (BPs) involving routine work in large companies, such as O2 and Vodafone
([8], [40], [59]). Although the benefits in cost savings can be significant [11], not all
business processes are suitable for its use [35].
RPA tools allow users to capture routines, consisting of interactions with one or
multiple applications, that are encoded as scripts and then executed by software
bots [66]. While RPA tools empower companies to automate routine work, they do
not address the question of which specific routines should be automated, and how
[66]. In their systematic mapping study, Enriquez et al. [35] demonstrated that only
a few of the software products available in the market partially cover the Analysis
phase. In the context of the lifecycle of an RPA project, the Analysis phase
consists of analyzing and determining the viability of carrying out the automation of
a certain process [35]. Moreover, to date, there are no established methods to carry
out this phase [79]. On the other hand, several RPA projects fail to stay within
budget and time, and the return on investment is usually not delivered as expected

[31]. This is often caused by a false notion of process complexity and a lack of
transparency about how processes are being executed [54].

1.1. Motivation

Complex organizations operate on differentiated architectures, with processes that
differ with respect to various characteristics that influence a process's automation
potential. To this extent, companies must obtain a detailed understanding of their
processes to ensure a sufficient economic value and the overall success of their RPA
initiatives [101]. Not only does an improvement in the execution of the Analysis
phase of an RPA implementation project have the potential to reduce cost, but it
can also shorten the duration of the whole project [35]. It is important to note,
though, that robots typically do not automate complete end-to-end processes, but
only sub-processes or certain tasks thereof. Consequently, processes, referring to
end-to-end or sub-processes, are distinguished from tasks, referring to individual
activities within a process [101].
Process mining focuses on collecting, analyzing, and interpreting execution data ex-
tracted from event logs of PAIS [96]. Event logs contain information about the
events that are executed for each instance of a process. Process mining assumes that
events are recorded sequentially and refer to certain activities that together form cases [5].
Most notably, it offers techniques for fact-driven analysis of business processes.
One of the most common techniques of process mining is process discovery, where
information from an event log is extracted to build a process model. These models
represent the as-is process within the business. Process mining provides companies
with the means to achieve various objectives, such as obtaining a better under-
standing and control of their processes, as well as, detecting different variations of a
process, or finding the root causes of performance deficiencies [70]. So, by analyzing
event data, insights into the actual execution of processes are uncovered, such as
undesired process patterns, bottlenecks and compliance issues. Geyer-Klingeberg
et al. [40] describe process mining and RPA as complementary concepts. Specifi-
cally, they argue that process mining supports companies in capturing an end-to-end
view on their processes, identifying high-value automation candidates, quantifying
the economic value of corresponding initiatives, and eliminating costly and often
subjective manual process evaluations.
On the other hand, to date, the task of identifying which routines are good candi-
dates for being automated by RPA tools, is often performed by means of interviews,
walkthroughs, direct observation of workers, and analysis of documentation that
may be of poor quality and difficult to understand [7]. Even if research has pro-
posed some high-level criteria to measure the suitability of processes for RPA ([14],
[38], [40]), these concepts, being high-level, are not operationalized. Enriquez et
al. [35], from their in-depth analysis of 54 primary studies, conclude that 53.57% of
the studies propose only a theoretical approach, while 41.07% do not present a
validation of their proposal. Wanner et al. [101] examined
six case studies with a focus on challenges during RPA projects ([8], [11], [55], [59],
[87], [106]), and concluded that the success of RPA projects depends significantly on
the availability of a standardized method to quantify and compare the automation
potential of process tasks. They are the first to further propose a multi-dimensional

indicator system designed as a decision-support method to guide companies in
planning and designing RPA initiatives in practice. Thus, defining metrics that would
support the Analysis phase of the RPA lifecycle constitutes an interesting research
line to be considered by the academic community.

1.2. Research Questions

A set of research questions has been formulated in order to formalize what needs to
be addressed to fulfill the purpose of this study. The research questions are as
follows:
• RQ1: How to assess the suitability of a process task extracted from an event
log to support the Analysis phase of an RPA project?
• RQ2: How to define an objective and generalizable methodology to evaluate
the quantifiable decision support framework of process task selection in RPA?

1.3. Thesis Structure

In this section, we present an outline for the remainder of this thesis. In Chapter 2,
general definitions and concepts related to this thesis work are introduced. Chap-
ter 3 discusses related work, in the form of case studies, prototypes or systems
related to the identification, elicitation and design of the to-be-automated tasks in an
RPA project. In Chapter 4 the general framework, including the set of indicators to
be used in the Analysis phase, is presented. Chapter 5 makes a brief presentation
of the system design, as well as the implementation decisions taken. Chapter 6 con-
ducts an evaluation both on real-life event data and on synthetic event data. Chapter
7 concludes this thesis and presents possible limitations.



2 | Preliminaries

In this chapter, we introduce multiple definitions and notations that we use in this
thesis. First, basic mathematical concepts are presented. Second, concepts related
to process mining are introduced. Finally, definitions about trace clustering are
presented.

2.1. Mathematical Notations

Definition 2.1.1 (Set). A set is a well-defined, possibly infinite, unsorted collection
of distinct objects. The objects that belong to a set are called elements and can be
of an arbitrary type. A set is usually denoted by a capital letter such as X, and it
can be described listing its elements between curly braces, e.g., X = {x1 , x2 , x3 } is
a set containing elements x1 , x2 , x3 .
Given two sets X, Y ,
• Set X = {x1 , x2 , . . . , xn } has n elements, and the cardinality of a set (number
of elements or size) is denoted by |X|. We use ∈ to express that an element
belongs to a set, e.g., x1 ∈ X .
• ∅ denotes the empty set.
• We can perform the following operations on sets: union (X ∪ Y = {x|x ∈
X ∨ x ∈ Y } ), intersection (X ∩ Y = {x|x ∈ X ∧ x ∈ Y } ), and difference
(X \ Y = {x|x ∈ X ∧ x ∉ Y }).
• X ⊆ Y denotes X is a subset of Y ⇔ X ∩ Y = X. A set X is a subset of set
Y when all the elements of X are included in Y . A set X is a strict subset of
Y (X ⊂ Y ) if X is a subset of Y and they are not equal.
• P(X) denotes the power set of set X, i.e., P(X) = {Y |Y ⊆ X},i.e. the set
of all possible subsets of X, including X and the empty set (∅). For instance,
given X = {x1 , x2 }, then P(X) = {∅, {x1 }, {x2 }, {x1 , x2 }}
Set N = {1, 2, 3, . . . } denotes natural numbers, and N0 = N ∪ {0}
Definition 2.1.2 (Tuple). A tuple is a finite ordered list of elements. A tuple can be
expressed by listing its elements in order between parentheses, e.g., A = (a1 , a2 , a3 ).
Also, it can be expressed as belonging to the Cartesian product of other sets, e.g.,
A = B × C.
Given n arbitrary sets, i.e. X1 , . . . , Xn , we define the n-ary Cartesian product of
these n sets as X1 × . . . × Xn = {(x1 , . . . , xn )|∀1 ≤ i ≤ n(xi ∈ Xi )}. We refer to an
element in an n-ary Cartesian product as an n-ary tuple. In case n = 2, the Cartesian
product defines the set of all ordered pairs (x1 , x2 ) ∈ X1 × X2 . Given set X and
Cartesian product X1 × . . . × Xn , if ∀1 ≤ i ≤ n(Xi = X), we simply write X n .
Definition 2.1.3 (Functions). A function f from X to Y , maps an element in X
to an element in Y , and is expressed as f ∈ X → Y . The domain of f is X if for
every element in X, f is defined and maps it to a value in Y , i.e., dom(f ) = X ⇔

∀x ∈ X : f (x) ∈ Y . The range of f is the set of the images of all elements in the
domain, i.e., range(f ) = {f (x)|x ∈ dom(f )}.
A partial function g ∈ X ⇸ Y is a function whose domain is a subset of X, i.e.,
dom(g) ⊆ X, which means that g does not need to be defined for all values in X.
A binary relation R ⊆ X × X is referred to as an endorelation on X. For such
endorelations, the following properties are of interest in the context of this thesis.
• R is reflexive, if and only if, ∀x ∈ X(xRx).
• R is irreflexive, if and only if, ∄x ∈ X(xRx).
• R is symmetric, if and only if, ∀x, y ∈ X(xRy ⇒ yRx).
• R is antisymmetric, if and only if, ∀x, y ∈ X(xRy ∧ yRx ⇒ x = y).
• R is transitive, if and only if, ∀x, y, z ∈ X(xRy ∧ yRz ⇒ xRz).
A relation ≼ ⊆ X × X, alternatively written (X, ≼), is a partial order, if and only
if, it is reflexive, antisymmetric and transitive. A relation ≺ ⊆ X × X, alternatively
written (X, ≺), is a strict partial order, if and only if, it is irreflexive, antisymmetric
and transitive.
Definition 2.1.4 (Function Projection). Let f ∈ X ⇸ Y be a (partial) function
and Q ⊆ X. f↾Q is the function projected on Q: dom(f↾Q) = dom(f) ∩ Q and
f↾Q(x) = f(x) for x ∈ dom(f↾Q).
In some cases, we are interested in a particular element of a tuple. To this end we
define a projection function, i.e. given 1 ≤ i ≤ n, πi : X1 × . . . × Xn → Xi, s.t.
πi((x1, . . . , xi, . . . , xn)) = xi, e.g. for (x, y, z) ∈ X × Y × Z, we have π1((x, y, z)) =
x, π2((x, y, z)) = y, π3((x, y, z)) = z. In the remainder of this thesis, we omit the
explicit surrounding braces of tuples, when applying a projection on top of them,
i.e. we write π1(x, y, z) rather than π1((x, y, z)).
Definition 2.1.5 (Multi-set). A multiset (or bag) generalizes the concept of a set
and allows elements to have a multiplicity, i.e. degree of membership, exceeding
one.
A multiset b over X is a function b : X → N0, which maps each element to its number of
occurrences in the multiset, i.e., b(x) denotes the number of times element x ∈ X
appears in b. Given a set X, B(X) := {b | b : X → N0} denotes all possible multisets
over set X, with size |b| = Σx∈X b(x).
For example, given set X = {x1, x2}, [x1², x2] is a multiset containing element x1 two
times and element x2 once. Note that sets are written using curly brackets, while
multisets are written using square brackets.
Given two multisets X ∈ B(A) and X′ ∈ B(B), their disjoint union M = X ⊎ X′ is the
multiset over A ∪ B for which M(c) = X(c) + X′(c) for all c ∈ A ∪ B (to ease the
readability we assume that X(c) = 0 if c ∉ dom(X)).
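
For illustration only (not part of the thesis notation), a multiset and the disjoint union can be represented in Python using collections.Counter, whose + operator adds multiplicities element-wise:

from collections import Counter

# the multiset [x1^2, x2] over X = {x1, x2}
b1 = Counter({"x1": 2, "x2": 1})
# another multiset, e.g. [x2, x3]
b2 = Counter({"x2": 1, "x3": 1})

# disjoint union: multiplicities are added element-wise
m = b1 + b2
print(m)                 # Counter({'x1': 2, 'x2': 2, 'x3': 1})
print(sum(m.values()))   # the size |m| = 5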

Definition 2.1.6 (Partition). A partition of a set X is a set of nonempty subsets
of X s.t. every element x ∈ X is in exactly one of these subsets. Hence, a partition
is mutually exclusive (no element is in more than one subset) and jointly exhaustive
(every element is in one subset). Mathematically, a partition P(X) ⊆ P(X), s.t.
• ∪X′∈P(X) X′ = X, and ∀X′ ∈ P(X)(X′ ≠ ∅)
• ∀X′ ≠ X′′ ∈ P(X), X′ ∩ X′′ = ∅
Definition 2.1.7 (Sequence). Sequences represent enumerated collections of ele-
ments which additionally, like multisets, allow their elements to appear multiple times.
However, within sequences we explicitly keep track of the order of the elements.
A sequence of length n over set X is an enumerated ordered collection of elements,
which is defined as a function σ : {1, 2, . . . , n} → X. We write σ = ⟨σ1, σ2, . . . , σn⟩,
denoting a sequence of length n, i.e. |σ| = n.
• ⟨⟩ denotes the empty sequence.
• For a given set X, X∗ is the set of all non-empty finite sequences over X plus
the empty sequence.
For example, given X = {x1, x2}, X∗ = {⟨⟩, ⟨x1⟩, ⟨x2⟩, ⟨x1, x2⟩, ⟨x2, x1⟩, ⟨x1, x2, x1⟩, . . . }
• σ1 · σ2 denotes the concatenation of sequences σ1 and σ2, e.g., ⟨a, b, c⟩ · ⟨d, e⟩ =
⟨a, b, c, d, e⟩.
 
Definition 2.1.8 (Vector). We define ~v ∈ Rⁿ = (v1, . . . , vn)ᵀ s.t. v1, . . . , vn ∈ R as an
n-dimensional column vector, and ~vᵀ = [v1, . . . , vn] denotes its transpose, which is
a row vector.


Definition 2.1.9 (Parikh Vector). Given a sequence σ ∈ X∗ and x ∈ X, we write
x ∈∗ σ if and only if ∃1 ≤ i ≤ |σ|(σ(i) = x). Furthermore, we define elem :
X∗ → P(X) with elem(σ) = {x ∈ X | x ∈∗ σ}. We additionally define the Parikh
abstraction, which counts the multiplicity of a certain element within a sequence,
i.e. parikh : X∗ → B(X). Given σ ∈ X∗, P⃗σ denotes its Parikh abstraction:

P⃗σ = [ xⁿ | x ∈ X ∧ n = |{1 ≤ i ≤ |σ| : σ(i) = x}| ]
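
As a small illustrative sketch (not part of the thesis implementation), elem and the Parikh abstraction of a sequence can be computed in a few lines of Python:

from collections import Counter

sigma = ["a", "b", "a", "c", "a"]

elem = set(sigma)         # elem(sigma) = {a, b, c}
parikh = Counter(sigma)   # Parikh abstraction: a -> 3, b -> 1, c -> 1

print(elem)    # {'a', 'b', 'c'}
print(parikh)  # Counter({'a': 3, 'b': 1, 'c': 1})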

2.2. Algorithms for Event Data

Definition 2.2.1 (Universes). We define the following universes:


• E denotes the universe of events,
• AN denotes the set of all possible attribute names,
• V denotes the set of all possible attribute values,
• C denotes the universe of cases,
• A denotes the universe of activities,
• A denotes the set of all possible activity names,

• T denotes the set of all possible timestamps


Definition 2.2.2 (Event, Attribute, Event Projections, [6]). Let E be the set of all
possible event identifiers, and AN the set of all possible attribute names. Events
can be described by several attributes, e.g. time of occurrence, resource involved,
name of the performed activity, etc. Given an event e ∈ E and a name n ∈ AN ,
we say that #n (e) is the value of attribute n for event e. If event e does not have
attribute n, then #n (e) = ⊥ (null value).
We define some shorthands for the most common, although optional, event at-
tributes:
• #activity (e) is the activity associated to event e,
• #time (e) is the timestamp of event e; we define ē as a shorthand for #time (e),
• #resource (e) is the resource associated to event e,
• #type (e) is the transaction type, also known as life-cycle, associated to event
e.
Some conventions apply for these attributes. In particular, timestamps should be
non-descending in the event log. This allows us to order the events recorded into a
sequence. Also, we assume the timestamp universe T such that #time (e) ∈ T for
any e ∈ E.
Given an event e ∈ E , we assume the existence of the following projection functions
[111]:
• Case projection:
a case identifier c ∈ C uniquely identifies the process instance to which the
event belongs. We define the corresponding projection function πc : E → C .
• Activity projection:
an activity a ∈ A describes a well defined activity within a process. We define
the corresponding projection function πa : E → A .
Definition 2.2.3 (Case, Trace). Let C denote the universe of cases. Cases have
attributes. For any case c ∈ C and name n ∈ AN : #n (c) is the value of attribute n
for case c. If case c does not have attribute n, then #n (c) = ⊥ [6].
Similarly, given c ∈ C, we can define a case attribute projection [111] function πn :
C → V maps a case to corresponding case attribute value, where πn (c) represents
the data value of case c for attribute n, i.e. ∀c ∈ C, ∃n ∈ AN, v ∈ V , s.t., πn (c) = v.
A trace is a mandatory attribute of a case. A case trace projection function πt maps
a case to the corresponding trace, πt (c) ∈ A∗ represents the trace of case with case
identifier c ∈ C [26].
Let L ∈ E ∗ be an event log. A trace, related to a case as identified by case identifier
c ∈ C , is a sequence σ ∈ E ∗ for which:
1. σ ⊆∗ L; Traces are subsequences of event logs.
2. elem(σ) = {e ∈ L|πc (e) = c}; The trace contains all events related to c
3. ∀1 ≤ i < j ≤ |σ| : σ(i) ≠ σ(j); each event appears only once.


Note that the same trace may appear multiple times in an event log.
Definition 2.2.4 (Event log). An event log is a set of cases L ⊆ C such that each
event occurs only once, i.e., for any two cases c1 , c2 ∈ C, c1 6= c2 : Ec1 ∩ Ec2 = ∅.
Alternatively, L ∈ P(C), s.t. ∀c ∈ L, πt (c) ≠ ⟨⟩.
If activity projection is used, then a trace corresponds to a sequence of activities.
In such a scenario, two or more cases can correspond to the same activity sequence.
Therefore, an event log is a multi-set of traces.
A trace σ ∈ E ∗ is a sequence of events. Let Σ = E ∗ be the universe of traces. An
event log L is a multiset of traces, i.e. L ∈ B(Σ).
Definition 2.2.5 (Simple Log, Trace Variant). A simple event log EL ∈ B(A∗ )
represents the control-flow perspective of event log L. This view is achieved by
projecting each trace in an event log on its corresponding activities, i.e. given σ ∈ L
we apply πa∗ (σ). Thus we transform each trace within the event log into a sequence
of activities:

EL = ⊎σ∈traces(L) πa∗(σ)

where traces(L) denotes the collection of the traces described by the event log.
Thus a simple event log is a multiset of sequences of activities, as multiple traces
of events are able to map to the same sequence of activities. As such, each member
of a simple event log is referred to as a trace, yet each sequence σ ∈ A∗ for which
EL(σ) > 0 is referred to as a trace-variant. Hence, in case we have EL(σ) = k,
there are k traces describing trace-variant σ, and, the cardinality of trace-variant σ
is k.
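
To make the notions of a simple log and trace variants concrete, the following sketch (illustrative only, with hypothetical activity names) represents a simple event log as a multiset of activity sequences and reads off the cardinality of each trace variant:

from collections import Counter

# each case projected onto its sequence of activities
traces = [
    ("register", "check", "approve"),
    ("register", "check", "reject"),
    ("register", "check", "approve"),
]

# simple event log: a multiset (Counter) of trace variants
simple_log = Counter(traces)

for variant, cardinality in simple_log.items():
    print(variant, "->", cardinality)
# ('register', 'check', 'approve') -> 2
# ('register', 'check', 'reject') -> 1
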
The goal of process mining is to automatically produce process models that accu-
rately describe processes by considering only an organization’s records of its opera-
tional processes [22]. Such records are typically captured in the form of event logs,
consisting of cases and events related to these cases. Using these event logs pro-
cess models can be discovered. The task of constructing a process model from an
event log is termed process discovery [5]. Many such process discovery techniques
have been developed, producing process models in various forms, such as Petri nets,
BPMN-models, EPCs, YAWL-models [22].
Definition 2.2.6 (Petri Nets, Workflow Nets and Block-structured Workflow Nets
[63]). A Petri net is a bipartite graph containing places and transitions, intercon-
nected by directed arcs. A transition models a process activity, places and arcs
model the ordering of process activities. We assume the standard semantics of Petri
nets here, see [78]. A workflow net is a Petri net having a single start place and a
single end place, modeling the start and end state of a process. Moreover, all nodes
are on a path from start to end [5]. A block-structured workflow net is a hierarchical
workflow net that can be divided recursively into parts having single entry and exit
points.
The model of Figure 2.1 describes the behavior of users rating an app [17]. First, users
start the form (s). They give either a good (g) or a bad (b) mark attached to a
comment (c) or a file (f ). Bad ratings get apologies (a); a silent transition (τ )
makes it possible to skip them. Finally, users can donate to the developers of the app (d).

Figure 2.1: Petri Net [17]

The ordering of events within a case is relevant, while the ordering of events among
cases is of no importance. Logs are analyzed for causal dependencies, e.g., if a task
is always followed by another task, it is likely that there is a causal relation between
both tasks. To analyze these relations, the following notation is used.
Definition 2.2.7 (Log-based ordering relations [5]). Let L be an event log over A,
i.e. L ∈ B(A∗ ). Let a, b ∈ A:
• a >L b if and only if there is a trace σ = ⟨t1, t2, . . . , tn⟩ and i ∈ {1, . . . , n − 1}
such that σ ∈ L and ti = a and ti+1 = b
• a →L b if and only if a >L b and b ≯L a
• a#L b if and only if a ≯L b and b ≯L a
• akL b if and only if a >L b and b >L a
Process discovery is challenging because the derived model has to be fitting, pre-
cise, general, and simple [90]. We give a general definition of the process discovery
algorithm.
Definition 2.2.8 (Process Discovery Algorithm [73]). Let L be an event log over
activities A and M be the universe of process models. Then process discovery
algorithm D is a function that takes event log as an input and returns a process
model over activities A , i.e., D : L → M
The majority of existing conventional process discovery algorithms share a common
underlying algorithmic mechanism. As a first step, the event log is transformed
into a data abstraction of the input event log on the basis of which they discover a
process model. A commonly used data abstraction is the directly follows relation.
Numerous discovery algorithms use it as a primary/supporting artefact, to discover
a process model [111]. The directly follows relation defines a multiset that contains all
direct precedence relations among the different activities present in the event log
[26]. We write (a, b) or a > b if activity a is directly followed by activity b.
Definition 2.2.9 (Direct Follows Relation [111]). Let L ∈ B(A∗) be an event log.
The direct follows relation >L is a multiset over A × A, i.e., >L ∈ B(A × A), for which,
given activities a1, a2 ∈ A:

>L(a1, a2) = Σσ∈L |{1 ≤ i ≤ |σ|−1 | σ(i) = a1 ∧ σ(i + 1) = a2}|

where >L denotes the multiset of all possible directly follows relations in event log
L and >L(a1, a2) denotes the number of occurrences of the directly follows relation
(a1, a2) in the log.
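
To illustrate how this abstraction can be obtained in practice, the following sketch (not the thesis implementation; activity names are hypothetical) counts the directly follows relation over a simple log represented as a multiset of trace variants:

from collections import Counter

# simple log: trace variant -> cardinality
simple_log = Counter({
    ("a", "b", "c"): 2,
    ("a", "c", "b"): 1,
})

# directly-follows relation >L as a multiset over A x A
df = Counter()
for variant, cardinality in simple_log.items():
    for i in range(len(variant) - 1):
        df[(variant[i], variant[i + 1])] += cardinality

print(df)  # Counter({('a', 'b'): 2, ('b', 'c'): 2, ('a', 'c'): 1, ('c', 'b'): 1})
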
Process discovery algorithms such as the Inductive Miner [63], the Heuristics Miner
[104], [105], the Fuzzy Miner [45], and most of the commercial process mining tools
use (amongst others) the directly follows relation as an intermediate structure [111].
As per [63] the directly-follows relation can be expressed in the directly-follows graph
of a log L, written G(L). It is a directed graph containing as nodes the activities
of L. An edge (a, b) is present in G(L) if and only if some trace ⟨. . . , a, b, . . .⟩ exists
in L. A node of G(L) is a start node if its activity is in Start(L), with definition
Start(G(L)) = Start(L). Similarly for end nodes in End(L), and End(G(L)). The
definition for G(M ) is similar. Start(L), Start(M ) and End(L), End(M ) denote the
sets of activities with which log L and model M start or end.
Definition 2.2.10 (Process tree[63]). A process tree is a compact abstract rep-
resentation of a block-structured workflow net: a rooted tree in which leaves are
labeled with activities and all other nodes are labeled with operators. A process
tree describes a language, an operator describes how the languages of its subtrees
are to be combined.
Assume a finite alphabet A of activities and a set ⊗ of process tree operators to be
given. Symbol τ ∉ A denotes the silent activity.
– α with α ∈ A ∪ {τ} is a process tree;
– Let M1, . . . , Mn with n > 0 be process trees and let ⊗ be a process tree
operator, then ⊗(M1, . . . , Mn) is a process tree.
Operator × means the exclusive choice between one of the subtrees, → means the
sequential execution of all subtrees, ⟲ means the structured loop of loop body M1
and alternative loop back paths M2, . . . , Mn, and ∧ means a parallel (interleaved)
execution.
Example process trees and their languages [62]:

L(→(a, ×(b, c))) = {⟨a, b⟩, ⟨a, c⟩}

L(∧(a, b)) = {⟨a, b⟩, ⟨b, a⟩}

L(∧(a, →(b, c))) = {⟨a, b, c⟩, ⟨b, a, c⟩, ⟨b, c, a⟩}

L(⟲(a, b)) = {⟨a⟩, ⟨a, b, a⟩, ⟨a, b, a, b, a⟩, . . . }

Process tree T (Figure 2.2) describes a sequence of a choice between a and a silent
activity τ, followed by activity b, and then c.

Figure 2.2: Process tree T[84]

The idea for the Inductive Miner [63] algorithm is to find in G(L) structures that
indicate the ’dominant’ operator that orders the behaviour. Each of the four op-
erators ×, →, ⟲, ∧ has a characteristic pattern in G(L) that can be identified by
finding a partitioning of the nodes of G(L) into n sets of nodes with characteristic
edges in between [63]. Given a set ⊗ of process tree operators, Leemans et al. [63]
define a framework B to discover a set of process models using a divide and conquer
approach. Given log L, B searches for possible splits of L, such that these logs
combined with an operator ⊗ can produce L again. It then recurses on the found
divisions and returns a cartesian product of the found models.
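
For readers who wish to experiment, the Inductive Miner is available in the open-source pm4py library; the following usage sketch assumes a recent pm4py version and a hypothetical XES file name:

import pm4py

# load an event log from a XES file (hypothetical file name)
log = pm4py.read_xes("example_log.xes")

# discover a process tree and an equivalent Petri net with the Inductive Miner
tree = pm4py.discover_process_tree_inductive(log)
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)

print(tree)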

2.2.1. Trace clustering

Process discovery is a largely unsupervised learning task in nature due to the fact
that event logs rarely contain negative events to record that a particular activity
could not have taken place [103]. Various studies illustrate that process discovery
techniques experience difficulties to render accurate and interpretable process models
out of event logs stemming from highly flexible environments [20], [41], [46], [98].
Since flexible environments typically allow for a wide spectrum of
potential behavior, the analysis results are equally unstructured [88].

Different approaches have been proposed to cope with the issue of high variety of
behavior that is captured in certain event logs. Next to event log filtering, event log
transformation [19] and tailor-made discovery techniques such as Fuzzy Mining [45],
trace clustering can be considered a versatile solution for reducing the complexity
of the learning task at hand [103]. The Fuzzy Miner [45] is a process discovery
technique developed to deal with complex and flexible process models. It connects
nodes that represent activities with edges indicating follows relations, taking into
account the relative significance of follows/precedes relations and allowing the user
to filter out edges using a slider. However, the process models obtained using the
Fuzzy Miner lack formal semantics [92]. If the event log shows a lot of behavioral
variability, the process model discovered for all traces together is too complex to
comprehend; in those cases, it makes sense to split the event log into clusters and
apply discovering techniques to the log clusters, thus obtaining a simpler model for
each cluster [68].

By dividing traces into different groups, process discovery techniques can be applied
on subsets of behavior and thus improve the accuracy and comprehensibility [103].
Given this diversity of an event log, it is a valid assumption that there are a number of
tacit process variants hidden within one event log, each significantly more structured
than the complete process [88]. Tacit processes will usually not be explicitly known,
however, the similarity of cases can be measured and used to divide the set of cases
into more homogeneous subsets.

Several techniques have been proposed in the last decade for trace clustering [20],
[21], [37], [43], [49], [88], [103]. In literature [17] they are partitioned into vector
space approaches [43], [88], context aware approaches [20], [21] and model-based
approaches [37], [49], [103]. Feature vector approaches require one to define feature
sets and transform traces in the event log to vectors representing the feature sets.


Figure 2.3: Using trace clustering in business process model discovery [89]

Definition 2.2.11 (Trace Clustering [17]). Given a log L, a trace clustering over L
is a partition over a (possibly proper) subset of the traces in L.
Feature sets describe the different attributes recorded in each case. Song et al.
[88] defined feature sets regarding various perspectives of process instances, i.e.,
the control-flow, resource, organization, and time perspectives. In their approach
traces are characterized by profiles, where a profile is a set of related items which
describe the trace from a specific perspective. Therefore, a profile with n items can
be considered as a function, which assigns to a trace a vector ⟨i1, i2, . . . , in⟩. These
resulting vectors can subsequently be used to calculate the distance between any two
traces, using a distance metric [88]. To this extent, the activity profile defines one
item per type of activity (i.e., event name) found in the log. Measuring an activity
item is performed by simply counting all events of a trace, which have that activity’s
name. Moreover, the transition profile could be considered, in which case the items
in this profile are direct following relations of the trace. For any combination of two
activity names ⟨A, B⟩, this profile contains an item measuring how often an event
with name A has been directly followed by another event with name B. The distance
between two of such vectors can be calculated using a distance function, e.g., the
Euclidean, Hamming or Jaccard distance [88].
Definition 2.2.12 (Distance Measures). The profile can be represented as an n-
dimensional vector where n indicates the number of items extracted from the process
log. Thus, case cj corresponds to the vector ⟨ij1, ij2, . . . , ijn⟩, where each ijk denotes
the number of appearances of item k in the case j [88]. Then, the Euclidean distance
[33], Hamming distance [48], and Jaccard distance [91] are defined as follows.

– Euclidean distance(cj, ck) = sqrt( Σl=1..n |ijl − ikl|² )

– Hamming distance(cj, ck) = ( Σl=1..n δ(ijl, ikl) ) / n, where

  δ(x, y) = 0, if (x > 0 ∧ y > 0) ∨ (x = y = 0); 1, otherwise        (2.1)

– Jaccard distance(cj, ck) = 1 − ( Σl=1..n ijl ikl ) / ( Σl=1..n ijl² + Σl=1..n ikl² − Σl=1..n ijl ikl )
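
A brief sketch of how such profile vectors and distances can be computed (illustrative Python with hypothetical traces, not the thesis implementation):

import math
from collections import Counter

def activity_profile(trace, items):
    # count, for each item (activity name), how often it occurs in the trace
    counts = Counter(trace)
    return [counts[item] for item in items]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return 1 - dot / denom if denom else 0.0

traces = [["a", "b", "c"], ["a", "c", "c"], ["a", "b", "c", "c"]]
items = sorted({act for t in traces for act in t})   # the profile items
profiles = [activity_profile(t, items) for t in traces]

print(euclidean(profiles[0], profiles[1]))   # distance between case 1 and case 2
print(jaccard(profiles[0], profiles[2]))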

Many algorithms exist to perform clustering, e.g., k-means [74], agglomerative/divisive
hierarchical clustering [29], DBScan [36], OPTICS [10], hierarchical DBScan
[25]. Some of these algorithms return strict groups of objects while others return
hierarchies of objects.

Figure 2.4: Trace clustering using SOM algorithm [109]

Similar to Yu et al. [110], we chose DBScan. The advantages of DBScan are that it
does not require one to specify the number of clusters, as opposed to K-Means. In
addition, DBScan can find arbitrarily shaped clusters, whereas K-Means assumes
convex clusters. Lastly, DBScan has a notion of noise, since it can detect and ignore
outliers in the dataset.
Definition 2.2.13 (DBScan [100]). The DBScan (density-based spatial clustering
of applications with noise) problem takes as input n points P = {p1, . . . , pn}, a distance
function d, and two parameters ε and minPts [36]. A point p is a core point if and
only if |{pi | pi ∈ P, d(p, pi) ≤ ε}| ≥ minPts. We denote the set of core points as C.
DBScan computes and outputs subsets of P, referred to as clusters. Each point in C
is in exactly one cluster, and two points p, q ∈ C are in the same cluster if and only if
there exists a list of points p̄1 = p, p̄2, . . . , p̄k−1, p̄k = q in C such that d(p̄i−1, p̄i) ≤ ε.
For all non-core points p ∈ P \ C, p belongs to cluster Ci if d(p, q) ≤ ε for at least
one point q ∈ C ∩ Ci . Note that a non-core point can belong to multiple clusters.

A non-core point belonging to at least one cluster is called a border point and a
non-core point belonging to no clusters is called a noise point. For a given set of
points and parameters ε and minPts, the clusters returned are unique.
The disadvantages of DBScan are that it is sensitive to the parameter choice. If ε
is too high, dense clusters are merged together [52]. In addition, DBScan is not
deterministic, as opposed to K-Means. Border points that are density-reachable
from more than one cluster can be part of either cluster, depending on the order in
which the data is processed [36].
In summary, the approach used in this study consists of abstracting the features
of the traces from event logs into profiles, such as the activity profile and the case or
event attribute profile. Then, trace clustering can be applied using the DBScan algorithm
and an appropriate distance measure. To this end, trace clustering
divides the raw event log L into sublogs Lv, i.e. |v| partitions of the event log, as per the
number of clusters obtained from the DBScan execution, where each sublog contains
the traces with similar behavior.
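
A minimal sketch of this preprocessing step, assuming scikit-learn is available and using hypothetical activity-profile vectors (the actual implementation is discussed in Chapter 5):

import numpy as np
from sklearn.cluster import DBSCAN

# activity-profile vectors, one row per trace (hypothetical values)
profiles = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 2, 1],
    [1, 0, 2, 1],
    [3, 0, 0, 5],   # an outlier trace
])

# cluster traces with DBScan; eps and min_samples require tuning per log
labels = DBSCAN(eps=1.0, min_samples=2, metric="euclidean").fit_predict(profiles)
print(labels)  # e.g. [0 0 1 1 -1]; label -1 marks noise traces

# split the log into sublogs, one per cluster (noise traces are dropped)
sublogs = {}
for trace_idx, label in enumerate(labels):
    if label != -1:
        sublogs.setdefault(label, []).append(trace_idx)
print(sublogs)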



3 | Related Work

Business process management [34] (BPM) revolves around the effective scheduling,
orchestration and coordination of the different activities and tasks that comprise a
(business) process. Nowadays, Business processes (BPs) are enacted in many com-
plex industrial (e.g., manufacturing, logistics, retail) and non-industrial (e.g., emer-
gency management, healthcare, smart environments) domains through a dedicated
generation of information systems, called Process Management Systems (PMSs) [81].
Organizations, currently faced with the challenge of keeping up with the increasing
digitization, seek to adapt existing business models and to respectively improve
the automation of their business processes [69]. Within BPM, the challenge of ac-
curately automating a business process is known as Business Process Automation
(BPA) [86]. Techniques that allow us to apply automation on the user interface (UI)
level of a computer system have been developed for decades, though it is only recently
that they were adopted under the name Robotic Process Automation (RPA) [93]. Enriquez et al.
[35] put forward that RPA could be considered a process-oriented optimization and
management strategy with a clear multidisciplinary nature because this strategy
involves multiple stakeholders (Subject Matter Experts – SME –, Business Analysts
– BA –, Software Developer – SD –, etc.). The entry barrier of adopting RPA in
processes that are already in place is lower compared to conventional BPA [107],
since RPA tools operate on the user interface level, rather than on the system level. Most
applications of RPA have been done for automating tasks of service business processes,
like validating the sale of insurance premiums, generating utility bills, paying health
care insurance claims, and keeping employee records up to date, among others [57].
Two broad types of RPA use cases are generally distinguished [66]:
• attended: An attended bot assists a user in performing her daily tasks; it can
be interrupted, paused or stopped at any time. Attended bots run on local machines
and manipulate the same applications as a user. They can be used to automate
routines that require dynamic inputs, human judgement or when the routine
is likely to have exceptions.
• unattended: Unattended bots are used for back-office functions. They usu-
ally run on an organization's servers and are suitable for executing deterministic
routines where all inputs and outputs are known in advance, and all execution
paths are well defined.
Telefonica's comparison of RPA versus Business Process Management Systems (BPMS)
used in automation reveals that RPA for 10 automated processes would pay back
in 10 months; in contrast, with a BPMS it would have taken up to three years to
pay back [59]. In addition, a number of case studies, in companies such as O2 and
Vodafone, have shown that RPA technology can lead to improvements in efficiency
for BPs involving routine work ([8], [40], [59]).
Determining which process steps (also called routines) are good candidates to be
automated is not only the first task to be performed when conducting an RPA
project, but also one of the key challenges ([7], [8], [35]). So far, most of the

research has focused on the establishment of criteria and guidelines ([14], [38], [40],
[57]), as a means to support organizations in addressing this challenge. According to
most criteria, the best candidates to be subjected to RPA projects are back office
areas [8], [40].
• Rule-based processes with high volume of manual tasks and handling time [40]
– Highly frequent tasks ([38], [58], [79])
– Low complexity of tasks [58]
– Degree to which the process is rule-based, Number of process steps [14]
– Repetitive process tasks with a high volume of transactions, sub-tasks,
and frequent interactions between different systems or interfaces [11]
– Execution Frequency (EF): count of each activity belonging to the
same process, Execution Time (ET): Average execution time of a pro-
cess task [101]
• Processes with fixed procedures, that are standardized and mature [40]
– Low level of exceptions, Involving an enclosed cognitive scope, Susceptible
to human errors [38]
– Processes that involve routine tasks, structured data and deterministic
outcomes [8]
– Degree of human intervention, Structuredness of data, Process standard-
ization, Degree of process maturity, Number of (known) exceptions [14]
– Streamlined process tasks with a-priori knowledge of possible events and
outcomes of process task executions [11], [108]
– Standardization (SD): number of different prior and following activi-
ties of each activity in a given process instance, Stability (ST): normal-
ized sum of the squared differences of the execution times of an activity
in a process instance and the average duration of that activity [101]
– Process tasks with a low probability of exception and a high predictability
of outcomes [9], [71]
• Automation Rate (AR) [101]; Process tasks with a small number of steps
that are already automated and offer less significant economic benefits [40]
– Degree of process digitization, Degree of similarity of environments, La-
bor intensity, Number of systems involved, Frequency of system related
changes, Risk-proneness [14]
– Failure Rate (FR): Throwbacks ratio of process tasks, i.e. unusual and
repetitive (partial) tasks until completion [101]
– Processes where a technical integration via the backend is too costly
and/or impossible [40]
Wanner et al.'s [101] system of indicators (highlighted in boldface) falls in line with
the current body of literature, as it was defined based on a literature analysis and

practical requirements of RPA case studies. Having identified a lack of comparability
and reproducibility, they suggest analyzing process execution data collected from
PAIS. Using these indicators to determine the automation potential of processes
is the first step of the method they define, previous to a pre-selection of relevant
processes. This pre-selection is based on the assumption that activities that have
identical start and end times are automated, while the rest is performed manually.
Specifically, they use these process features in event logs, to filter out these manually
performed tasks, and consider the rest for the calculation of the indicators. On the
other hand, the second step of their method, consists of analyzing the profitability
of process task automation, and finally, maximizing the economic value on RPA
projects. To achieve this, they formulate the problem of maximizing the economic
value of RPA as a zero-one knapsack problem [94].
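
For intuition, a generic zero-one knapsack formulation of such a selection problem can be sketched as follows (a standard knapsack model, not necessarily Wanner et al.'s exact notation), where each candidate process task i has an estimated automation value vi and implementation cost wi, xi ∈ {0, 1} indicates whether task i is selected, and B denotes the available automation budget:

maximize    Σi=1..n vi xi
subject to  Σi=1..n wi xi ≤ B,    xi ∈ {0, 1},  i = 1, . . . , n
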
The definition of these indicators lies in their analysis of the features in event log
data, that can be used for process task selection in RPA. However, they provide
definitions using high level notation, whereby event log E consists of a finite set of
processes Pi . Furthermore, processes are further composed of cases Cij , which repre-
sent process executions. Finally, cases describe a stream of activities Aijk that stand
for the tasks of a process, while each activity is associated with an activity type Ām .
Their indicators are defined at the process-activity level. They apply their method-
ology on process execution data from a Dutch financial services institute, provided
by the 4TU Centre for Research Data (see https://data.4tu.nl/). To operationalize the event log they first
draw upon the trace clustering approach as proposed by Song et al. [88] performing
process discovery; breaking down the event log into process tasks. Ultimately, their
method returns a list with quantified indicators and recommendations using a soft-
ware tool. However, they do not provide an evaluation of their framework and have not
made the generated software publicly available.
From an industry perspective, Gartner highlights in their recent Market Guide for
Process Mining, that process mining is a key enabler when it comes to RPA ini-
tiatives, since visualizing and understanding the process context as well as spotting
and prioritizing opportunities for task-level automation are huge success factors [53].
Geyer-Klingeberg et al. [40] also advocate for a synthesis of RPA and process min-
ing, as it will bring a decrease in the time and cost of training the robots. Putting
forth the standardization of processes prior to automation as a best practice, they argue
that process mining will provide insights about the root causes driving complexity. They
present a case study using a demo data set of the Purchase-to-Pay (P2P) process
obtained from a SAP ERP system, covering 395,315 purchase orders created over
one year. For each activity stored in the event log, they compute the automation
rate as the ratio of cases where an activity was executed by a system user divided by the
total number of instances of this activity. They claim that experience shows that
starting RPA with processes exhibiting low automation rates can generate faster
benefits than increasing the automation of processes that are already highly automated.
In addition, different robots can be trained on different process execution paths as
a means of testing which process variant is most effective, after process dis-
covery. Geyer-Klingeberg et al. make reference to the Vodafone case study findings,
supporting the role of process mining in alerting them that many orders with ex-
cessive throughput times arose due to complex deviations from the standard process
before they were sent out to the supplier [28]. Finally, since an RPA initiative is not
a one-time project but requires continuous monitoring of results, process mining can
provide fast and powerful insights into RPA’s impact on process performance KPIs
such as throughput times [40].
On the other hand, most of the body of research does not consider the Analysis
phase, i.e. an analysis of the context to determine which processes – or parts of
them – are candidates to be robotized [79], in isolation, but rather in tandem with the
Design phase, which entails detailing the set of actions, data flow, activities, etc.,
that must be implemented in the RPA process [35]. Process mining techniques
currently expect as input process execution data (e.g. records of process activity
starts and completions), whereas for RPA we need to use UI logs (clickstreams, keylogs)
as input [65], which are on a much lower level of granularity. To this extent, there
has been a shift from working on event logs to working on UI logs, when applying
process mining techniques. There has been development of techniques to analyze
User Interface (UI) logs in order to discover repetitive routines that are amenable
to automation via RPA tools ([18], [79]).
Discovering RPA routines is related to the problem of Automated Process Discovery
(APD) [12], which has been studied in the field of process mining. Jimenez-Ramirez
et al. transform recordings of screen, mouse and key events of back-office staff using an
information system (IS) into a standardized event log [44], including UI information,
using a series of image-analysis algorithms [79]. Then, the generated UI log is
used as input of a process discovery algorithm to obtain a process model, which is
subsequently reviewed by a business analyst using the ProM interface [95]. Their
proposal, thus, represents a tool-supported method, with the potential of shortening
the time required to discover the relevant parts of processes that can be robotized,
since it facilitates the cleansing of the initial models from noise, by means of focusing
on the frequency of paths metric. They evaluated their method via a use case
application in a major Spanish bank and a telecommunication company, whereby
they compared it against the conventional analysis and design activities for RPA
scenarios, which represent the a priori model, using measures such as # paths a priori
and # paths final. The fact that the basis of their evaluation is formed by these
two real-life processes considering a log regarding a single-user interaction presents
limitations.
In Bosco’s et al. [18] proposed method, an analysis of UI logs yields a discovery of
routines that are fully deterministic and hence automatable using RPA tools. Their
method consists of compressing the UI log into a Deterministic Acyclic Finite State
Automaton (DAFSA) [83] and subsequently, extract the flat-polygons (the candi-
date automatable routines), which represent actions sequences of different length.
The use of a lossless representation of event logs (DAFSAs) to discover candidate
automatable routines, as opposed to an automatically discovered process model is
justified by the fact that APD focuses on discovering control-flow models, without
data flow. Moreover, APD approaches generalize and under-approximate the log’s
behavior, in contrast, they seek to discover only sequences of actions that have been
observed in a UI log. Specifically, given a UI log, their method outputs a set of
routine specifications, consisting of an activation condition and a sequence of action
specifications. To this extent, in the discovered sequences of actions (routines), the
inputs of each action in the sequence (except the first one) can be derived from

the data observed in previous actions. To evaluate their method they generated
nine artificial UI logs designed with the CPN Tool [51], each log containing a differ-
ent number of automatable (sub)routines of varying complexity [13]. The only log
where their approach failed to discover routines is one that contains loops, due to
the fact that DAFSAs do not capture loops. The most significant limitations of their
technique are the fact that it can only discover perfectly sequential routines and its
inability to deal with noise, in which case it will either discover only sub-routines
of an otherwise automatable routine, or not discover the routine at all.
Leno et al. [65], in accordance with the shift from working with event logs to working
on UI logs when applying process mining techniques, aim to develop techniques
for data-aware discovery of procedural and declarative process models that can be
used for training RPA bots. This is because traditional APD techniques discover
control-flow models, while, in the context of RPA, executable specifications that
capture the mapping between the outputs and the inputs of the actions performed
during a routine, need to be discovered. Moreover, the discovered process model
needs to relate post-conditions of a task with pre-conditions, and more generally, it
needs to discover correlated conditions [65]. However, they make the case that there
is an absence of tools capable of recording UI logs that (1) can be given as input
to process mining tools and (2) contain information at the granularity level suitable
for RPA [67]. They specify that a UI log can describe multiple executions of a task;
while, one execution of a task is called task trace and it contains a sequence of actions
required to complete the task [66]. To this extent, they develop a tool, called Action
Logger, that meets functional requirements necessary to generate UI logs amenable
for further RPA-related analysis with process mining [67]. Furthermore, motivated
by the RPA use case of automating data transfers across multiple applications, they
address the problem of analyzing UI logs, in order to discover routines, where
a user transfers data from one spreadsheet or (Web) form to another [66]. These
routines can be discovered from the set of input-output examples induced by the
routine’s execution, and, can be codified as programs that transform a data structure
into another. Their empirical evaluation, done on a dataset that they build using
the Action Logger tool [67], demonstrates that the proposed technique can discover
various types of transformations. Nevertheless, it has various limitations: (1) it
assumes that the output fields are entirely derived from input fields, (2) it assumes
that the UI log consists of traces, such that each trace corresponds to a task execution,
and (3) it is unable to discover conditional behavior, where the condition depends
on the value of an input field.
Furthermore, the approach presented in Gao et al. [39] aims at extracting rules from
UI logs that can be used to automatically fill in forms. This approach, however,
does not discover any data transformations [66]. Moving away from both event and
UI logs, Leopold et al. [69] use as input data, textual process descriptions and
apply supervised machine learning and natural language processing techniques to
classify the process tasks as 1) manual task, 2) user task (interaction of a human
with an information system) or 3) automated task. They evaluate their method
using repeated 10-fold cross validation on a set of 424 activities from a total of 47
textual process descriptions. Their proposal has the shortcoming
of analyzing what is available in the documentation instead of the actual behavior
of the system [79].



4 | Indicators for the Automation Potential of Processes

In our study, we will draw our attention solely to the Analysis phase, thus, focusing
on the identification of tasks and/or process properties that are potentially eligible
to be automated using RPA. To this extent, we will use event log data, in order
not to assume any additional economic, as well as time investment, associated to
the recording of a UI log, suitable for the application of process mining techniques.
Following Wanner et al. [101] approach, we set out to define indicators to guide
companies towards the process tasks where they should direct their efforts, when it
comes to undertaking an RPA project. However, we do not introduce the concept
of activity type as they do. The indicator of Execution Frequency and Stability are
otherwise similar to their definitions and as such their names are also maintained.
However, in the case of the Execution Time, even if the naming is the same, con-
ceptually there are differences, as in our case there is a transaction type dependency
and a consistent trace condition to be met. Moreover, the Execution Frequency:case
and the Prior/follow variability indicators originate, but depart from the concepts
introduced with the definitions of Failure Rate and Standardization, respectively.
Also, contrary to Wanner et al. [101] we will make use of formal notation, as per
the concepts defined in Chapter Preliminaries. Finally, as a log preprocessing step
we apply trace clustering, resulting in Lv sublogs on which the indicators
are defined.

Figure 4.1: Approach overview

Once we obtain the indicators on each sublog-activity pair we need to apply a
methodology to determine the best candidate processes. Each sublog could be de-
fined as constituting a process, and so, consequently, we can apply an aggregating
function on each indicator, to obtain the overall score. However, to a great extent,
the indicators are defined on relative scale. Therefore, as per our statement of inten-
tion, the use of these indicators provides cues of automation potential. The objective
is that of directing the efforts of RPA towards a process and justifying the time and
economic investment associated to the undertaking of an RPA project. We seek
to address the gap that exists in terms of methodologies dealing with the Analysis
phase of the RPA lifecycle. We envision our approach falling into the initial steps
of the RPA pipeline, whereby posterior the Design and Construction phase would
need to be undertaken. That would entail obtaining event recordings in the form of
a UI log generated at a level suitable for extracting useful routines. The following
steps consist in determining the routine specifications in the form of control-flow
model enhanced with data flow and finally, generating an executable RPA script
that implements the specification [66].

4.1. Running Example

To facilitate the concept development and presentation, we will use a running exam-
ple throughout this Chapter. This running example is based on the BPI Challenge
2012 dataset [97]. This same dataset will be used for evaluation purposes, as well
(see Chapter Evaluation). The event log, used for the BPI Challenge 2012, contains
events related to the application process for a personal loan or overdraft within a
Dutch financial institute. The global process is defined over three sub-processes and
can be summarized as follows: a submitted loan/overdraft application is subjected
to some automatic checks; if it fails these checks, it is declined. In addition, customers
are contacted by phone when additional information needs to be obtained, as well as,
for incomplete/missing information. The application can be declined upon assessing
the responses of eligible applicants to which offers were sent or without making any
offer. Posterior to the final assessment, the application is either approved and acti-
vated, declined, or cancelled. Furthermore, certain cases are considered suspicious
and a check signalling fraud is performed.

Event Type                    Meaning

States starting with ‘A_’     States of the application
States starting with ‘O_’     States of the offer belonging to the application
States starting with ‘W_’     States of the work item belonging to the application
COMPLETE                      The task (of type ‘A_’ or ‘O_’) is completed
SCHEDULE                      The work item (of type ‘W_’) is created in the queue
                              (automatic step following manual actions)
START                         The work item (of type ‘W_’) is obtained by the resource
COMPLETE                      The work item (of type ‘W_’) is released by the resource
                              and put back in the queue or transferred to another
                              queue (SCHEDULE)

Table 4.1: Event type explanation

Each case contains a single case level attribute, AMOUNT_REQ, which indicates
the amount requested by the applicant. For each event, the extract shows the type
of event, life-cycle stage (Schedule, Start, Complete), a resource indicator and the
time of event completion.

REG_DATE                    concept:name    AMOUNT_REQ

2011-10-01 09:45:25.806     173703          13500

Table 4.2: Illustrative Case

org:resource lifecycle:transition concept:name time:timestamp


112 COMPLETE A_SUBMITTED 2011-10-01 09:45:25.806
112 COMPLETE A_PARTLYSUBMITTED 2011-10-01 09:45:25.981
112 COMPLETE A_PREACCEPTED 2011-10-01 09:46:18.211
112 SCHEDULE W_Completeren aanvraag 2011-10-01 09:46:18.674
10912 START W_Completeren aanvraag 2011-10-01 11:37:46.748
10912 COMPLETE W_Completeren aanvraag 2011-10-01 11:40:24.141
10912 START W_Completeren aanvraag 2011-10-01 12:40:26.958
10912 COMPLETE A_CANCELLED 2011-10-01 13:02:11.068
10912 COMPLETE W_Completeren aanvraag 2011-10-01 13:02:12.557

Table 4.3: Events of Illustrative Case

As mentioned previously, as a log preprocessing step we apply trace clustering,
resulting in Lv sublogs on which the indicators are defined. Consider the simple
sublog EL10 below:
EL10 = [⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads², W_Beoordelen fraude,
         W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩,
        ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads⁴, W_Beoordelen fraude,
         W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩,
        ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads⁸, A_CANCELLED⟩,
        ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads², W_Beoordelen fraude,
         W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩,
        ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads², W_Beoordelen fraude,
         W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩,
        ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads², W_Beoordelen fraude,
         W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩]

4.2. Execution Frequency

The importance of a task’s execution frequency for process selection in RPA is
widely supported by the literature. Clearly, the business case will be stronger for
those cases in terms of potential cost reductions and return on investment.
Definition 4.2.1 (Activity Vector [26]). Let A = {a₁, a₂, . . . , a_{|A|}} be the set of
all activities in the event log L. The |A|-dimensional activity vector of one sublog
Lv ⊆ L is defined as a function A⃗_Lv ∈ ℕ₀^{|A|}, given by

$$\vec{A}_{L^v} = (|L^v|_{a_1}, |L^v|_{a_2}, \ldots, |L^v|_{a_{|A|}}), \quad \text{or} \quad \vec{A}_{L^v}(a_j) = |L^v|_{a_j}$$

where |Lv|_{a_j} denotes the number of occurrences of the activity a_j in the sublog Lv.
To illustrate, assume we have a simple log EL from which we obtain two sublogs
EL1 = [⟨a, b⟩, ⟨a, b, c⟩] and EL2 = [⟨a, b⟩, ⟨a, c⟩]. The set of all activities of the
complete event log L is {a, b, c}, so the activity vector for the sublog L1 is A⃗_L1 =
(2, 2, 1) and similarly A⃗_L2 = (2, 1, 1).

To this extent, Execution Frequency (EF (Lv )) is defined as the count of each activity
belonging to the same Lv sublog.

$$EF(L^v) = |L^v|_{a_j}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$

The feature vector regarding the set of all activities, named activity vector, is in the
form of the Parikh vector introduced in definition 2.1.9, defined here for illustration
purposes.
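For illustration purposes, the activity vector and the Execution Frequency indicator can be computed directly from a sublog represented as a plain list of traces, each trace being a list of activity labels. The following is a minimal sketch under that representation assumption; the function and variable names are illustrative and do not correspond to our actual implementation.

```python
from collections import Counter

def activity_vector(sublog, activities):
    """Count the occurrences of each activity over all traces of a sublog (|Lv|_aj)."""
    counts = Counter(event for trace in sublog for event in trace)
    # return the count for every activity of the complete event log L
    return {a: counts.get(a, 0) for a in activities}

# sublogs EL1 and EL2 from the illustration above
EL1 = [["a", "b"], ["a", "b", "c"]]
EL2 = [["a", "b"], ["a", "c"]]
activities = ["a", "b", "c"]

print(activity_vector(EL1, activities))  # {'a': 2, 'b': 2, 'c': 1}
print(activity_vector(EL2, activities))  # {'a': 2, 'b': 1, 'c': 1}
```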

4.2.1. Execution Frequency: case


On the other hand, in addition to obtaining an overall view of Execution Frequency
we can obtain an estimate from a case perspective. Then, the |A|-dimensional
activity vector of trace σv is defined as
$$\vec{A}_{\sigma_v} = (|\sigma_v|_{a_1}, |\sigma_v|_{a_2}, \ldots, |\sigma_v|_{a_{|A|}}), \quad \text{or} \quad \vec{A}_{\sigma_v}(a_j) = |\sigma_v|_{a_j}$$

where |σv|_{a_j} denotes the number of occurrences of the activity a_j in the trace variant
σv. A trace variant σv is each sequence σ ∈ A* for which Lv(σ) > 0.
We calculate, for each sublog Lv ⊆ L, the frequency of variants and activities for
all unique variants in which the activity is present. We then obtain the weighted
average as per the definition below.

$$EF_c(L^v) = \frac{\sum_{\{\sigma_v \in traces(L^v) \mid a_j \in \sigma_v\}} |\sigma_v|_{a_j} \times L^v(\sigma_v)}{\sum_{\{\sigma_v \in traces(L^v) \mid a_j \in \sigma_v\}} L^v(\sigma_v)}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$

Consider simple sublog EL10 introduced in section 4.1. It contains six traces and
three trace variants. The first trace variant appears 4 times, while the other two
just once.
σ0 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads², W_Beoordelen fraude,
     W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩
σ1 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads⁴, W_Beoordelen fraude,
     W_Afhandelen leads, W_Beoordelen fraude³, A_DECLINED, W_Beoordelen fraude⟩
σ2 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads⁸, A_CANCELLED⟩


Then, we obtain the activity count and variant count for each activity aj .

variant activity activity count variant count activity × variant


0 W_Beoordelen fraude 5 4 20
1 W_Beoordelen fraude 5 1 5
0 A_PARTLYSUBMITTED 1 4 4
1 A_PARTLYSUBMITTED 1 1 1
2 A_PARTLYSUBMITTED 1 1 1

Table 4.4: EFc (L10 ): calculation steps

Then for activity W_Beoordelen fraude we obtain

    EFc(L10) = 25/5 = 5,    EF(L10) = 25

while for activity A_PARTLYSUBMITTED we obtain

    EFc(L10) = 6/6 = 1,    EF(L10) = 6

One indicator informs us of whether an activity is executed frequently in the sublog
overall, while the other tells us whether this activity is executed repeatedly within
individual cases. Although activities repeated often, i.e. those having a high EF(Lv),
are candidates for RPA, those having a high EFc(Lv) should be investigated to
understand the root causes of the repeats.
Wanner et al. put forward the hypothesis in their definition of Failure Rate that
during the process execution per case, activities are repeated to correct an error.
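A minimal sketch of how EFc(Lv) could be computed under the same list-of-traces representation assumed above: traces are grouped into variants by their exact activity sequence and the per-variant activity counts are averaged, weighted by the variant frequency. The function name is illustrative.

```python
from collections import Counter

def execution_frequency_case(sublog, activity):
    """Weighted average of per-variant activity counts (EFc), weighted by how
    often each trace variant occurs in the sublog."""
    variant_counts = Counter(tuple(trace) for trace in sublog)
    num, den = 0, 0
    for variant, freq in variant_counts.items():
        count_in_variant = variant.count(activity)
        if count_in_variant > 0:          # only variants containing the activity
            num += count_in_variant * freq
            den += freq
    return num / den if den else 0.0

# for EL10 above: W_Beoordelen fraude -> 25/5 = 5, A_PARTLYSUBMITTED -> 6/6 = 1
```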

4.3. Execution Time

A trace is a sequence of events, denoting for a case what activities were executed.
As per the definition 2.2.2 in chapter 2, transaction type is an attribute associated
to event that refers to the life-cycle of activities. Examples are schedule, start,
complete and suspend [6]. From Figure 4.2, we see that for b only the completion
of the activity instance is recorded, while activity instance a has two events. On the
other hand, activity instance c is first scheduled for execution, then the activity is
started and finally it completes.

Figure 4.2: Transactional events for three activity instances


An activity instance is the execution of an activity in a trace, and may consist of a
start event and a completion event, as well as events of other transaction types [64].
Similar to [64]:

Definition 4.3.1 (Consistent trace [23]). A trace is consistent if and only if each
start event has a corresponding completion event and vice versa.

We can measure the duration of an activity instance by subtracting the timestamp
of the start event from that of the complete event, under the assumption that no
more than one instance of the given activity is being executed concurrently.

With E as the set of all possible event identifiers, as per the definition 2.2.1 in
Chapter 2, then it is partitioned into two sets [75]:

– E^S = {e ∈ E | #_type(e) = start} (i.e., all start events in L) and

– E^C = {e ∈ E | #_type(e) = complete} (i.e., all complete events in L)

Definition 4.3.2 (Service Time [75]). Given E, E^S and E^C, we define a function st :
E → T that maps events onto the duration of the corresponding activity, i.e., the service
time (the time a resource is busy with a task [64]). We assume that there is a one-
to-one correspondence between E^S and E^C, i.e., any es ∈ E^S corresponds to precisely
one event ec ∈ E^C and vice versa. The service times of these events are equal, i.e.,
st(es) = st(ec) = ēc − ēs.

Execution Time (ET(Lv)) is the average duration of an activity aj belonging to the
same sublog Lv ⊆ L. We calculate it for each sublog Lv by summing the service
time over all events belonging to an activity aj and dividing it by the cardinality of
the set of events belonging to that activity. To this extent, only consistent traces
will yield a value higher than 0.

$$ET(L^v) = \frac{\sum_{\pi_{a_j}(e_s)} st\{e_s \in E^S_{L^v} \mid \exists_{\sigma \in L^v}\, e_s \in \sigma\}}{|E^S_{L^v}\!\restriction_{\{a_j\}}|}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$

org:resource lifecycle:transition concept:name time:timestamp


SCHEDULE W_Valideren aanvraag 2011-10-31 10:09:40.861
10609 START W_Valideren aanvraag 2011-11-01 14:55:48.028
10609 COMPLETE W_Valideren aanvraag 2011-11-01 15:20:02.844
10609 COMPLETE A_REGISTERED 2011-11-04 13:37:52.243

Table 4.5: Sample events of an activity instance of case 176326

As per Table 4.5, for sublog L8 and for activity W_Valideren aanvraag the
service time is 1454.816 seconds, while for A_REGISTERED it is 0. Then, the Execution
Time of W_Valideren aanvraag for sublog L8 is obtained by calculating the average
duration over all its activity instances.


4.4. Inverse Stability

Similar to Wanner et al., we introduce the indicator of Inverse Stability, based
on the variance of an activity's execution times, since our candidate tasks should be
fixed and standardized. A high Inverse Stability would signal non-deterministic outcomes
and low predictability. We measure the Inverse Stability (ST⁻¹) of an activity based
on the sum of the squared differences between the execution time of each activity
instance and the average duration of that activity, and we normalize the result based
on the number of activity executions.

$$ST^{-1}(L^v) = \frac{\sum_{\pi_{a_j}(e_s)} \left(st\{e_s \in E^S_{L^v} \mid \exists_{\sigma \in L^v}\, e_s \in \sigma\} - ET(L^v)\right)^2}{EF(L^v) \times ET(L^v)}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$
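The two time-based indicators can be sketched as follows, assuming each trace is available as a list of (activity, lifecycle, timestamp) tuples with datetime timestamps, that start and complete events can be paired greedily because instances of the same activity do not overlap (as assumed above), and that EF is approximated here by the number of paired activity instances. All names are illustrative.

```python
from collections import defaultdict

def service_times(trace):
    """Pair each 'start' event with the next 'complete' event of the same activity
    and return the resulting durations (in seconds) per activity.
    Activities without start/complete pairs get no entry (their ET would be 0)."""
    open_starts, durations = {}, defaultdict(list)
    for activity, lifecycle, ts in trace:
        if lifecycle == "start":
            open_starts[activity] = ts
        elif lifecycle == "complete" and activity in open_starts:
            durations[activity].append((ts - open_starts.pop(activity)).total_seconds())
    return durations

def execution_time_and_inverse_stability(sublog):
    """ET(Lv): mean service time per activity; ST^-1(Lv): sum of squared deviations
    from ET, normalized by the number of paired instances times ET."""
    per_activity = defaultdict(list)
    for trace in sublog:
        for activity, ds in service_times(trace).items():
            per_activity[activity].extend(ds)
    result = {}
    for activity, ds in per_activity.items():
        et = sum(ds) / len(ds)
        st_inv = sum((d - et) ** 2 for d in ds) / (len(ds) * et) if et > 0 else 0.0
        result[activity] = (et, st_inv)
    return result
```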

4.5. Prior/follow variability

Similar to Wanner et al.'s concept of Standardization (SD), we analyze each activ-
ity instance and get the number of different prior and following activities. The
variability of prior and following activities can signal that we are not dealing with
fixed procedures that are standardized and mature. In Chapter 2 we introduced the
definition 2.2.7 of log-based order relations.
Let L be an event log over A and a, b ∈ A. |a >L b| is the number of times a is directly
followed by b in L, i.e.,

$$|a >_L b| = \sum_{\sigma \in L} L(\sigma) \times |\{1 \le i < |\sigma| \mid \sigma(i) = a \wedge \sigma(i + 1) = b\}|$$

Then, we can define the Direct Succession Vector similar to definition 4.2.1 as follows.
Definition 4.5.1 (Direct Succession Vector [26]). Let L ∈ B(A*) be an event
log and >L be the multiset of all possible directly follows relations in event log L. Let
Lv ⊆ L be one sublog and a1, a2 ∈ A be two activities; the direct succession vector
of the sublog Lv is defined as the function S⃗_Lv ∈ ℕ₀^{|A×A|}, given by

$$\vec{S}_{L^v}(a_1, a_2) = \sum_{\sigma \in L^v} |\{1 \le i \le |\sigma|-1 \mid \sigma(i) = a_1 \wedge \sigma(i + 1) = a_2\}|, \quad \forall a_1, a_2 \in A$$

Let Lv ⊆ L be one sublog and aj ∈ A any activity. The directly follows relations of
sublog Lv projected on activity aj are denoted by >Lv↾{aj} and the corresponding direct
succession vector by S⃗_Lv↾{aj}. Furthermore, activity aj can either be the first or the
second element of the tuple. We split the directly follows relations and the direct
succession vector based on this fact and denote them with S⃗_Lv↾ˢ{aj} and S⃗_Lv↾ᵉ{aj}
respectively, depending on whether activity aj is the first or the second element.
The Prior/follow variability (PFv(Lv)) indicator is defined as follows:

$$PFv(L^v) = \frac{1}{\|\vec{S}_{L^v}\!\restriction^{s}_{\{a_j\}} > 0\|} \times \frac{\max(\vec{S}_{L^v}\!\restriction^{s}_{\{a_j\}})}{\sum(\vec{S}_{L^v}\!\restriction^{s}_{\{a_j\}})} + \frac{1}{\|\vec{S}_{L^v}\!\restriction^{e}_{\{a_j\}} > 0\|} \times \frac{\max(\vec{S}_{L^v}\!\restriction^{e}_{\{a_j\}})}{\sum(\vec{S}_{L^v}\!\restriction^{e}_{\{a_j\}})}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$


with ‖v⃗‖ as the length of a vector v⃗. The first component relates to the following
activities after aj and the second relates to the prior activities before aj ; both range
from (0, 1) and consequently the sum ranges from (0, 2). The best candidates are
the activities with an indicator closer to 2.
To illustrate, assume we have two simple sublogs

    EL1 = [⟨a, b⟩, ⟨a, b, c⟩, ⟨a, c⟩, ⟨b, a⟩],    EL2 = [⟨a, b⟩, ⟨a, c⟩]

of a certain simple event log EL. Then, the directly follows relations and the direct
succession vectors are

    >EL = [(a, b)³, (a, c)², (b, a), (b, c)],    S⃗_EL1 = (2, 1, 1, 1),    S⃗_EL2 = (1, 1, 0, 0)

Consider activity a.
Splitting into S⃗_EL1↾ˢ{a} = (2, 1, 0, 0) and S⃗_EL1↾ᵉ{a} = (0, 0, 1, 0), we then obtain

    PFv(EL1) = 1/2 × 2/3 + 1/1 × 1/1 = 4/3

Consider now the simple sublog EL10 introduced in section 4.1. We obtain the directly
follows relations and the direct succession vector:

    >EL10 = [(A_DECLINED, W_Beoordelen fraude)⁵,
             (A_PARTLYSUBMITTED, W_Afhandelen leads)⁶,
             (A_SUBMITTED, A_PARTLYSUBMITTED)⁶,
             (W_Afhandelen leads, A_CANCELLED),
             (W_Afhandelen leads, W_Beoordelen fraude)¹⁰,
             (W_Beoordelen fraude, A_DECLINED)⁵,
             (W_Beoordelen fraude, W_Afhandelen leads)⁵]

    S⃗_EL10 = (5, 6, 6, 1, 10, 5, 5)

Consider activity W_Beoordelen fraude, denominated wbf.
Splitting into S⃗_EL10↾ˢ{wbf} = (0, 0, 0, 0, 0, 5, 5) and S⃗_EL10↾ᵉ{wbf} = (5, 0, 0, 0, 10, 0, 0), we then obtain

    PFv(EL10) = 1/2 × 5/10 + 1/2 × 10/15 = 7/12

On the other hand, for activity W_Afhandelen leads, denominated wal, we obtain
a higher value.
Splitting into S⃗_EL10↾ˢ{wal} = (0, 0, 0, 1, 10, 0, 0) and S⃗_EL10↾ᵉ{wal} = (0, 6, 0, 0, 0, 0, 5), we then obtain

    PFv(EL10) = 1/2 × 10/11 + 1/2 × 6/11 = 8/11
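A minimal sketch of the Prior/follow variability computation under the list-of-traces assumption used before; following the worked example above, successions of an activity by itself are not counted. Function and variable names are illustrative.

```python
from collections import Counter

def pfv(sublog, activity):
    """Prior/follow variability of an activity in a sublog of traces.
    For followers of the activity and for its predecessors separately:
    (1 / number of distinct neighbours) * (max count / total count)."""
    followers, predecessors = Counter(), Counter()
    for trace in sublog:
        for i in range(len(trace) - 1):
            if trace[i] == activity and trace[i + 1] != activity:
                followers[trace[i + 1]] += 1
            if trace[i + 1] == activity and trace[i] != activity:
                predecessors[trace[i]] += 1
    score = 0.0
    for counts in (followers, predecessors):
        if counts:
            score += (1 / len(counts)) * (max(counts.values()) / sum(counts.values()))
    return score

# for EL10 above: pfv(EL10, "W_Beoordelen fraude") = 7/12, pfv(EL10, "W_Afhandelen leads") = 8/11
```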



5 | Implementation

Process mining is supported by several open-source (ProM [32], RapidProM [4],
[102], Apromore [85], bupaR [50], PM4Py [16], PMLAB [27]) and commercial (Disco,
Celonis, ProcessGold, QPR ProcessAnalyzer, etc.) software. Apart from Rapid-
ProM (an extension of the data science framework RapidMiner) and Apromore, the
majority of the open-source projects provide a standalone tool that only allows to
import an event log and to perform process mining analyses on it. PM4Py and
bupaR provide a set of process mining features as a library, and this provides inte-
gration with the corresponding Python and R ecosystem [15].
The Process Mining for Python (PM4Py) framework supports various tasks of pro-
cess mining. PM4Py is available for installation in Python 3.6 through the command
pip install pm4py in Windows, Mac OS X and Linux environments. Additional pre-
requisites 1 have to be installed (e.g. for Windows, GraphViz and Microsoft Visual
Studio 14 C++ compiler are required). In addition to that, the official Github repos-
itory 2 could be cloned to get access to stable and development branches. PM4Py
documentation is available online 3 . It is intended to be used in academic as well
as industrial projects. It is an open-source library that offers various tools and
algorithms facilitating process mining. The tools and algorithms are implemented
in Python 4 , a popular object-oriented programming language commonly used
in data science as well as other computer science areas. The PM4Py-framework
is developed by the i9 computer science chair at RWTH Aachen University under
the lead of Professor Wil M. P. van der Aalst and was launched in early 2019 [16].
The choice of PM4Py was made because it allows for the easy integration of process
mining algorithms with algorithms from other data science fields, implemented in
various state-of-the-art Python packages [16]. Moreover, it creates a collabo-
rative ecosystem that easily allows researchers and practitioners to share valuable
code and results with the process mining world [16].
For storing event logs, the very first approach had been MXML which was intro-
duced in 2003 [1]. This format is based on XML. Although MXML was a successful
approach, it was difficult to extend with new data types. Therefore, XES was intro-
duced, which has a more flexible tag-based structure. Currently, XES is the main
standard format for storing logs which has been accepted by the "IEEE Task Force
on Process Mining" in 2010 [44]. An XES log is a set of traces where each trace
consists of a sequence of events. In addition to that, all events are sorted based on
their timestamps inside a trace. All events in a log contain three main attributes:
• Case ID: represented as a trace attribute concept:name is a unique value for
identifying a trace.
• Event name: represented as an event attribute concept:name is used for defin-
ing the activity associated to an event.
1 https://pm4py.fit.fraunhofer.de/install
2 https://github.com/pm4py/pm4py-source
3 https://pm4py.fit.fraunhofer.de/docs
4 http://python.org


• Timestamp: represented as an event attribute time:timestamp is an attribute
for identifying the timestamp of an event.
Depending on the log handler, the log is either persistently loaded in memory, or
loaded in memory when required. In the current version, two in-memory handlers
are provided: a XES handler (that loads an event log in the XES format, and uses
the PM4Py EventLog structure to store events and cases), and a CSV handler (that
loads an event log in the CSV/Parquet 5 format, and uses Pandas dataframes 6 ).

5.1. Code implementation

By default, our code takes an event log in XES format as input, but it can also be
configured to accept an event log in CSV format. Trace clustering is then performed
using pm4py-clustering. The algorithm that has been implemented is basic:
• A one-hot encoding of the activities of the single events is obtained.
• A PCA is performed to reduce the number of components that are considered
by the clustering algorithm, using the scikit-learn implementation [76].
• The DBScan clustering algorithm is applied in order to split the traces into
groups, using the scikit-learn implementation [76].
The log is transformed to a log representation, using the string event attribute
concept:name by default and an optionally provided string trace attribute. As
trace clustering has not been the main focus of this work, static parameter values are used.
Specifically, clusters are obtained using DBScan, with the ε parameter equal to 0.3 and
MinPts equal to 5, applied to PCA-projected data using 3 components.
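The following sketch approximates the described preprocessing directly with scikit-learn rather than through the PM4Py clustering utilities used in our code; the presence-based one-hot encoding of activities per trace is an assumption about the exact log representation, and all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_traces(traces, activities, eps=0.3, min_pts=5, n_components=3):
    """Group traces into sublogs: one-hot encode the activities occurring in each
    trace, project with PCA and cluster the projections with DBSCAN."""
    # presence-based one-hot encoding of the activities of each trace
    X = np.array([[1 if a in trace else 0 for a in activities] for trace in traces])
    X_proj = PCA(n_components=n_components).fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_proj)
    # split the traces into sublogs according to the cluster labels (-1 = noise)
    sublogs = {}
    for trace, label in zip(traces, labels):
        sublogs.setdefault(label, []).append(trace)
    return sublogs
```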
In an exploratory setting we could analyze the PCA explained variance and consider
choosing the number of components based on it. In addition, since the presence of
outliers could be detrimental to the obtained principal components, we could instead
consider following a weighted approach. To this extent, a possible approach would
be to apply weighted PCA based on the eigenvalue decomposition of the weighted
covariance matrix following Delchambre [30]. The weights would be obtained by
first fitting a Minimum Covariance Determinant (MCD) robust estimator and then
getting the Mahalanobis distance. By using a robust estimator of covariance, there is
an associated guarantee that the estimation is resistant to “erroneous” observations
in the data set and that the associated Mahalanobis distances accurately reflect the
true organisation of the observations. However, this method can’t handle the case
when the estimated covariance matrix of the support data is equal to 0 and therefore
the determinant is equal to 0.
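A possible sketch of this weighting step, assuming scikit-learn's MinCovDet estimator; turning the Mahalanobis distances into weights (here a simple inverse) is an illustrative choice and not prescribed by Delchambre [30].

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_weights(X):
    """Fit a Minimum Covariance Determinant estimator and derive per-observation
    weights from the Mahalanobis distances, so that likely outliers contribute
    less to a subsequent weighted PCA."""
    mcd = MinCovDet().fit(X)
    d2 = mcd.mahalanobis(X)           # squared Mahalanobis distances
    return 1.0 / (1.0 + np.sqrt(d2))  # illustrative down-weighting of outliers
```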
On the other hand, DBScan identifies noisy objects and is therefore resistant to
outliers. It requires two parameters to be initialized:
• MinPts: Determines the minimum number of points required to form a cluster.
• ε: represents the distance threshold between two points that can be considered
to be similar.
5 https://parquet.apache.org/
6 https://pandas.pydata.org/


A basic approach to determining the parameters ε and MinPts is to look
at the behavior of the distance from a point to its kth nearest neighbor, which is
called k-dist. The k-dists are computed for all the data points for some k, sorted
in ascending order, and then plotted using the sorted values; as a result, a sharp
change is expected to be seen. The value of k-dist at this sharp change corresponds to a
suitable value of ε [36], [72].
On the other hand, finding the optimal offset of ε can also be achieved through
experiments. A typical way to aid the user is to provide suggestions for epsilon by
calculating the mean or median distance based on a sample set of available traces
[110]. Experiments on data from various domains have shown that taking 1/16 of the
median distance provides an offset that produces clusters with a finer granularity that
is perceived by users to be a good starting point for the threshold [60].
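A minimal sketch of the k-dist heuristic described above, using scikit-learn's NearestNeighbors; the choice of k (typically tied to MinPts) and the subsequent visual inspection of the curve are left to the analyst.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_dist(X, k=5):
    """Return the sorted distances from every point to its k-th nearest neighbour;
    a sharp change ('elbow') in this curve suggests a value for eps."""
    # k + 1 because the nearest neighbour of a point is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])
```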

Figure 5.1: overall system specification

In order to calculate the indicators presented in chapter 4, process mining tech-
niques provided by the PM4Py library are used: (1) filtering based on time-frame, case
performance, trace endpoints, trace variants, attributes, and paths, (2) case man-
agement: statistics on variants and cases. Other Python modules are used for data
management and treatment, such as pandas [80], numpy [99], re (regular expres-
sion operations) and os (miscellaneous operating system interfaces). The code is
included in appendix A. In addition, a github repository was created where we
share the code, as well as the data and a Jupyter Notebook that is used in
chapter 6 7 .

7 https://github.com/mariagkotsopoulou/MIRI-thesis



6 | Evaluation

Fundamentally, in this study we set out to answer two main Research Questions, that
were introduced in 1.3. On the one hand, in an effort to support the Analysis phase
of an RPA project, we seek to devise a methodology to determine the suitability of
a process as an automation candidate. To this extent, having as a starting point an
event log, we follow a trace clustering approach resulting in v sublogs, on which we
apply the set of the indicators defined in Chapter 4. Following the assumption that
each sublog could constitute a process, we would then need to apply an aggregating
function on each indicator and specify an objective function, to obtain the overall
score. We will explore this first Research Question using the BPI Challenge 2012
dataset [97].

On the other hand, we also put forward a more challenging, though essential in premise,
Research Question, which is that of establishing an objective and generalizable
methodology that our decision support system should fall under. In order to sustain
usefulness of the contribution put forward through the definition of our indicators,
it is necessary to have an evaluation framework. Nevertheless, there are no quality
measures or baseline models to compare against, as such in a Machine Learning
setting. There are neither reference datasets, like for example an annotated image
dataset, to draw a parallel from the image classification context. This is indeed an
important drawback in terms of novel concept development. During the literature
review process of this study, we came across a work by Bosco et al. [18] in which
they generated nine artificial UI logs and, most importantly, made them publicly available.
We will explore this second Research Question using those UI logs.

6.1. BPI Challenge 2012

The BPI Challenge 2012 dataset [97] was also used as a Running example in Chapter
4 to facilitate the concept development and presentation of the indicators, defined
with the objective of guiding the identification of tasks and/or process properties
that are eligible to be automated using RPA. In addition, a github repository was
created were we share this evaluation task in the form of a Jupyter Notebook 1 .

The event log contains 13,087 traces and 262,200 events, recorded from October 1,
2011 to March 14, 2012, starting with a customer submitting an application and
ending with eventual conclusion of that application into an Approval, Cancellation
or Rejection (Declined). The last case is started on February 29, 2012, which means
that 13,087 cases were received within a 152-day period. This means that on
average 86 new applications are received per day, including weekends and holi-
days. According to the description provided on the challenge website [97], the event
log contains events from three intertwined subprocesses, which can be distinguished
by the first letter of each event name (A, O and W ). The A subprocess is concerned
with handling the applications themselves. The O subprocess handles offers sent
1 https://github.com/mariagkotsopoulou/MIRI-thesis


to customers for certain applications. The W process describes how work items,
belonging to the application, are processed.

Type                  Event                                        Event description

"A_"                  A_SUBMITTED / A_PARTLYSUBMITTED              Initial application submission
Application Events    A_PREACCEPTED                                Application pre-accepted but requires additional information
                      A_ACCEPTED                                   Application accepted and pending screen for completeness
                      A_FINALIZED                                  Application finalized after passing screen for completeness
                      A_APPROVED / A_REGISTERED / A_ACTIVATED      End state of successful (approved) applications
                      A_CANCELLED, A_DECLINED                      End states of unsuccessful applications
"O_"                  O_SELECTED                                   Applicant selected to receive offer
Offer Events          O_PREPARED / O_SENT                          Offer prepared and transmitted to applicant
                      O_SENT BACK                                  Offer response received from applicant
                      O_ACCEPTED                                   End state of successful offer
                      O_CANCELLED, O_DECLINED                      End states of unsuccessful offers
"W_"                  W_Afhandelen leads                           Following up on incomplete initial submissions
Work item Events      W_Completeren aanvraag                       Completing pre-accepted applications
                      W_Nabellen offertes                          Follow up after transmitting offers to qualified applicants
                      W_Valideren aanvraag                         Assessing the application
                      W_Nabellen incomplete dossiers               Seeking additional information during assessment phase
                      W_Beoordelen fraude                          Investigating suspect fraud cases
                      W_Wijzigen contractgegevens                  Modifying approved contracts

Table 6.1: Names and Descriptions of Events

Our approach begins with a log preprocessing step followed by trace clustering.
In the case of the BPI Challenge 2012 dataset [97], we transform the log to a log
representation, using the string event attribute concept:name and the string trace at-
tribute AMOUNT_REQ. This results in Lv sublogs, specifically
v = {0, . . . , 10} containing {2369, 3480, 1216, 1131, 2225, 494, 1266, 883, 12,
5, 6} traces respectively. Then, we obtain the indicators for each sublog-activity pair.

Figure 6.1: EF (Lv )&EFc (Lv ): BPI Challenge 2012

Activities repeated often, i.e. those having high EF (Lv ) are candidates for RPA,
however, those having a high EFc (Lv ) should be investigated to understand the root
causes of the repeats.


activity # in top 3 mean(pct(EFc (Lv )))


W_Afhandelen leads 6 0.91
A_PARTLYSUBMITTED 5 0.65
W_Completeren aanvraag 5 0.96
W_Nabellen offertes 4 0.97
W_Beoordelen fraude 3 0.95
A_DECLINED 2 0.65
A_SUBMITTED 2 0.65
W_Nabellen incomplete dossiers 2 0.99
W_Valideren aanvraag 2 0.93
A_PREACCEPTED 1 0.65
O_SENT 1 0.82

Table 6.2: EF (Lv ) & EFc (Lv ) analysis

In order to determine the candidate tasks for RPA, we first look at Execution Fre-
quency, by obtaining the top 3 frequently executed activities per sublog. Then, to
obtain a global result we count the number of occurrences of the activities. To this
extent, W_Afhandelen leads is a good candidate for RPA since it is found 6 times
in the top 3 per cluster ranking of the Execution Frequency. Nevertheless, it has a
high value of Execution Frequency:case as per the mean percentile value obtained,
so it should be investigated.

As provided in the description, for subprocesses A and O only the event type com-
plete is present, indicating that a task is completed. For the W subprocess, however,
work items can be created in the queue (schedule event type), obtained by the re-
source (start) and released by the resource (complete). To this extent, values greater
than 0 are only obtained for the W subprocess. This is also the case for the Inverse
Stability indicator.

Figure 6.2: ET (Lv ) & ST −1 (Lv ): BPI Challenge 2012


activity # in top 3 mean(ET (Lv ))


W_Afhandelen leads 6 759
W_Completeren aanvraag 6 497
W_Beoordelen fraude 4 785
W_Valideren aanvraag 4 1026
W_Nabellen incomplete dossiers 3 654
W_Nabellen offertes 2 547

Table 6.3: Execution time: BPI Challenge 2012

We further analyze Execution Time, by obtaining the top 3 longest executed activ-
ities per sublog. Then, to obtain a global result we count the number of occurrences
of the activities and calculate the mean of the Execution Time. The resulting po-
tential candidates coincide, to a great extent, considering both Execution Frequency
and Execution Time. So, W_Afhandelen leads, W_Completeren aanvraag and
W_Beoordelen fraude would be possible candidates to look into.

activity    # in top 3 ST⁻¹(Lv)    mean(Inverse Stability)


W_Afhandelen leads 6 26264
W_Completeren aanvraag 5 14537
W_Beoordelen fraude 4 12511
W_Nabellen offertes 4 43293
W_Nabellen incomplete dossiers 3 26757
W_Valideren aanvraag 3 3546

Table 6.4: Inverse Stability: BPI Challenge 2012

Nevertheless, since our candidate tasks should be fixed and standardized the indi-
cator of the Inverse Stability, which is based on the variance of its Execution Time,
should not be high. Consequently, taking these results into consideration it seems
that W_Afhandelen leads is not a good candidate.

Figure 6.3: P F v(Lv ): BPI Challenge 2012


We see that some activities have a Prior/follow Variability estimate of 2, im-
plying that we are dealing with fixed procedures that constitute ideal candidates
for RPA. From the sole perspective of this indicator, A_PARTLYSUBMITTED,
A_PREACCEPTED and A_DECLINED would be potential ones to look into.
Taking into account the rest of the indicators yields, however, inconclusive re-
sults, since they have a high Execution Frequency. On the other hand, though they
do not raise alarming concerns in terms of Execution Frequency: case, they do not
present the means to build a strong case looking at Execution Time.

activity    # in top 3 PFv    mean(Prior/follow variability)


A_PARTLYSUBMITTED 10 1.95
A_PREACCEPTED 6 2.00
A_DECLINED 5 1.80
O_SENT 2 2.00
O_SENT_BACK 2 1.47
W_Beoordelen fraude 2 1.32
A_CANCELLED 1 1.48
A_REGISTERED 1 2.00
A_SUBMITTED 1 1.00
O_ACCEPTED 1 2.00
W_Afhandelen leads 1 1.25
W_Nabellen incomplete dossiers 1 2.00

Table 6.5: Prior/follow variability: BPI Challenge 2012

The proposed indicators in our study are calculated for each sublog activity pair,
and we have presented a characterisation of these activities averaged across each
sublog. Crucially, in order to determine the suitability of a process, as an automation
candidate, we need to present an aggregated characterisation for each sublog, based
on the assumption that each sublog could constitute a process.

sublog EF (Lv ) EFc (Lv ) ET (Lv ) ST −1 (Lv ) P F v(Lv )


0 4145 2.5 128 3720 0.65
1 448 0.4 27 1157 0.22
2 774 0.9 34 1572 0.39
3 845 0.8 65 2330 0.41
4 597 0.3 42 1138 0.26
5 719 1.5 80 4305 0.54
6 1593 1.3 47 2535 0.52
7 1775 2.8 187 6326 0.63
8 19 3.7 101 43 1.2
9 6 1.2 113 1036 0.3
10 3 0.5 8 11 0.26

Table 6.6: Indicators Mean: BPI Challenge 2012

Then, the selection methodology of the candidate processes is based on the identi-
fication of processes with:


• high volume of manual tasks and handling time: with an objective

min [rank+ (EF (Lv )) + 2*rank+ (ET (Lv ))]

we obtain sublog {7, 0, 3, 5, 9, 6, 8, 2, 4, 1, 10}


• small number of steps: with an objective

min [rank− (EFc (Lv ))]

we obtain sublog {4, 1, 10, 3, 2, 9, 6, 5, 0, 7, 8}


• fixed procedures that are standardized and mature: with an objective

min [mean (rank+ (P F v(Lv )), rank− (ST −1 (Lv )))]

we obtain sublog {8, 10, 0, 9, 3, 6, 2, 7, 5, 4, 1}


Combining the three criteria with an equal weight we obtain sublog {9, 3, 0, 10, 8, 7,
4, 6, 2, 5, 1}. Considering that the goal is to determine the suitability of a process as
an automation candidate, the devised methodology merely provides a means of
prioritizing when it comes to directing the efforts of undertaking an RPA project.
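A minimal sketch of this rank-based aggregation, assuming the per-sublog indicator means are available in a pandas DataFrame with the (illustrative) columns EF, EFc, ET, ST_inv and PFv, as in Table 6.6; rank⁺ ranks higher values first, rank⁻ ranks lower values first.

```python
import pandas as pd

def rank_sublogs(df):
    """Combine the three selection criteria described above into a single ranking.
    df: one row per sublog with the per-sublog indicator means."""
    # rank+: higher indicator value gets a better (lower) rank; rank-: the opposite
    volume = df["EF"].rank(ascending=False) + 2 * df["ET"].rank(ascending=False)
    steps = df["EFc"].rank(ascending=True)
    fixed = (df["PFv"].rank(ascending=False) + df["ST_inv"].rank(ascending=True)) / 2
    # equal weights for the three criteria: the lowest combined score ranks first
    return (volume + steps + fixed).sort_values()
```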

6.2. Artificial UI logs

As presented in Chapter 3, Bosco et al. [18] propose a method for the discovery of
routines that are fully deterministic, and hence, automatable using RPA tools. Given
a UI log, the goal of their approach is to discover automatable routines, i.e. sets of
routine traces having the parameters’ values of each of their actions derivable from
previous actions parameters’ values [18]. They conducted an experiment to assess
the ability of their approach to correctly discover all the automatable (sub)routines
recorded in a set of UI logs. To this end, they generated nine artificial UI logs, each
log containing a different number of automatable (sub)routines of varying complex-
ity. These UI logs were generated by simulating nine CPNs designed with the CPN
Tool [51]. The first six CPNs are simple and represent common real-life scenarios,
capturing clear routines with a specific goal. The last three CPNs have the highest
complexity and the routines they represent are not easily interpretable. The only
log where their approach failed to discover two routines is log L3, which contains
loops. Consider, for example, the L1 UI log which contains 100 traces, just one trace
variant, each containing 14 events.

σ0 = ⟨openFile, goToUrl, insertValue², clickButton, waitResponse, clickButton²,
     insertValue, copy, paste, copy, paste, clickButton⟩

Following our presented methodology, we apply trace clustering and obtain only
one sublog which coincides with the fact that L1 contains one routine as presented
in Table 3 [18]. As presented in Table 3 [18] the L1 UI log contains 1400 actions
coinciding with the sum of the Execution Frequency indicator (see table 6.7), while
as presented in Table 4 [18] it contains 13 distinct automatable actions. As per [18],


all the actions in L1 are automatable except the opening file one, since the input
file is chosen randomly.

activity EF (L0 ) EFc (L0 ) ET (L0 ) ST −1 (L0 ) P F v(L0 )


openFile 100 1.0 0 0 1.00
goToUrl 100 1.0 0 0 2.00
insertValue 300 3.0 0 0 0.50
clickButton 400 4.0 0 0 0.36
waitResponse 100 1.0 0 0 2.00
copy 200 2.0 0 0 1.25
paste 200 2.0 0 0 1.25

Table 6.7: Indicators of the artificially generated UI logs L1

However, our indicators are defined on the activity level so in Table 6.7 there are 7
activities. An action, as per the definition 1 of [18], not only contains the activity
(referred to as action type (e.g. click button)) but also the action parameters, the
values assigned to the action parameters and the function matching each action
parameter to its value. However, two actions that have the same set of parameters,
but different assigned values, are still considered equal [18]. To this extent, even if the
click button has an estimated low Prior/follow variability indicator, suggesting that
it might not be a good candidate for RPA, instead, the trace σ0 is deterministic and
constitutes an automatable routine. Indeed, our methodology takes only control-
flow into consideration and neglects the discovery of data conditions, since it does
not relate post-conditions with pre-conditions [65]. As put forward by Leno et al.
[66] in the context of RPA, the goal is to discover executable specifications that
capture the mapping between the outputs and the inputs of the actions performed
during a routine. To this end, our proposed methodology presents a shortcoming.

UI log    Routines    AA %    Processes    min(PFv(Lv)/EFc(Lv))    max(PFv(Lv)/EFc(Lv))

3         6           94.4    128          0.4                     0.8
1         1           92.9    27           0.6                     0.6
2         2           84.2    34           0.6                     0.7
6         2           69.2    65           0.5                     1.1
9         18          68.6    42           0.3                     0.8
8         8           60      80           0.5                     1
4         1           47.1    47           0.6                     0.6
7         7           25      187          0.5                     1
5         12          16.7    8            0.8                     1.1

Table 6.8: Results comparison

Our presented methodology is repeated on each UI log with the steps of trace clus-
tering to obtain sublogs on which the indicators are calculated. When it comes to
establishing an objective and generalizable methodology to evaluate the proposed
quantifiable decision support framework of process task selection for RPA there are
some limitations. One issue lies in the fact that the indicators proposed herewith

Chapter 6 Maria Gkotsopoulou 41


Unleashing Robotic Process Automation Through Process Mining

are of relative scale. In an effort to build an overall, and as much as possible,
absolute score, we focus on the indicators of Prior/follow variability and Execution
Frequency:case. The range of Prior/follow variability is [0, 2] and the higher it is,
the less variable the overall process, while a high Execution Frequency:case
signals possible anomalies in the process, given the high number of
repeats. To this extent, we define the overall score as the fraction of the two indica-
tors, as the characterization of the automation potential of the process represented
by each sublog. Thus, this metric represents the basis of the quantifiable decision
support framework in an attempt to refine the methodology presented in Section
6.1. To evaluate this proposal we use the automation potential of the UI logs as
reported by Bosco et al. [18]. In Table 6.8, the UI logs are ordered in descending
order as per the percentage of distinct automatable actions (AA%). Comparing the
minimum and maximum value of the defined metric to the AA% we conclude that
this proposed metric fails to capture the automation potential of the processes.
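For completeness, a minimal sketch of how this ratio-based score could be computed, assuming the sublog-activity indicators are stored in a pandas DataFrame with the (illustrative) columns sublog, PFv and EFc; aggregating per sublog by the mean over activities mirrors Table 6.6 and is an assumption.

```python
import pandas as pd

def automation_score(indicators):
    """Per-sublog PFv/EFc ratio used as the overall score in Table 6.8.
    indicators: one row per (sublog, activity) pair."""
    per_sublog = indicators.groupby("sublog")[["PFv", "EFc"]].mean()
    score = per_sublog["PFv"] / per_sublog["EFc"]
    # a UI log is summarised by the range of its sublog scores, as reported above
    return score.min(), score.max()
```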



7 | Conclusion

While RPA is gaining traction in IT consulting, mature research on RPA is still in


its infancy. In their systematic mapping study aimed at analyzing
the current state of the art of RPA, Enriquez et al. [35] identified 54 primary studies over a
span of 8 years (2012-2019). As Wanner et al. [101] put forth, the success of
RPA projects depends significantly on the availability of a standardized method to
quantify and compare the automation potential of process tasks. Even if some high-
level criteria have been proposed to measure the suitability of processes for RPA,
there is still no agreed upon standardized decision support system to follow in the
Analysis phase of the RPA project lifecycle. Our goal was to develop a quantifiable
methodology to support companies for process selection in RPA projects. Based on a
literature analysis, we identified characteristics of processes that classify as potential
automation candidates. These characteristics were then translated into quantifiable
metrics operationalizing techniques from Process Mining. The objective is that of
directing the efforts of RPA towards a process and justifying the time and economic
investment associated to the undertaking of an RPA project. To this extent, our
approach departs from an event log, and in order to cope with high variety of
behavior that is captured in certain event logs, it is followed by a log preprocessing
step that includes trace clustering. The sublogs obtained as the trace clustering
result are assumed to constitute processes.

We followed a two step evaluation procedure. First, the BPI Challenge 2012 dataset
[97] was used to calculate the indicators on each sublog activity pair and ideate an
overall score based on an aggregating function. However, the devised methodology
does not determine the suitability of a process as an automation candidate, but it
merely provides a means of prioritizing when it comes to directing the efforts of
undertaking an RPA project. Another drawback identified is that we analyzed pro-
cess tasks independently of each other and neglected dependencies, where automation
in one process might influence other related process tasks. On the other hand, in
the second step of our evaluation procedure we used artificial UI logs included in
Bosco et al. [18] as a reference. In order to establish an objective and generalizable
methodology to evaluate the proposed quantifiable decision support framework of
process task selection for RPA, we built a metric using the indicators of Prior/follow
variability and Execution Frequency:case. The choice is made in an effort to build
an overall, and as much as possible, absolute score, that represents the basis of
the quantifiable decision support framework, in an attempt to refine the presented
methodology in the previous step of the evaluation procedure. However, comparing
the defined metric to the percentage of distinct automatable actions presented in
Bosco et al. [18], we conclude that this proposed metric fails to capture the automa-
tion potential of the processes. Moreover, our indicators are calculated for each
sublog activity while Bosco et al. [18] focuses on actions, thus making apparent the
shortcoming of our methodology, as it takes only control-flow into consideration and
neglects the discovery of data conditions, since it does not relate post-conditions
with pre-conditions [65]. As put forward by Leno et al. [66] in the context of RPA,
the goal is to discover executable specifications that capture the mapping between
the outputs and the inputs of the actions performed during a routine.



Bibliography

[1] W. M. P. van der Aalst, “Process mining - discovery, conformance and en-
hancement of business processes”, 2011.
[2] ——, “Decomposing petri nets for process mining: A generic approach”, Dis-
tributed and Parallel Databases, vol. 31, pp. 471–507, 2013.
[3] W. M. P. van der Aalst, M. Bichler, and A. Heinzl, “Robotic process automa-
tion”, Business & Information Systems Engineering, vol. 60, no. 4, pp. 269–
272, 2018.
[4] W. M. P. van der Aalst, A. Bolt, and S. J. van Zelst, “Rapidprom: Mine your
processes and not just your data”, ArXiv, vol. abs/1703.03740, 2017.
[5] W. M. P. van der Aalst, A. J. M. M. Weijters, and L. Maruster, “Workflow
mining: Discovering process models from event logs”, IEEE Transactions on
Knowledge and Data Engineering, vol. 16, pp. 1128–1142, 2004.
[6] W. van der Aalst, “Process mining: Data science in action”, 2016.
[7] S. Agostinelli, A. Marrella, and M. Mecella, “Towards intelligent robotic pro-
cess automation for bpmers”, ArXiv, vol. abs/2001.00804, 2020.
[8] S. Aguirre and A. Rodríguez, “Automation of a business process using robotic
process automation (rpa): A case study”, in WEA, 2017.
[9] S. Anagnoste et al., “Setting up a robotic process automation center of ex-
cellence”, Management Dynamics in the Knowledge Economy, vol. 6, no. 2,
pp. 307–332, 2018.
[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: Ordering
points to identify the clustering structure”, in SIGMOD ’99, 1999.
[11] A. Asatiani and E. Penttinen, “Turning robotic process automation into com-
mercial success – case opuscapita”, Journal of Information Technology Teach-
ing Cases, vol. 6, pp. 67–74, 2016.
[12] A. Augusto, R. Conforti, M. Dumas, M. L. Rosa, F. M. Maggi, A. Marrella,
M. Mecella, and A. Soo, “Automated discovery of process models from event
logs: Review and benchmark”, IEEE Transactions on Knowledge and Data
Engineering, vol. 31, pp. 686–705, 2017.
[13] A. Augusto, M. Dumas, and M. L. Rosa, “Dataset for testing the Discovery
of Automatable Routines from User Interaction Logs”, Mar. 2019. doi: 10.
6084/m9.figshare.7850918.v1. [Online]. Available: https://figshare.
com/articles/Dataset_for_testing_the_Discovery_of_Automatable_
Routines_from_User_Interaction_Logs/7850918.
[14] R. Beetz and Y. Riedl, “Robotic process automation: Developing a multi-
criteria evaluation model for the selection of automatable business processes”,
in AMCIS, 2019.
[15] A. Berti, S. J. van Zelst, and W. M. P. van der Aalst, “Pm4py web services:
Easy development, integration and deployment of process mining features in
any application stack”, in BPM, 2019.
[16] ——, “Process mining for python (pm4py): Bridging the gap between process-
and data science”, ArXiv, vol. abs/1905.06169, 2019.
[17] M. Boltenhagen, T. Chatain, and J. Carmona, “Generalized alignment-based
trace clustering of process behavior”, in Petri Nets, 2019.


[18] A. Bosco, A. Augusto, M. Dumas, M. L. Rosa, and G. Fortino, “Discovering
automatable routines from user interaction logs”, in BPM Forum, 2019.
[19] R. P. J. C. Bose and W. M. P. van der Aalst, “Abstractions in process mining:
A taxonomy of patterns”, in BPM, 2009.
[20] ——, “Context aware trace clustering: Towards improving process mining
results”, in SDM, 2009.
[21] ——, “Trace clustering based on conserved patterns: Towards achieving bet-
ter process models”, in Business Process Management Workshops, 2009.
[22] J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst, “Quality
dimensions in process discovery: The importance of fitness, precision, gener-
alization and simplicity”, Int. J. Cooperative Inf. Syst., vol. 23, 2014.
[23] A. Burattin and A. Sperduti, “Heuristics miner for time intervals”, in ESANN,
2010.
[24] B. Bygstad, “Generative innovation: A comparison of lightweight and heavy-
weight it”, Journal of Information Technology, vol. 32, no. 2, pp. 180–193,
2017.
[25] R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-based clustering
based on hierarchical density estimates”, in PAKDD, 2013.
[26] Y. Cao, “Attribute-Driven Hierarchical Clustering of Event Data in Process
Mining”, Master’s thesis, RWTH Aachen University, Germany, 2020.
[27] J. Carmona and M. Solé, “Pmlab: An scripting environment for process min-
ing”, in BPM, 2014.
[28] S. Celonis, Blog post on rpa and process mining, https://www.celonis.
com/blog/how-to-get-the-most-out-of-robotic-process-automation,
2017.
[29] W. H. E. Day and H. Edelsbrunner, “Efficient algorithms for agglomerative
hierarchical clustering methods”, Journal of Classification, vol. 1, pp. 7–24,
1984.
[30] L. Delchambre, “Weighted principal component analysis: A weighted covari-
ance eigendecomposition approach”, Monthly Notices of the Royal Astronom-
ical Society, vol. 446, no. 4, pp. 3545–3555, Dec. 2014, issn: 0035-8711. doi:
10.1093/mnras/stu2219. [Online]. Available: http://dx.doi.org/10.
1093/mnras/stu2219.
[31] C. Di Bisceglie, E. Ramezani Taghiabadi, and H. Aklecha, Data-driven in-
sights to robotic process automation with process mining, 2019.
[32] B. F. van Dongen, A. K. A. de Medeiros, H. M. W. Verbeek, A. J. M. M.
Weijters, and W. M. P. van der Aalst, “The prom framework: A new era in
process mining tool support”, in ICATPN, 2005.
[33] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley
& Sons, 2012.
[34] M. Dumas, M. L. Rosa, J. Mendling, and H. A. Reijers, “Fundamentals of
business process management”, in Springer Berlin Heidelberg, 2018.
[35] J. G. Enriquez, A. Jiménez-Ramírez, F. J. Domínguez-Mayo, and J. A. García-
García, “Robotic process automation: A scientific and industrial systematic
mapping study”, IEEE Access, vol. 8, pp. 39 113–39 129, 2020.
[36] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm
for discovering clusters in large spatial databases with noise”, in KDD, 1996.


[37] D. R. Ferreira, M. Zacarias, M. Malheiros, and P. Ferreira, “Approaching
process mining with sequence clustering: Experiments and findings”, in BPM,
2007.
[38] H. P. Fung, “Criteria, use cases and effects of information technology process
automation (itpa)”, Advances in Robotics and Automation, vol. 3, pp. 1–10,
Jul. 2014.
[39] J. Gao, S. van Zelst, X. Lu, and W. Aalst, “Automated robotic process au-
tomation: A self-learning approach”, Oct. 2019, isbn: 978-3-030-33245-7. doi:
10.1007/978-3-030-33246-4_6.
[40] J. Geyer-Klingeberg, J. Nakladal, F. Baldauf, and F. Veit, “Process mining
and robotic process automation: A perfect match”, in BPM, 2018.
[41] S. Goedertier, J. D. Weerdt, D. Martens, J. Vanthienen, and B. Baesens,
“Process discovery in event logs: An application in the telecom industry”,
Appl. Soft Comput., vol. 11, pp. 1697–1710, 2011.
[42] E. Gonzalez Lopez de Murillas, “Process mining on databases: Extracting
event data from real-life data sources”, 2019.
[43] G. Greco, A. Guzzo, L. Pontieri, and D. Saccá, “Discovering expressive pro-
cess models by clustering log traces”, IEEE Transactions on Knowledge and
Data Engineering, vol. 18, pp. 1010–1027, 2006.
[44] X. W. Group et al., “Ieee standard for extensible event stream (xes) for
achieving interoperability in event logs and event streams”, IEEE Std 1849,
pp. 1–50, 2016.
[45] C. W. Günther and W. M. P. van der Aalst, “Fuzzy mining - adaptive process
simplification based on multi-perspective metrics”, in BPM, 2007.
[46] C. C. Günther, “Process mining in flexible environments”, 2009.
[47] P. R. Halmos, “Naive set theory”, 1960.
[48] R. W. Hamming, “Error detecting and error correcting codes”, Bell System
Technical Journal, vol. 29, pp. 147–160, 1950.
[49] B. Hompes, J. J. Buijs, V. der Aalst, P. M. Dixit, and J. Buurman, “Discov-
ering deviating cases and process variants using trace clustering”, 2015.
[50] G. Janssenswillen, B. Depaire, M. Swennen, M. Jans, and K. Vanhoof, “Bu-
par: Enabling reproducible business process analysis”, Knowl. Based Syst.,
vol. 163, pp. 927–930, 2019.
[51] K. Jensen, L. M. Kristensen, and L. Wells, “Coloured petri nets and cpn tools
for modelling and validation of concurrent systems”, International Journal on
Software Tools for Technology Transfer, vol. 9, pp. 213–254, 2007.
[52] H. K. Kanagala and V. V. J. R. Krishnaiah, “A comparative study of k-means,
dbscan and optics”, 2016 International Conference on Computer Communi-
cation and Informatics (ICCCI), pp. 1–6, 2016.
[53] M. Kerremans, Gartner market guide for process mining, https : / / www .
gartner.com/doc/3870291/market-guide-process-mining, 2018.
[54] M. Kirchmer, “Robotic process automation-pragmatic solution or dangerous
illusion”, Business Transformation & Operational Excellence World Summit
(BTOES), 2017.
[55] M. Lacity and L. Willcocks, “Robotic process automation: Mature capabilities
in the energy sector (the outsourcing unit research paper series paper 15/06)”,
Retrieved from https://irpaai.com/robotic-process-automation-mature-capabilities-in-the-energysector, 2015.


[56] M. Lacity and L. Willcocks, “What knowledge workers stand to gain from
automation”, Harvard Business Review, vol. 19, no. 6, 2015.
[57] M. C. Lacity and L. P. Willcocks, “A new approach to automating services”,
MIT Sloan Management Review, 2017.
[58] M. Lacity and L. P. Willcocks, “Innovating in service: The role and manage-
ment of automation”, in Dynamic Innovation in Outsourcing, Springer, 2018,
pp. 269–325.
[59] ——, “Robotic process automation at telefónica o2”, MIS Quarterly Execu-
tive, vol. 15, 2015.
[60] G. T. Lakshmanan, S. Rozsnyai, and F. Wang, “Investigating clinical care
pathways correlated with outcomes”, in BPM, Berlin, Heidelberg: Springer
Berlin Heidelberg, 2013, pp. 323–338.
[61] S. Lazarus, Achieving a successful robotic process automation implementation: A case study of vodafone and celonis, https://spendmatters.com/2018/06/07/achieving-a-successful-robotic-process-automation-implementation-a-case-study-of-vodafone-and-celonis/, 2018.
[62] M. Leemans, W. M. P. van der Aalst, and M. van den Brand, “Recursion
aware modeling and discovery for hierarchical software event log analysis”,
2018 IEEE 25th International Conference on Software Analysis, Evolution
and Reengineering (SANER), pp. 185–196, 2018.
[63] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Discovering
block-structured process models from event logs - a constructive approach”,
in Petri Nets, 2013.
[64] ——, “Using life cycle information in process discovery”, in Business Process
Management Workshops, 2015.
[65] V. Leno, M. Dumas, F. Maggi, and M. La Rosa, “Multi-perspective process
model discovery for robotic process automation”, in CEUR Workshop Pro-
ceedings, vol. 2114, 2018, pp. 37–45.
[66] V. Leno, M. Dumas, M. L. Rosa, F. M. Maggi, and A. Polyvyanyy, “Au-
tomated discovery of data transformations for robotic process automation”,
ArXiv, vol. abs/2001.01007, 2020.
[67] V. Leno, A. Polyvyanyy, M. L. Rosa, M. Dumas, and F. M. Maggi, “Action
logger: Enabling process mining for robotic process automation”, in BPM,
2019.
[68] M. de Leoni, W. M. P. van der Aalst, and M. Dees, “A general process mining
framework for correlating, predicting and clustering dynamic behavior based
on event logs”, Inf. Syst., vol. 56, pp. 235–257, 2016.
[69] H. Leopold, H. van der Aa, and H. A. Reijers, “Identifying candidate tasks
for robotic process automation in textual process descriptions”, in BPMDS /
EMMSAD@CAiSE, 2018.
[70] C. Li, M. Reichert, and A. Wombacher, “Mining process variants: Goals and
issues”, 2008 IEEE International Conference on Services Computing, vol. 2,
pp. 573–576, 2008.
[71] J. Lindström, P. Kyösti, and J. Delsing, European roadmap for industrial
process automation, 2018.
[72] P. Liu, D. Zhou, and N. Wu, “Vdbscan: Varied density based spatial clus-
tering of applications with noise”, 2007 International Conference on Service
Systems and Service Management, pp. 1–4, 2007.

[73] X. X. Lu, “Using behavioral context in process mining: Exploration, preprocessing and analysis of event data”, 2018.
[74] J. B. MacQueen, “Some methods for classification and analysis of multivariate
observations”, 1967.
[75] J. Nakatumba and W. M. P. van der Aalst, “Analyzing resource behavior
using process mining”, in Business Process Management Workshops, 2009.
[76] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn:
Machine learning in python”, Journal of Machine Learning Research, vol. 12,
no. 85, pp. 2825–2830, 2011. [Online]. Available: http://jmlr.org/papers/
v12/pedregosa11a.html.
[77] E. Penttinen, H. Kasslin, and A. Asatiani, “How to choose between robotic
process automation and back-end system automation?”, in ECIS, 2018.
[78] A. Polyvyanyy, J. Vanhatalo, and H. Völzer, “Simplified computation and
generalization of the refined process structure tree”, in WS-FM, 2010.
[79] A. J. Ramirez, H. A. Reijers, I. Barba, and C. D. Valle, “A method to improve
the early stages of the robotic process automation lifecycle”, in CAiSE, 2019.
[80] J. Reback, W. McKinney, jbrockmendel, J. V. den Bossche, T. Augspurger,
P. Cloud, gfyoung, Sinhrks, A. Klein, M. Roeschke, S. Hawkins, J. Tratner,
C. She, W. Ayd, T. Petersen, M. Garcia, J. Schendel, A. Hayden, MomIsBest-
Friend, V. Jancauskas, P. Battiston, S. Seabold, chris-b1, h-vetinari, S. Hoyer,
W. Overmeire, alimcmaster1, K. Dong, C. Whelan, and M. Mehyar, Pandas-
dev/pandas: Pandas 1.0.3, version v1.0.3, Mar. 2020. doi: 10.5281/zenodo.
3715232. [Online]. Available: https://doi.org/10.5281/zenodo.3715232.
[81] M. Reichert and B. Weber, “Enabling flexibility in process-aware information
systems: Challenges, methods, technologies”, 2012.
[82] W. Reinhardt, B. Schmidt, P. Sloep, and H. Drachsler, “Knowledge worker
roles and actions—results of two empirical studies”, Knowledge and Process
Management, vol. 18, no. 3, pp. 150–174, 2011.
[83] D. Reissner, R. Conforti, M. Dumas, M. L. Rosa, and A. Armas-Cervantes,
“Scalable conformance checking of business processes”, in OTM Conferences,
2017.
[84] A. Rogge-Solti, A. Senderovich, M. Weidlich, J. Mendling, and A. Gal, “In
log and model we trust? a generalized conformance checking framework”, in
BPM, 2016.
[85] M. L. Rosa, H. A. Reijers, W. M. P. van der Aalst, R. M. Dijkman, J.
Mendling, M. Dumas, and L. García-Bañuelos, “Apromore: An advanced pro-
cess model repository”, Expert Syst. Appl., vol. 38, pp. 7029–7040, 2011.
[86] A. W. Scheer, F. Abolhassan, W. Jost, and M. Kirchmer, “Business process
automation”, 2004.
[87] M. Schmitz, C. Dietze, and C. Czarnecki, “Enabling digital transformation
through robotic process automation at deutsche telekom”, in Digitalization
Cases, Springer, 2019, pp. 15–33.
[88] M. Song, C. W. Günther, and W. M. P. van der Aalst, “Trace clustering in
process mining”, in Business Process Management Workshops, 2008.
[89] Y. Sun, B. Bauer, and M. Weidlich, “Compound trace clustering to generate
accurate and simple sub-process models”, in ICSOC, 2017.

[90] T. Sztyler, J. Carmona, J. Völker, and H. Stuckenschmidt, “Self-tracking reloaded: Applying process mining to personalized health care from labeled sensor data”, Trans. Petri Nets Other Model. Concurr., vol. 11, pp. 160–180, 2016.
[91] P.-N. Tan, M. S. Steinbach, and V. Kumar, “Introduction to data mining (first edition)”, 2005.
[92] N. Tax, N. Sidorova, R. Haakma, and W. M. P. van der Aalst, “Mining local
process models”, J. Innov. Digit. Ecosyst., vol. 3, pp. 183–196, 2016.
[93] C. Tornbohm, “Market guide for robotic process automation software”, Gartner.com, 2017.
[94] P. Toth, “Dynamic programming algorithms for the zero-one knapsack problem”, Computing, vol. 25, pp. 29–45, 1980.
[95] W. M. Van der Aalst, B. F. van Dongen, C. W. Günther, A. Rozinat, E. Ver-
beek, and T. Weijters, “Prom: The process mining toolkit.”, BPM (Demos),
vol. 489, no. 31, p. 2, 2009.
[96] W. Van Der Aalst, A. Adriansyah, A. K. A. De Medeiros, F. Arcieri, T.
Baier, T. Blickle, J. C. Bose, P. Van Den Brand, R. Brandtjen, J. Buijs,
et al., “Process mining manifesto”, in International Conference on Business
Process Management, Springer, 2011, pp. 169–194.
[97] B. F. van Dongen, BPI challenge 2012, https://data.4tu.nl/repository/uuid:3926db30-f712-4394-aebc-75976070e91f, 2012. doi: 10.4121/UUID:3926DB30-F712-4394-AEBC-75976070E91F.
[98] G. M. Veiga and D. R. Ferreira, “Understanding spaghetti models with se-
quence clustering for prom”, in Business Process Management Workshops,
2009.
[99] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The numpy array: A
structure for efficient numerical computation”, Computing in Science & En-
gineering, vol. 13, pp. 22–30, 2011.
[100] Y. Wang, Y. Gu, and J. Shun, “Theoretically-efficient and practical parallel
dbscan”, ArXiv, vol. abs/1912.06255, 2019.
[101] J. Wanner, A. Hofmann, M. Fischer, F. Imgrund, C. Janiesch, and J. Geyer-
Klingeberg, “Process selection in rpa projects - towards a quantifiable method
of decision making”, in ICIS, 2019.
[102] B. Weber, “Supporting process mining workflows with rapidprom”, 2014.
[103] J. D. Weerdt, S. K. L. M. vanden Broucke, J. Vanthienen, and B. Baesens,
“Active trace clustering for improved process discovery”, IEEE Transactions
on Knowledge and Data Engineering, vol. 25, pp. 2708–2720, 2013.
[104] A. J. M. M. Weijters and W. M. P. van der Aalst, “Rediscovering workflow
models from event-based data using little thumb”, Integr. Comput. Aided
Eng., vol. 10, pp. 151–162, 2003.
[105] A. J. M. M. Weijters and J. T. S. Ribeiro, “Flexible heuristics miner (fhm)”,
2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM),
pp. 310–317, 2011.
[106] L. P. Willcocks, M. Lacity, and A. Craig, “Robotic process automation at
xchanging”, LSE Research Online Documents on Economics, 2015.
[107] ——, “The it function and robotic process automation”, 2015.

[108] L. Willcocks, M. Lacity, and A. Craig, “Robotic process automation: Strategic transformation lever for global business services?”, Journal of Information Technology Teaching Cases, vol. 7, no. 1, pp. 17–28, 2017.
[109] J. Xu and J. Liu, “A profile clustering based event logs repairing approach for process mining”, IEEE Access, vol. 7, pp. 17872–17881, 2019.
[110] Y. Yu, X. Li, H. Liu, J. Mei, N. Mukhi, V. Ishakian, G. T. Xie, G. T. Lak-
shmanan, and M. Marin, “Case analytics workbench: Platform for hybrid
process model creation and evolution”, in BPM, 2015.
[111] S. J. van Zelst, “Process mining with streaming data”, 2019.



A | Appendix

A.1. Code implementation


'''
Code included in a thesis presented for the degree of
Master in Innovation and Research in Informatics - Data Science
Facultat d’Informàtica de Barcelona (FIB)
Universitat Politècnica De Catalunya (UPC)

Indicators & Plots to support the identification of tasks and/or process properties
that are eligible to be automated using RPA.

Parameters
-----------
log
    Trace log
parameters
    Parameters of the log representation algorithm:
    str_ev_attr -> string event attribute to consider in the feature representation (a single attribute; required)
    str_tr_attr -> string trace attribute to consider in the feature representation (a single additional attribute; optional)

Returns
-----------
indicator plots
indicators
    A list containing, per activity, all indicators in each sublog

Folder structure
-----------
-home
--code: where this code is found
--data: where the XES files are found
--plots: where the resulting plots are saved

The code can be run from the command line, after navigating to the code folder, by executing:
`python pm4RPA.py financial_log.xes concept:name AMOUNT_REQ`
'''
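
# Note: the listing below assumes that the third-party packages it imports (pandas, numpy,
# matplotlib, pm4py and the pm4pyclustering extension) are available in the Python
# environment; no particular versions or installation commands are implied here.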

import os, sys, random, re

import pandas as pd
import numpy as np
# used in generatePlots()
import matplotlib.pyplot as plt
# Importing IEEE XES files; used in main()
from pm4py.objects.log.importer.xes import factory as xes_import_factory
# Apply PCA + DBSCAN clustering after creating a representation of the log; used in clusterlog()
from pm4pyclustering.algo.other.clustering import factory as clusterer
# Retrieve the number of occurrences of the activities; used in execFreq()
from pm4py.algo.filtering.log.attributes import attributes_filter
# Retrieve the variants and their number of occurrences; used in execFreqCase()
from pm4py.statistics.traces.log import case_statistics
# Converts a log to interval format (an event has two timestamps) from lifecycle format
# (an event has only a timestamp and a transition lifecycle); used in execTime()
from pm4py.objects.log.util import interval_lifecycle
# Get the paths (pairs of activities) of the log along with their count; used in priorFollowVar()
from pm4py.algo.filtering.log.paths import paths_filter

def clusterlog(log, clusterParams):
    print("Apply PCA + DBSCAN clustering from log representation obtained using ", clusterParams)
    random.seed(42)
    clusters = clusterer.apply(log, parameters=clusterParams)
    for i in range(len(clusters)):
        print("sublog", i, " has ", len(clusters[i]), " traces")
    return(clusters)


def execFreq(clusters, activityKey):
    EF = []
    for i in range(len(clusters)):
        activities_count = attributes_filter.get_attribute_values(clusters[i], attribute_key=activityKey)
        EF.append(activities_count)
    EF_df = pd.DataFrame.from_dict(EF, orient='columns', dtype=None).T
    EF_df = EF_df.reset_index().melt(id_vars='index', var_name='cluster', value_name='activityCount')
    EF_df = EF_df.fillna(0)
    EF_df = EF_df.rename(columns={'index': 'activity'})
    ############## Execution Frequency: case ############
    EF_EFc = execFreqCase(clusters, EF_df)
    return(EF_EFc)

def execFreqCase(clusters, EF_df):
    activityL = EF_df['activity'].unique().tolist()
    variant_EF_A = []
    for clusteri in range(len(clusters)):
        # per cluster get the variants along with their count
        variants_count = case_statistics.get_variant_statistics(clusters[clusteri])
        for variant in range(len(variants_count)):
            # per variant count the number of occurrences of each activity
            for key, value in variants_count[variant].items():
                if key == "variant":
                    activityVariant = []
                    for i in range(len(activityL)):
                        # escape the activity name so it is matched literally
                        # (regex metacharacters in activity names would otherwise break the count)
                        EF = len(re.findall(re.escape(activityL[i]), value))
                        if EF > 0:
                            activityVariant.append({'cluster': clusteri, 'variant': variant,
                                                    'activity': activityL[i], 'EF': EF})
                else:
                    # also include the count of this variant
                    for item in activityVariant:
                        item.update({"count": value})
                    variant_EF_A.extend(activityVariant)
    variant_EF_A_df = pd.DataFrame.from_dict(variant_EF_A, orient='columns', dtype=None)
    variant_EF_A_df['EFsum'] = variant_EF_A_df.apply(lambda x: x['EF']*x['count'], axis=1)
    EFc_df = variant_EF_A_df.groupby(by=['cluster', 'activity']).agg({'EFsum': "sum", 'count': "sum"}).reset_index()
    EFc_df['EFc'] = EFc_df.apply(lambda x: x['EFsum']/x['count'], axis=1)
    EF_EFc_df = pd.merge(left=EF_df,
                         right=EFc_df.drop(['EFsum', 'count'], axis=1),
                         right_on=['cluster', 'activity'],
                         left_on=['cluster', 'activity'], how='left')
    EF_EFc_df = EF_EFc_df.rename(columns={'activityCount': 'EF'})
    EF_EFc_df = EF_EFc_df.fillna(0)
    return(EF_EFc_df)
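
# Illustrative arithmetic for the per-case execution frequency (EFc) computed above
# (hypothetical numbers, not taken from a real log): if an activity occurs twice in a
# variant covering 10 cases and once in a variant covering 30 cases, then
# EFsum = 2*10 + 1*30 = 50 and count = 10 + 30 = 40, giving EFc = 50/40 = 1.25
# occurrences per case within that sublog.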

def execTime(clusters, EF_EFc):
    ET = []
    for clusteri in range(len(clusters)):
        enriched_log = interval_lifecycle.assign_lead_cycle_time(clusters[clusteri])
        for i in range(len(enriched_log)):
            for j in range(len(enriched_log[i])):
                activity = enriched_log[i][j]["concept:name"]
                duration = enriched_log[i][j]['@@duration']
                ET.append({'cluster': clusteri, 'activity': activity, 'duration': duration})
    ET_df = pd.DataFrame.from_dict(ET, orient='columns', dtype=None)
    ET_df_m = ET_df.groupby(['cluster', 'activity'])['duration'].mean().reset_index()
    ET_df_m = ET_df_m.rename(columns={'duration': 'ET'})
    # per cluster mean activity duration (ET) & activity event duration
    ET_df = pd.merge(left=ET_df, right=ET_df_m, right_on=['cluster', 'activity'],
                     left_on=['cluster', 'activity'], how='inner')
    ############## Inverse Stability ############
    EF_EFc_ET_ST = invStability(EF_EFc, ET_df)
    return(EF_EFc_ET_ST)

def invStability(EF_EFc, ET_df):
    # calculate the sum of squared differences between @@duration and ET
    ET_ssd = ET_df.groupby(['cluster', 'activity']).apply(lambda x: (x['duration']-x['ET'])**2).sum(level=[0, 1]).reset_index()
    ET_ssd = ET_ssd.rename(columns={0: 'ssd'})
    # include ssd in EF_EFc, which contains all activities as well as the EF measure
    EF_EFc_ET_ssd = pd.merge(left=EF_EFc, right=ET_ssd, right_on=['cluster', 'activity'],
                             left_on=['cluster', 'activity'], how='left')
    EF_EFc_ET_ssd = EF_EFc_ET_ssd.rename(columns={0: 'ssd'})
    ET_df = ET_df.drop(['duration'], axis=1).groupby(['cluster', 'activity', 'ET']).size().reset_index()
    EF_EFc_ET_ssd = pd.merge(left=EF_EFc_ET_ssd, right=ET_df.drop([0], axis=1), right_on=['cluster', 'activity'],
                             left_on=['cluster', 'activity'], how='left')
    # for activities without @@duration the ssd could not be calculated, so fill with 0
    EF_EFc_ET_ssd = EF_EFc_ET_ssd.fillna(0)
    # calculate the inverse stability measure; the higher the number the worse
    EF_EFc_ET_ssd['ST'] = EF_EFc_ET_ssd.apply(lambda x: 0 if x['ssd'] == 0 else x['ssd']/(x['EF']*x['ET']), axis=1)
    EF_EFc_ET_ST = EF_EFc_ET_ssd.drop(['ssd'], axis=1)
    return(EF_EFc_ET_ST)
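
# Illustrative arithmetic for the inverse stability (ST) computed above (hypothetical
# numbers, not taken from a real log): for an activity with EF = 4 events, mean duration
# ET = 10 and summed squared deviation ssd = 80, ST = 80 / (4 * 10) = 2.0; lower values
# indicate more stable execution times.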

def priorFollowVar(clusters):
    # Creating an empty DataFrame with column names only
    PFv = pd.DataFrame(columns=['cluster', 'activity', 'PFv'])

    for clusteri in range(len(clusters)):
        # Get the paths of the log along with their count
        # returns pairs activity_a, activity_b with count
        paths4act = paths_filter.get_paths_from_log(clusters[clusteri])
        paths4act_df = pd.Series(paths4act).to_frame('count').reset_index()
        paths4act_df[['activityStart', 'activityEnd']] = paths4act_df['index'].str.split(',', n=1, expand=True)
        paths4act_df = paths4act_df.drop(['index'], axis=1)
        paths4act_df.sort_values(by=['activityStart', 'activityEnd'], inplace=True)
        # for activity_a get all possible activity_b's
        # calculate sum of all counts eg. {x_1, y_1,10} {x_1, y_2,30} for x_1 sum is 40
        activityStartSum = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityStartSum = activityStartSum.groupby(by=['activityStart']).sum().groupby(level=[0]).cumsum().reset_index()
        activityStartSum = activityStartSum.rename(columns={'count': 'sum'})
        # get maximum of those counts eg. {x_1, y_1,10} {x_1, y_2,30} for x_1 max is 30
        activityStartMax = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityStartMax = activityStartMax.groupby(by=['activityStart']).max().reset_index().drop(['activityEnd'], axis=1)
        activityStartMax = activityStartMax.rename(columns={'count': 'max'})
        # get number of pairs eg. {x_1, y_1,10} {x_1, y_2,30} for x_1 ndist is 2
        activityStartNdist = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityStartNdist = activityStartNdist.groupby(by=['activityStart']).size().reset_index(name='counts')
        activityStartNdist = activityStartNdist.rename(columns={'counts': 'ndist'})
        activityStartdf = pd.merge(left=activityStartNdist, right=activityStartSum, on=['activityStart'], how='inner')
        activityStartdf = pd.merge(left=activityStartdf, right=activityStartMax, on=['activityStart'], how='inner')
        activityStartdf['PFvstart'] = activityStartdf.apply(lambda x: (1.0/x['ndist']) * (x['max']/x['sum']), axis=1)
        # for activity_b get all possible activity_a's
        activityEndSum = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityEndSum = activityEndSum.groupby(by=['activityEnd']).sum().groupby(level=[0]).cumsum().reset_index()
        activityEndSum = activityEndSum.rename(columns={'count': 'sum'})
        activityEndMax = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityEndMax = activityEndMax.groupby(by=['activityEnd']).max().reset_index().drop(['activityStart'], axis=1)
        activityEndMax = activityEndMax.rename(columns={'count': 'max'})
        activityEndNdist = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
        activityEndNdist = activityEndNdist.groupby(by=['activityEnd']).size().reset_index(name='counts')
        activityEndNdist = activityEndNdist.rename(columns={'counts': 'ndist'})
        activityEnddf = pd.merge(left=activityEndNdist, right=activityEndSum, on=['activityEnd'], how='inner')
        activityEnddf = pd.merge(left=activityEnddf, right=activityEndMax, on=['activityEnd'], how='inner')
        activityEnddf['PFvend'] = activityEnddf.apply(lambda x: (1.0/x['ndist']) * (x['max']/x['sum']), axis=1)
        # combine activityStartdf and activityEnddf
        # outer join since some activities may only be starting ones or ending ones
        PFv_df = pd.merge(left=activityStartdf.drop(['ndist', 'sum', 'max'], axis=1),
                          right=activityEnddf.drop(['ndist', 'sum', 'max'], axis=1),
                          right_on=['activityEnd'], left_on=['activityStart'], how='outer')
        PFv_df = PFv_df.fillna(0)
        PFv_df['PFv'] = PFv_df.apply(lambda x: x['PFvstart'] + x['PFvend'], axis=1)
        PFv_df = PFv_df.drop(['activityEnd', 'PFvstart', 'PFvend'], axis=1)
        PFv_df = PFv_df.rename(columns={'activityStart': 'activity'})
        PFv_df['cluster'] = clusteri
        # append to result df
        PFv = pd.concat([PFv, PFv_df], sort=False)
    return(PFv)
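
# Illustrative arithmetic for the prior/follow variability (PFv) computed above, reusing
# the hypothetical pairs from the comments: for x_1 with outgoing pairs {x_1, y_1, 10} and
# {x_1, y_2, 30}, ndist = 2, max = 30 and sum = 40, so PFvstart = (1/2) * (30/40) = 0.375;
# an activity with a single dominant successor and predecessor approaches PFv = 2.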

def getIndicators(clusters, activityKey):
    ############## Execution Frequency & case ############
    print("Calculating Execution Frequency & Execution Frequency:case")
    EF_EFc = execFreq(clusters, activityKey)
    ############## Execution Time & Inverse Stability ############
    print("Calculating Execution Time & Inverse Stability")
    indicators = execTime(clusters, EF_EFc)
    ############## Prior/follow variability ############
    print("Calculating Prior/follow variability")
    PFv = priorFollowVar(clusters)
    ############## all indicators ############
    indicators = pd.merge(left=indicators, right=PFv, right_on=['cluster', 'activity'],
                          left_on=['cluster', 'activity'], how='left')
    indicators = indicators.fillna(0)
    return(indicators)

def generatePlots(indicators):
    plotTitles = ['Execution Frequency', 'Execution Frequency: case', 'Execution Time',
                  'Inverse Stability', 'Prior Follow Variability']
    plotFilenames = ['ExecutionFrequency', 'ExecutionFreqCase', 'ExecutionTime',
                     'InverseStability', 'PriorFollowVar']
    ############## plot ############
    for i, j in zip(range(2, indicators.shape[1]), range(len(plotTitles))):
        plt.style.use('seaborn-whitegrid')
        plt.figure()
        xindex = np.arange(len(indicators))
        colname = indicators.columns[i]
        indicators.sort_values(by=[colname], inplace=True)
        plt.bar(xindex.astype('U'), indicators[colname], color='indigo', width=1)
        plt.xticks(np.arange(0, len(indicators), 50))
        plt.title(plotTitles[j])
        plt.xlabel('Activities')
        plt.savefig('plots/' + plotFilenames[j] + '.png')

def main():
    ## input parameter data
    filexes = sys.argv[1:][0]
    ############## import ############
    path = ".."
    os.chdir(path)
    log = xes_import_factory.apply(os.path.join(os.getcwd(), "data", filexes))
    print("Event log loaded with number of traces:", len(log))
    ############## trace clustering ############
    # Arguments after the XES filename: the activity key is required, a trace attribute is optional
    activityKey = sys.argv[1:][1]
    if (len(sys.argv) - 2) > 1:
        clusterParams = {"str_ev_attr": activityKey, "str_tr_attr": sys.argv[1:][2]}
    else:
        clusterParams = {"str_ev_attr": activityKey}
    clusters = clusterlog(log, clusterParams)
    ############## calculate indicators ############
    indicators = getIndicators(clusters, activityKey)
    print("Generating Indicator plots and saving the files")
    ############## plot ############
    generatePlots(indicators)

if __name__ == "__main__":
    main()
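
# Running the script as described in the header docstring writes one bar chart per
# indicator to the plots folder: ExecutionFrequency.png, ExecutionFreqCase.png,
# ExecutionTime.png, InverseStability.png and PriorFollowVar.png.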
