Unleashing Robotic Process Automation

Master Thesis

Author: Maria Gkotsopoulou
Supervisor: Josep Carmona Vargas

Department of Computer Science
Abstract
Robotic Process Automation (RPA) is an umbrella term for tools that run on an
end user’s computer, emulating tasks previously executed through a user interface by
means of a software robot. Many studies have highlighted the potential benefits of
RPA: by bridging artificial intelligence and business process management, it holds
the promise of robots as a virtual workforce that perform tasks, leading to efficiency
improvements in business processes. However, most commercial tools do not cover
the Analysis phase of the RPA lifecycle. The lack of automation in this phase is
mainly reflected in the absence of technological solutions for identifying the best
candidate processes of an organization to be automated. Based on process mining
techniques, we seek to define an automatable indicator system to guide and direct
companies that seek to better prioritize their RPA activities.
Keywords: Process mining, Robotic Process Automation
Acknowledgements
Contents
1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Thesis Structure
2 Preliminaries
  2.1 Mathematical Notations
  2.2 Algorithms for Event Data
    2.2.1 Trace clustering
3 Related Work
5 Implementation
  5.1 Code implementation
6 Evaluation
  6.1 BPI Challenge 2012
  6.2 Artificial UI logs
7 Conclusion
A Appendix
  A.1 Code implementation
1 | Introduction
Unleashing Robotic Process Automation Through Process Mining
[31]. This is often caused by a false notion of process complexity and a lack of
transparency in how processes are executed [54].
1.1. Motivation

1.2. Research Questions
A set of research questions has been formulated in order to make explicit what must
be addressed to fulfill the purposes of this study. The research questions are as
follows:
• RQ1: How to assess the suitability of a process task extracted from an event
log to support the Analysis phase of an RPA project?
• RQ2: How to define an objective and generalizable methodology to evaluate
the quantifiable decision support framework of process task selection in RPA?
1.3. Thesis Structure
In this section, we present an outline of the remainder of this thesis. In Chapter 2,
general definitions and concepts related to this thesis work are introduced. Chapter 3
reviews related work, be it in the form of case studies, prototypes or systems, related
to the identification, elicitation and design of the to-be-automated tasks in an RPA
project. In Chapter 4, the general framework, including the set of indicators to be
used in the Analysis phase, is presented. Chapter 5 gives a brief presentation of the
system design, as well as the implementation decisions taken. Chapter 6 conducts
an evaluation on both real-life event data and synthetic event data. Chapter 7
concludes this thesis and presents possible limitations.
2 | Preliminaries

In this chapter, we introduce multiple definitions and notations that we use in this
thesis. First, basic mathematical concepts are presented. Second, concepts related
to process mining are introduced. Finally, definitions about trace clustering are
presented.
∀x ∈ X : f (x) ∈ Y . The range of f is the set of the images of all elements in the
domain, i.e., range(f ) = {f (x) | x ∈ dom(f )}.
A partial function g ∈ X ⇸ Y is a function whose domain is a subset of X, i.e.,
dom(g) ⊆ X, which means that g need not be defined for all values in X.
A binary relation R ⊆ X × X is referred to as an endorelation on X. For such
endorelations, the following properties are of interest in the context of this thesis.
• R is reflexive, if and only if, ∀x ∈ X(xRx).
• R is irreflexive, if and only if, ∀x ∈ X ¬(xRx).
• R is symmetric, if and only if, ∀x, y ∈ X(xRy ⇒ yRx)
• R is antisymmetric, if and only if, ∀x, y ∈ X(xRy ∧ yRx ⇒ x = y)
• R is transitive, if and only if, ∀x, y, z ∈ X(xRy ∧ yRz ⇒ xRz)
A relation ≤ ⊆ X × X, alternatively written (X, ≤), is a partial order, if and only
if, it is reflexive, antisymmetric and transitive. A relation ≺ ⊆ X × X, alternatively
written (X, ≺), is a strict partial order, if and only if, it is irreflexive, antisymmetric
and transitive.
Definition 2.1.4 (Function Projection). Let f ∈ X ⇸ Y be a (partial) function
and Q ⊆ X. f↾Q is the function f projected on Q: dom(f↾Q ) = dom(f ) ∩ Q and
f↾Q (x) = f (x) for x ∈ dom(f↾Q ).
In some cases, we are interested in a particular element of a tuple. To this end we
define a projection function, i.e., given 1 ≤ i ≤ n, πi : X1 × . . . × Xn → Xi , s.t.
πi ((x1 , . . . , xi , . . . , xn )) = xi , e.g. for (x, y, z) ∈ X × Y × Z, we have π1 ((x, y, z)) =
x, π2 ((x, y, z)) = y, π3 ((x, y, z)) = z. In the remainder of this thesis, we omit the
explicit surrounding braces of tuples when applying a projection on top of them,
i.e. we write π1 (x, y, z) rather than π1 ((x, y, z)).
Definition 2.1.5 (Multi-set). A multiset (or bag) generalizes the concept of a set
and allows elements to have a multiplicity, i.e. degree of membership, exceeding
one.
A multiset b is a function b : X → N0 , which maps each element to its number of
occurrences in the multiset, i.e., b(x) denotes the number of times element x ∈ X
appears in b. Given a set X, B(X) := {b | b : X → N0 } denotes all possible multisets
over set X, with size |b| = ∑x∈X b(x).
For example, given set X = {x1 , x2 }, [x1², x2 ] is a multiset containing element x1 two
times and element x2 once. Note that sets are written using curly brackets, while
multisets are written using square brackets.
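The multiset notation above maps directly onto Python's standard-library `Counter`; a minimal sketch (the element names are illustrative, not part of the formal definitions):

```python
from collections import Counter

# A multiset b : X -> N0 maps each element to its multiplicity.
# Counter is a natural encoding: missing keys have count 0.
b = Counter({"x1": 2, "x2": 1})          # the multiset [x1^2, x2]

assert b["x1"] == 2 and b["x2"] == 1
assert b["x3"] == 0                      # elements not in the bag have multiplicity 0

# Size |b| is the sum of multiplicities.
assert sum(b.values()) == 3

# Disjoint union adds multiplicities pointwise.
m1, m2 = Counter({"a": 1, "c": 2}), Counter({"b": 1, "c": 1})
union = m1 + m2
assert union == Counter({"a": 1, "b": 1, "c": 3})
```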
Given two multisets M ∈ B(A) and M′ ∈ B(B), their disjoint union M ⊎ M′ ∈
B(A ∪ B) is defined by (M ⊎ M′)(c) = M (c) + M′(c) for all c ∈ A ∪ B (to ease
readability we assume that M (c) = 0 if c ∉ dom(M ), and likewise for M′).
Definition 2.1.6 (Partition). A partition P(X) of a set X is a set of subsets of X
that is mutually exclusive (no element is in more than one subset) and jointly
exhaustive (every element is in one subset), s.t.
• ∪X′∈P(X) X′ = X, ∀X′ ∈ P(X)(X′ ≠ ∅)
• ∀X′ ≠ X″ ∈ P(X), X′ ∩ X″ = ∅
Definition 2.1.7 (Sequence). Sequences represent enumerated collections of ele-
ments which additionally, like multisets, allow their elements to appear multiple
times. However, within sequences we explicitly keep track of the order of the elements.
A sequence of length n over set X is an enumerated ordered collection of elements,
which is defined as a function σ : {1, 2, . . . , n} → X. We write σ = ⟨σ1 , σ2 , . . . , σn ⟩
to denote a sequence of length n, i.e. |σ| = n.
• ⟨⟩ denotes the empty sequence.
• For a given set X, X∗ is the set of all finite sequences over X, including the
empty sequence.
For example, if X = {x1 , x2 }, then X∗ = {⟨⟩, ⟨x1 ⟩, ⟨x2 ⟩, ⟨x1 , x2 ⟩, ⟨x1 , x2 , x1 ⟩, . . . }
• σ1 · σ2 denotes the concatenation of sequences σ1 and σ2 , e.g., ⟨a, b, c⟩ · ⟨d, e⟩ =
⟨a, b, c, d, e⟩.
Definition 2.1.8 (Vector). We define ~v ∈ Rn = (v1 , . . . , vn )T , s.t. v1 , . . . , vn ∈ R,
as an n-dimensional column vector, and ~v T ∈ Rn = [v1 , . . . , vn ] denotes its transpose,
i.e., the corresponding row vector.
Note that the same trace may appear multiple times in an event log.
Definition 2.2.4 (Event log). An event log is a set of cases L ⊆ C such that each
event occurs only once, i.e., for any two cases c1 , c2 ∈ C, c1 6= c2 : Ec1 ∩ Ec2 = ∅.
Alternatively, L ∈ P(C), s.t. ∀c ∈ L, πt (c) ≠ ⟨⟩.
If activity projection is used, then a trace corresponds to a sequence of activities.
In such a scenario, two or more cases can correspond to the same activity sequence.
Therefore, an event log is a multi-set of traces.
A trace σ ∈ E ∗ is a sequence of events. Let Σ = E ∗ be the universe of traces. An
event log L is a multiset of traces, i.e. L ∈ B(Σ).
Definition 2.2.5 (Simple Log, Trace Variant). A simple event log EL ∈ B(A∗ )
represents the control-flow perspective of event log L. This view is achieved by
projecting each trace in an event log on its corresponding activities, i.e. given σ ∈ L
we apply πa∗ (σ). Thus we transform each trace within the event log into a sequence
of activities:

EL = ⨄σ∈traces(L) πa∗ (σ)

where traces(L) denotes the collection of the traces described by the event log.
Thus a simple event log is a multiset of sequences of activities, as multiple traces
of events are able to map to the same sequence of activities. As such, each member
of a simple event log is referred to as a trace, yet each sequence σ ∈ A∗ for which
EL(σ) > 0 is referred to as a trace-variant. Hence, in case we have EL(σ) = k,
there are k traces describing trace-variant σ, and, the cardinality of trace-variant σ
is k.
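The trace-variant view can be sketched in a few lines, assuming traces have already been projected onto activity sequences (the activity names reuse the app-rating example of Figure 2.1):

```python
from collections import Counter

# Hypothetical simple event log: each trace is already projected onto its
# activity sequence (a tuple), so the log is a multiset of such sequences.
traces = [
    ("s", "g", "c", "d"),
    ("s", "g", "c", "d"),
    ("s", "b", "f", "a"),
]

# The simple event log EL is the multiset of trace variants.
EL = Counter(traces)

# Two traces describe the variant <s, g, c, d>, so its cardinality is 2.
assert EL[("s", "g", "c", "d")] == 2
assert EL[("s", "b", "f", "a")] == 1
# Number of distinct trace variants vs. total number of traces.
assert len(EL) == 2 and sum(EL.values()) == 3
```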
The goal of process mining is to automatically produce process models that accu-
rately describe processes by considering only an organization’s records of its opera-
tional processes [22]. Such records are typically captured in the form of event logs,
consisting of cases and events related to these cases. Using these event logs pro-
cess models can be discovered. The task of constructing a process model from an
event log is termed process discovery [5]. Many such process discovery techniques
have been developed, producing process models in various forms, such as Petri nets,
BPMN-models, EPCs, YAWL-models [22].
Definition 2.2.6 (Petri Nets, Workflow Nets and Block-structured Workflow Nets
[63]). A Petri net is a bipartite graph containing places and transitions, intercon-
nected by directed arcs. A transition models a process activity, places and arcs
model the ordering of process activities. We assume the standard semantics of Petri
nets here, see [78]. A workflow net is a Petri net having a single start place and a
single end place, modeling the start and end state of a process. Moreover, all nodes
are on a path from start to end [5]. A block-structured workflow net is a hierarchical
workflow net that can be divided recursively into parts having single entry and exit
points.
Model of Figure 2.1 describes the behaviors of users rating an app [17]. First, users
start the form (s). They give either a good (g) or a bad (b) mark attached to a
comment (c) or a file (f ). Bad ratings get apologies (a), a silent transition (τ )
enables to avoid them. Finally, users can donate to the developers of the app (d).
The ordering of events within a case is relevant, while the ordering of events among
cases is of no importance. Logs are analyzed for causal dependencies, e.g., if a task
is always followed by another task, it is likely that there is a causal relation between
both tasks. To analyze these relations, the following notation is used.
Definition 2.2.7 (Log-based ordering relations [5]). Let L be an event log over A,
i.e. L ∈ B(A∗ ). Let a, b ∈ A:
• a >L b if and only if there is a trace σ = ⟨t1 , t2 , . . . , tn ⟩ and i ∈ {1, . . . , n − 1}
such that σ ∈ L and ti = a and ti+1 = b
• a →L b if and only if a >L b and b ≯L a
• a #L b if and only if a ≯L b and b ≯L a
• a ∥L b if and only if a >L b and b >L a
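These ordering relations can be derived mechanically from a simple log; a sketch over a small illustrative log (activity names are made up for the example):

```python
# Derive the log-based ordering relations of Definition 2.2.7 from a simple log.
log = [("a", "b", "c"), ("a", "c", "b")]

# All directly-follows pairs a >_L b observed anywhere in the log.
directly_follows = {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}

def causal(x, y):        # x ->_L y : x > y but never y > x
    return (x, y) in directly_follows and (y, x) not in directly_follows

def unrelated(x, y):     # x #_L y : never observed adjacent in either order
    return (x, y) not in directly_follows and (y, x) not in directly_follows

def parallel(x, y):      # x ||_L y : observed adjacent in both orders
    return (x, y) in directly_follows and (y, x) in directly_follows

assert causal("a", "b")      # a is directly followed by b, never the reverse
assert parallel("b", "c")    # b and c appear in both orders -> interleaved
assert unrelated("a", "a")   # a is never directly followed by itself
```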
Process discovery is challenging because the derived model has to be fitting, pre-
cise, general, and simple [90]. We give a general definition of the process discovery
algorithm.
Definition 2.2.8 (Process Discovery Algorithm [73]). Let L be an event log over
activities A and M be the universe of process models. Then process discovery
algorithm D is a function that takes an event log as input and returns a process
model over activities A, i.e., D : L → M
The majority of existing conventional process discovery algorithms share a common
underlying algorithmic mechanism. As a first step, the event log is transformed
into a data abstraction of the input event log on the basis of which they discover a
process model. A commonly used data abstraction is the directly follows relation.
Numerous discovery algorithms use it as a primary/supporting artefact, to discover
a process model [111]. Directly follows relation defines a multiset that contains all
direct precedence relations among the different activities present in the event log
[26]. We write (a, b) or a > b if activity a is directly followed by activity b.
Definition 2.2.9 (Direct Follows Relation [111]). Let L ∈ B(A∗ ) be an event log.
The directly follows relation >L is a multiset over A × A, i.e., >L ∈ B(A × A), for
which, given activities a1 , a2 ∈ A:

>L (a1 , a2 ) = ∑σ∈L |{1 ≤ i ≤ |σ|−1 | σ(i) = a1 ∧ σ(i + 1) = a2 }|

where >L denotes the multiset of all possible directly follows relations in event log
L and >L (a1 , a2 ) denotes the number of occurrences of the directly follows relation
(a1 , a2 ) in the log.
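Counting the directly-follows multiset amounts to tallying adjacent activity pairs across all traces; a sketch over an illustrative log:

```python
from collections import Counter

# Count >_L(a1, a2) of Definition 2.2.9: how often a1 is immediately
# followed by a2 anywhere in the (illustrative) log.
log = [("s", "g", "c", "d"), ("s", "g", "c", "d"), ("s", "b", "c")]

dfg = Counter((t[i], t[i + 1]) for t in log for i in range(len(t) - 1))

assert dfg[("s", "g")] == 2    # 's' directly followed by 'g' in two traces
assert dfg[("g", "c")] == 2
assert dfg[("s", "b")] == 1
assert dfg[("d", "s")] == 0    # pairs never observed have count 0
```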
Process discovery algorithms such as the Inductive Miner [63], the Heuristics Miner
[104], [105], the Fuzzy Miner [45], and most of the commercial process mining tools
use (amongst others) the directly follows relation as an intermediate structure [111].
As per [63] the directly-follows relation can be expressed in the directly-follows graph
of a log L, written G(L). It is a directed graph containing as nodes the activities
of L. An edge (a, b) is present in G(L) if and only if some trace ⟨. . . , a, b, . . .⟩ exists
in L. A node of G(L) is a start node if its activity is in Start(L), with definition
Start(G(L)) = Start(L). Similarly for end nodes in End(L), and End(G(L)). The
definition for G(M ) is similar. Start(L), Start(M ) and End(L), End(M ) denote the
sets of activities with which log L and model M start or end.
Definition 2.2.10 (Process tree[63]). A process tree is a compact abstract rep-
resentation of a block-structured workflow net: a rooted tree in which leaves are
labeled with activities and all other nodes are labeled with operators. A process
tree describes a language, an operator describes how the languages of its subtrees
are to be combined.
Assume a finite alphabet A of activities and a set ⊗ of process tree operators to be
given. Symbol τ ∉ A denotes the silent activity.
– α with α ∈ A ∪ {τ} is a process tree;
– Let M1 , . . . , Mn with n > 0 be process trees and let ⊗ be a process tree
operator, then ⊗(M1 , . . . , Mn ) is a process tree
Operator × means the exclusive choice between one of the subtrees, → means the
sequential execution of all subtrees, ⟲ means the structured loop of loop body M1
and alternative loop back paths M2 , . . . , Mn , and ∧ means a parallel (interleaved)
execution.
Example process trees and their languages [62]: for instance, ×(a, b) describes
{⟨a⟩, ⟨b⟩}, →(a, b) describes {⟨a, b⟩}, ∧(a, b) describes {⟨a, b⟩, ⟨b, a⟩}, and ⟲(a, b)
describes {⟨a⟩, ⟨a, b, a⟩, ⟨a, b, a, b, a⟩, . . . }.
The idea for the Inductive Miner [63] algorithm is to find in G(L) structures that
indicate the ’dominant’ operator that orders the behaviour. Each of the four
operators ×, →, ⟲, ∧ has a characteristic pattern in G(L) that can be identified by
finding a partitioning of the nodes of G(L) into n sets of nodes with characteristic
edges in between [63]. Given a set ⊗ of process tree operators, Leemans et al. [63]
define a framework B to discover a set of process models using a divide and conquer
approach. Given log L, B searches for possible splits of L, such that these logs
combined with an operator ⊗ can produce L again. It then recurses on the found
divisions and returns a cartesian product of the found models.
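One of the four characteristic patterns is straightforward to illustrate: an exclusive-choice (×) cut corresponds to the connected components of the directly-follows graph when edge direction is ignored. A minimal sketch of this single cut, not the full framework B of Leemans et al.:

```python
# Sketch of one Inductive Miner cut: the exclusive-choice (x) partition
# corresponds to the connected components of the undirected DFG.
def xor_cut(activities, edges):
    """Partition activities into connected components of the DFG."""
    neighbours = {a: set() for a in activities}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)      # treat edges as undirected
    seen, parts = set(), []
    for a in activities:
        if a in seen:
            continue
        component, stack = set(), [a]
        while stack:              # depth-first search collects one component
            x = stack.pop()
            if x in component:
                continue
            component.add(x)
            stack.extend(neighbours[x])
        seen |= component
        parts.append(component)
    return parts

# Log {<a,b>, <c,d>}: either branch is executed, never both, so the DFG
# splits into two components, suggesting x( ->(a,b), ->(c,d) ).
parts = xor_cut(["a", "b", "c", "d"], [("a", "b"), ("c", "d")])
assert sorted(map(sorted, parts)) == [["a", "b"], ["c", "d"]]
```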
Process discovery is a largely unsupervised learning task in nature due to the fact
that event logs rarely contain negative events to record that a particular activity
could not have taken place [103]. Various studies illustrate that process discovery
techniques experience difficulties to render accurate and interpretable process models
out of event logs stemming from highly flexible environments [20], [41], [46], [98].
Since flexible environments typically allow for a wide spectrum of potential behavior,
the analysis results are equally unstructured [88].
Different approaches have been proposed to cope with the issue of high variety of
behavior that is captured in certain event logs. Next to event log filtering, event log
transformation [19] and tailor-made discovery techniques such as Fuzzy Mining [45],
trace clustering can be considered a versatile solution for reducing the complexity
of the learning task at hand [103]. The Fuzzy Miner [45] is a process discovery
technique developed to deal with complex and flexible process models. It connects
nodes that represent activities with edges indicating follows relations, taking into
account the relative significance of follows/precedes relations and allowing the user
to filter out edges using a slider. However, the process models obtained using the
Fuzzy Miner lack formal semantics [92]. If the event log shows a lot of behavioral
variability, the process model discovered for all traces together is too complex to
comprehend; in those cases, it makes sense to split the event log into clusters and
apply discovering techniques to the log clusters, thus obtaining a simpler model for
each cluster [68].
By dividing traces into different groups, process discovery techniques can be applied
on subsets of behavior and thus improve the accuracy and comprehensibility [103].
Given this diversity of an event log, it is a valid assumption that there are a number of
tacit process variants hidden within one event log, each significantly more structured
than the complete process [88]. Tacit processes will usually not be explicitly known,
however, the similarity of cases can be measured and used to divide the set of cases
into more homogeneous subsets.
Several techniques have been proposed in the last decade for trace clustering [20],
[21], [37], [43], [49], [88], [103]. In literature [17] they are partitioned into vector
space approaches [43], [88], context aware approaches [20], [21] and model-based
approaches [37], [49], [103]. Feature vector approaches require one to define feature
sets and transform traces in the event log to vectors representing the feature sets.
Figure 2.3: Using trace clustering in business process model discovery [89]
Definition 2.2.11 (Trace Clustering [17]). Given a log L, a trace clustering over L
is a partition over a (possibly proper) subset of the traces in L.
Feature sets describe the different attributes recorded in each case. Song et al.
[88] defined feature sets regarding various perspectives of process instances, i.e.,
the control-flow, resource, organization, and time perspectives. In their approach
traces are characterized by profiles, where a profile is a set of related items which
describe the trace from a specific perspective. Therefore, a profile with n items can
be considered as a function, which assigns to a trace a vector hi1 , i2 , . . . , in i. These
resulting vectors can subsequently be used to calculate the distance between any two
traces, using a distance metric [88]. To this extent, the activity profile defines one
item per type of activity (i.e., event name) found in the log. Measuring an activity
item is performed by simply counting all events of a trace, which have that activity’s
name. Moreover, the transition profile could be considered, in which case the items
in this profile are the direct following relations of the trace. For any combination of
two activity names ⟨A, B⟩, this profile contains an item measuring how often an event
with name A has been directly followed by another event with name B. The distance
between two of such vectors can be calculated using a distance function, e.g., the
Euclidean, Hamming or Jaccard distance [88].
Definition 2.2.12 (Distance Measures). The profile can be represented as an n-
dimensional vector where n indicates the number of items extracted from the process
log. Thus, case cj corresponds to the vector ⟨ij1 , ij2 , . . . , ijn ⟩, where each ijk denotes
the number of appearances of item k in case j [88]. Then, the Euclidean distance
[33], Hamming distance [48], and Jaccard distance [91] are defined as follows.
– Euclidean distance(cj , ck ) = √(∑nl=1 |ijl − ikl |²)
– Hamming distance(cj , ck ) = (∑nl=1 δ(ijl , ikl ))/n, where

δ(x, y) = 0, if (x > 0 ∧ y > 0) ∨ (x = y = 0); 1, otherwise  (2.1)

– Jaccard distance(cj , ck ) = 1 − (∑nl=1 ijl ikl ) / (∑nl=1 ijl² + ∑nl=1 ikl² − ∑nl=1 ijl ikl )
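The three distance measures can be sketched directly over activity-profile vectors; the counts below are invented for the example:

```python
import math

# Activity-profile vectors of two hypothetical cases (counts of n = 4 items),
# and the three distance measures of Definition 2.2.12.
cj = [2, 0, 1, 1]
ck = [2, 1, 0, 1]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    # delta is 0 when both counts are positive or both are zero, else 1
    delta = [0 if (a > 0 and b > 0) or (a == b == 0) else 1
             for a, b in zip(x, y)]
    return sum(delta) / len(x)

def jaccard(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

assert euclidean(cj, ck) == math.sqrt(2)   # two items differ by 1 each
assert hamming(cj, ck) == 0.5              # items 2 and 3 disagree on presence
assert 0 <= jaccard(cj, ck) <= 1
```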
Similar to Yu et al. [110], we chose DBScan. The advantages of DBScan are that it
does not require one to specify the number of clusters, as opposed to K-Means. In
addition, DBScan can find arbitrarily shaped clusters, whereas K-Means assumes
convex clusters. Lastly, DBScan has a notion of noise, since it can detect and ignore
outliers in the dataset.
Definition 2.2.13 (DBScan [100]). The DBScan (density-based spatial clustering
of applications with noise) problem takes as input n points P = {p1 , . . . , pn }, a
distance function d, and two parameters ε and minPts [36]. A point p is a core point
if and only if |{pi | pi ∈ P, d(p, pi ) ≤ ε}| ≥ minPts. We denote the set of core points
as C. DBScan computes and outputs subsets of P, referred to as clusters. Each
point in C is in exactly one cluster, and two points p, q ∈ C are in the same cluster
if and only if there exists a list of points p̄1 = p, p̄2 , . . . , p̄k−1 , p̄k = q in C such that
d(p̄i−1 , p̄i ) ≤ ε. For all non-core points p ∈ P \ C, p belongs to cluster Ci if
d(p, q) ≤ ε for at least one point q ∈ C ∩ Ci . Note that a non-core point can belong
to multiple clusters. A non-core point belonging to at least one cluster is called a
border point and a non-core point belonging to no clusters is called a noise point.
For a given set of points and parameters ε and minPts, the clusters returned are
unique.
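A minimal, illustrative implementation of this definition follows; border points that are density-reachable from several clusters are assigned to the first cluster that reaches them (one of the outcomes the definition allows), and production code would prefer an optimized library implementation:

```python
# Minimal DBScan sketch following Definition 2.2.13: points within eps of a
# core point (one with >= min_pts neighbours, itself included) join its cluster.
def dbscan(points, eps, min_pts, dist):
    n = len(points)
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = {i for i in range(n) if len(neigh[i]) >= min_pts}
    labels = [None] * n                  # None = noise / unvisited
    cluster = -1
    for i in core:
        if labels[i] is not None:
            continue
        cluster += 1
        stack = [i]
        while stack:                     # expand density-reachable points
            p = stack.pop()
            if labels[p] is not None:
                continue
            labels[p] = cluster
            if p in core:                # only core points keep expanding
                stack.extend(q for q in neigh[p] if labels[q] is None)
    return labels

# Two dense 1-d groups plus one distant noise point.
pts = [1.0, 1.1, 1.2, 8.0, 8.1, 8.2, 50.0]
labels = dbscan(pts, eps=0.5, min_pts=2, dist=lambda a, b: abs(a - b))
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
assert labels[6] is None                 # outlier marked as noise
```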
The disadvantages of DBScan are that it is sensitive to the parameter choice. If ε
is too high, dense clusters are merged together [52]. In addition, DBScan is not
deterministic, as opposed to K-Means: border points that are density-reachable
from more than one cluster can be assigned to either cluster, depending on the
order in which the data is processed [36].
In summary, the approach used in this study consists in abstracting the features
of the traces from event logs into profiles such as the activity profile, case or event
attributes profile. Then, trace clustering can be applied using the DBScan algorithm
and an appropriate distance measure. To this extent, trace clustering techniques
divide the raw event log into Lv sublogs, |v| partitions of the event log L as per the
number of clusters obtained from the DBScan execution, where each sublog contains
the traces with similar behaviors.
Business process management [34] (BPM) revolves around the effective scheduling,
orchestration and coordination of the different activities and tasks that comprise a
(business) process. Nowadays, Business processes (BPs) are enacted in many com-
plex industrial (e.g., manufacturing, logistics, retail) and non-industrial (e.g., emer-
gency management, healthcare, smart environments) domains through a dedicated
generation of information systems, called Process Management Systems (PMSs) [81].
Organizations, currently faced with the challenge of keeping up with the increasing
digitization, seek to adapt existing business models and to respectively improve
the automation of their business processes [69]. Within BPM, the challenge of
accurately automating a business process is known as Business Process Automation
(BPA) [86]. Techniques that allow us to apply automation on the user interface (UI)
level of a computer system have been developed for decades; however, it is only
recently that they have been adopted under the name Robotic Process Automation
(RPA) [93]. Enriquez et al.
[35] put forward that RPA could be considered a process-oriented optimization and
management strategy with a clear multidisciplinary nature because this strategy
involves multiple stakeholders (Subject Matter Experts – SME –, Business Analysts
– BA –, Software Developers – SD –, etc.). The entry barrier for adopting RPA in
processes that are already in place is lower compared to conventional BPA [107],
since RPA tools operate on the user interface level rather than on the system level.
Most applications of RPA have been done to automate tasks of service business
processes, like validating the sale of insurance premiums, generating utility bills,
paying health care insurance claims, and keeping employee records up to date,
among others [57].
Two broad types of RPA use cases are generally distinguished [66]:
• attended: An attended bot assists a user in performing her daily tasks; it can
be interrupted, paused or stopped at any time. Attended bots run on local
machines and manipulate the same applications as a user. They can be used
to automate routines that require dynamic inputs or human judgement, or
when the routine is likely to have exceptions.
• unattended: Unattended bots are used for back-office functions. They usu-
ally run on an organization’s servers and are suitable for executing deterministic
routines where all inputs and outputs are known in advance, and all execution
paths are well defined.
Telefonica’s comparison of RPA versus Business Process Management Systems
(BPMS), used in automation, reveals that RPA for 10 automated processes would
pay back in 10 months, whereas with the BPMS it was going to take up to three
years to pay back [59]. In addition, a number of case studies, in companies such as O2 and
Vodafone, have shown that RPA technology can lead to improvements in efficiency
for BPs involving routine work ([8], [40], [59]).
Determining which process steps (also called routines) are good candidates to be
automated is not only the first task to be performed when conducting an RPA
project, but also one of its key challenges ([7], [8], [35]). So far, most of the
research has focused on the establishment of criteria and guidelines ([14], [38], [40],
[57]), as a means to support organizations in addressing this challenge. According to
most criteria, the best candidates to be subjected to RPA projects are back office
areas [8], [40].
• Rule-based processes with high volume of manual tasks and handling time [40]
– Highly frequent tasks ([38], [58], [79])
– Low complexity of tasks [58]
– Degree to which the process is rule-based, Number of process steps [14]
– Repetitive process tasks with a high volume of transactions, sub-tasks,
and frequent interactions between different systems or interfaces [11]
– Execution Frequency (EF): count of each activity belonging to the
same process, Execution Time (ET): Average execution time of a pro-
cess task [101]
• Processes with fixed procedures, that are standardized and mature [40]
– Low level of exceptions, Involving an enclosed cognitive scope, Susceptible
to human errors [38]
– Processes that involve routine tasks, structured data and deterministic
outcomes [8]
– Degree of human intervention, Structuredness of data, Process standard-
ization, Degree of process maturity, Number of (known) exceptions [14]
– Streamlined process tasks with a-priori knowledge of possible events and
outcomes of process task executions [11], [108]
– Standardization (SD): number of different prior and following activi-
ties of each activity in a given process instance, Stability (ST): normal-
ized sum of the squared differences of the execution times of an activity
in a process instance and the average duration of that activity [101]
– Process tasks with a low probability of exception and a high predictability
of outcomes [9], [71]
• Automation Rate (AR) [101]; Process tasks with a small number of steps
that are already automated and offer less significant economic benefits [40]
– Degree of process digitization, Degree of similarity of environments, La-
bor intensity, Number of systems involved, Frequency of system related
changes, Risk-proneness [14]
– Failure Rate (FR): Throwbacks ratio of process tasks, i.e. unusual and
repetitive (partial) tasks until completion [101]
– Processes where a technical integration via the backend is too costly
and/or impossible [40]
Wanner et al.’s [101] system of indicators (highlighted in boldface) falls in line with
the current body of literature, as it was defined based on a literature analysis and
before they were sent out to the supplier [28]. Finally, since an RPA initiative is not
a one-time project but requires continuous monitoring of results, process mining can
provide fast and powerful insights into RPA’s impact on process performance KPIs
such as throughput times [40].
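Two of the boldfaced indicators, Execution Frequency (EF) and Execution Time (ET), can be sketched from an event log; the case IDs, activity names and durations below are hypothetical:

```python
from collections import defaultdict

# Hypothetical event log: (case id, activity, duration in minutes) per event.
# EF counts each activity; ET averages its duration, per Wanner et al.'s
# descriptions above. This is an illustrative sketch, not their implementation.
events = [
    ("c1", "enter claim", 5.0),
    ("c1", "check claim", 12.0),
    ("c2", "enter claim", 4.0),
    ("c2", "check claim", 10.0),
    ("c2", "enter claim", 3.0),
]

durations = defaultdict(list)
for _case, activity, minutes in events:
    durations[activity].append(minutes)

EF = {a: len(d) for a, d in durations.items()}            # Execution Frequency
ET = {a: sum(d) / len(d) for a, d in durations.items()}   # Execution Time

assert EF["enter claim"] == 3
assert ET["enter claim"] == 4.0        # (5 + 4 + 3) / 3
assert ET["check claim"] == 11.0
```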
On the other hand, most of the body of research does not consider the Analysis
phase, i.e., an analysis of the context to determine which processes – or parts of
them – are candidates to be robotized [79], in isolation, but rather in tandem with
the Design phase, which entails detailing the set of actions, data flow, activities, etc.,
that must be implemented in the RPA process [35]. Process mining techniques
currently expect as input process execution data (e.g. records of process activities
start and completion) whereas for RPA we need to use UI logs (clickstreams, keylogs)
as input [65], which is on a much lower level of granularity. To this extent, there
has been a shift from working on event logs to working on UI logs, when applying
process mining techniques. Techniques have been developed to analyze User
Interface (UI) logs in order to discover repetitive routines that are amenable to
automation via RPA tools ([18], [79]).
Discovering RPA routines is related to the problem of Automated Process Discovery
(APD) [12], which has been studied in the field of process mining. Jimenez-Ramirez
et al. transform recordings of screen, mouse and key events of back-office staff
using an information system (IS) into a standardized event log [44] that includes UI
information obtained using a series of image-analysis algorithms [79]. Then, the generated UI log is
used as input of a process discovery algorithm to obtain a process model, which is
subsequently reviewed by a business analyst using the ProM interface [95]. Their
proposal, thus, represents a tool-supported method, with the potential of shortening
the time required to discover the relevant parts of processes that can be robotized,
since it facilitates the cleansing of the initial models from noise, by means of focusing
on the frequency-of-paths metric. They evaluated their method via a use case
application in a major Spanish bank and a telecommunications company, whereby
they compared it against the conventional analysis and design activities for RPA
scenarios, which represent the a priori model, using measures such as # paths a priori
and # paths final. The fact that the basis of their evaluation is formed by these
two real-life processes, considering a log of a single-user interaction, presents
limitations.
In Bosco et al.’s [18] proposed method, an analysis of UI logs yields the discovery of
routines that are fully deterministic and hence automatable using RPA tools. Their
method consists of compressing the UI log into a Deterministic Acyclic Finite State
Automaton (DAFSA) [83] and subsequently extracting the flat-polygons (the candi-
date automatable routines), which represent action sequences of different lengths.
The use of a lossless representation of event logs (DAFSAs) to discover candidate
automatable routines, as opposed to an automatically discovered process model is
justified by the fact that APD focuses on discovering control-flow models, without
data flow. Moreover, APD approaches generalize and under-approximate the log’s
behavior, in contrast, they seek to discover only sequences of actions that have been
observed in a UI log. Specifically, given a UI log, their method outputs a set of
routine specifications, consisting of an activation condition and a sequence of action
specifications. To this extent, in the discovered sequences of actions (routines), the
inputs of each action in the sequence (except the first one) can be derived from
the data observed in previous actions. To evaluate their method they generated
nine artificial UI logs designed with the CPN Tool [51], each log containing a differ-
ent number of automatable (sub)routines of varying complexity [13]. The only log
where their approach failed to discover routines is one that contains loops, due to
the fact that DAFSAs do not capture loops. The most significant limitations to their
technique is the fact that it can only discover perfectly sequential routines and it’s
inability to deal with noise, in which case, it will either discover only sub-routines
of an otherwise automatable routine, or not discover the routine at all.
Leno et al. [65], in line with the shift from working with event logs to working
on UI logs when applying process mining techniques, aim to develop techniques
for data-aware discovery of procedural and declarative process models that can be
used for RPA bot training. This is because traditional APD techniques discover
control-flow models, while, in the context of RPA, executable specifications that
capture the mapping between the outputs and the inputs of the actions performed
during a routine, need to be discovered. Moreover, the discovered process model
needs to relate post-conditions of a task with pre-conditions, and more generally, it
needs to discover correlated conditions [65]. However, they make the case that there
is an absence of tools capable of recording UI logs that (1) can be given as input
to process mining tools and (2) contain information at the granularity level suitable
for RPA [67]. They specify that a UI log can describe multiple executions of a task;
one execution of a task is called a task trace, and it contains a sequence of actions
required to complete the task [66]. To this extent, they develop a tool, called Action
Logger, that meets functional requirements necessary to generate UI logs amenable
for further RPA-related analysis with process mining [67]. Furthermore, motivated
by the RPA use case of automating data transfers across multiple applications, they
address the problem of analyzing UI logs in order to discover routines where
a user transfers data from one spreadsheet or (Web) form to another [66]. These
routines can be discovered from the set of input-output examples induced by the
routine’s execution, and, can be codified as programs that transform a data structure
into another. Their empirical evaluation, done on a dataset that they build using
the Action Logger tool [67], demonstrates that the proposed technique can discover
various types of transformations. Nevertheless, it has several limitations: (1) it
assumes that the output fields are entirely derived from input fields, (2) it assumes
that the UI log consists of traces such that each trace corresponds to a task execution,
and (3) it is unable to discover conditional behavior, where the condition depends
on the value of an input field.
Furthermore, the approach presented in Gao et al. [39] aims at extracting rules from
UI logs that can be used to automatically fill in forms. This approach, however,
does not discover any data transformations [66]. Moving away from both event and
UI logs, Leopold et al. [69] use as input data, textual process descriptions and
apply supervised machine learning and natural language processing techniques to
classify the process tasks as 1) manual task, 2) user task (interaction of a human
with an information system) or 3) automated task. They evaluate their method
using repeated 10-fold cross validation on a set of 424 activities from a total of 47
textual process descriptions. However, their proposal has the shortcoming
of analyzing what is available in the documentation instead of the actual behavior
of the system [79].
In our study, we will draw our attention solely to the Analysis phase, thus, focusing
on the identification of tasks and/or process properties that are potentially eligible
to be automated using RPA. To this extent, we will use event log data, in order
not to assume any additional economic, as well as time investment, associated to
the recording of a UI log, suitable for the application of process mining techniques.
Following the approach of Wanner et al. [101], we set out to define indicators to guide
companies towards the process tasks where they should direct their efforts when it
comes to undertaking an RPA project. However, we do not introduce the concept
of activity type as they do. The indicators of Execution Frequency and Stability are
otherwise similar to their definitions and, as such, their names are also maintained.
However, in the case of the Execution Time, even if the naming is the same, con-
ceptually there are differences, as in our case there is a transaction type dependency
and a consistent trace condition to be met. Moreover, the Execution Frequency:case
and the Prior/follow variability indicators originate from, but depart from, the concepts
introduced with the definitions of Failure Rate and Standardization, respectively.
Also, contrary to Wanner et al. [101], we will make use of formal notation, as per
the concepts defined in the Preliminaries chapter. Finally, as a log preprocessing step
we apply trace clustering, resulting in $L^v$ sublogs on which the indicators
are defined.
model enhanced with data flow and finally, generating an executable RPA script
that implements the specification [66].
To facilitate the concept development and presentation, we will use a running exam-
ple throughout this Chapter. This running example is based on the BPI Challenge
2012 dataset [97]. This same dataset will be used for evaluation purposes, as well
(see Chapter Evaluation). The event log, used for the BPI Challenge 2012, contains
events related to the application process for a personal loan or overdraft within a
Dutch financial institute. The global process is defined over three sub-processes and
can be summarized as follows: a submitted loan/overdraft application is subjected
to some automatic checks and is declined if it fails to pass them. In addition, customers
are contacted by phone when additional information needs to be obtained, as well as
for incomplete/missing information. The application can be declined after assessing
the responses of eligible applicants to whom offers were sent, or without making any
offer. After the final assessment, the application is either approved and acti-
vated, declined, or cancelled. Furthermore, certain cases are considered suspicious
and a check signalling fraud is performed.
Each case contains a single case level attribute, AMOUNT_REQ, which indicates
the amount requested by the applicant. For each event, the extract shows the type
of event, life-cycle stage (Schedule, Start, Complete), a resource indicator and the
time of event completion.
Each sublog $L^v \subseteq L$ has an activity vector, defined as a function $\vec{A}_{L^v} \in \mathbb{N}_0^{|A|}$ given by $\vec{A}_{L^v}(a_j) = |L^v|_{a_j}$, $\forall a_j \in A$, where $|L^v|_{a_j}$ denotes the number of occurrences of the activity $a_j$ in the sublog $L^v$.
To illustrate, assume we have a simple log EL from which we obtain two sublogs
EL1 = [⟨a, b⟩, ⟨a, b, c⟩] and EL2 = [⟨a, b⟩, ⟨a, c⟩]. The set of all activities of the
complete event log EL is {a, b, c}, so the activity vector for the sublog EL1 is $\vec{A}_{EL_1}$ =
(2, 2, 1) and similarly $\vec{A}_{EL_2}$ = (2, 1, 1).
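As an illustration, the activity vector can be computed with a few lines of Python; the function name and the log encoding (a list of activity-name tuples) are illustrative, not part of the thesis implementation.

```python
from collections import Counter

def activity_vector(sublog, activities):
    """Parikh-style activity vector: count of each activity in a sublog.

    sublog: multiset of traces, given as a list of tuples of activity names.
    activities: the ordered set of all activities of the complete log.
    """
    counts = Counter(a for trace in sublog for a in trace)
    return tuple(counts[a] for a in activities)

EL1 = [("a", "b"), ("a", "b", "c")]
EL2 = [("a", "b"), ("a", "c")]
A = ("a", "b", "c")
print(activity_vector(EL1, A))  # (2, 2, 1)
print(activity_vector(EL2, A))  # (2, 1, 1)
```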
To this extent, Execution Frequency (EF (Lv )) is defined as the count of each activity
belonging to the same Lv sublog.
The feature vector regarding the set of all activities, named activity vector, is in the
form of the Parikh vector introduced in definition 2.1.9, defined here for illustration
purposes.
We calculate, for each sublog $L^v \subseteq L$, the frequency of variants and activities over
all unique variants in which the activity is present. We then obtain the weighted
sum as per the definition below:

$$EF_c(L^v) = \frac{\sum_{\{\sigma_v \in traces(L^v) \mid a_j \in \sigma_v\}} \left( |\sigma_v|_{a_j} \times L^v(\sigma_v) \right)}{\sum_{\{\sigma_v \in traces(L^v) \mid a_j \in \sigma_v\}} L^v(\sigma_v)}, \quad \forall a_j \in A,\ L^v \in \mathcal{P}(L)$$

where $|\sigma_v|_{a_j}$ denotes the number of occurrences of the activity $a_j$ in the trace variant
$\sigma_v$, and a trace variant $\sigma_v$ is a sequence $\sigma \in A^*$ for which $L^v(\sigma) > 0$.
Consider simple sublog EL10 introduced in section 4.1. It contains six traces and
three trace variants. The first trace variant appears 4 times, while the other two
just once.
σ0 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads^2,
W_Beoordelen fraude, W_Afhandelen leads, W_Beoordelen fraude^3,
A_DECLINED, W_Beoordelen fraude⟩
σ1 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED, W_Afhandelen leads^4,
W_Beoordelen fraude, W_Afhandelen leads, W_Beoordelen fraude^3,
A_DECLINED, W_Beoordelen fraude⟩
σ2 = ⟨A_SUBMITTED, A_PARTLYSUBMITTED,
W_Afhandelen leads^8, A_CANCELLED⟩
(a superscript denotes consecutive repetitions of an activity)
Then, we obtain the activity count and variant count for each activity aj .
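The Execution Frequency:case computation for this sublog can be sketched as follows; the helper name and the tuple-based variant encoding are illustrative (the superscripts of the variants above are expanded into repeated activities), and exact fractions are used to keep the arithmetic transparent.

```python
from collections import Counter
from fractions import Fraction

def efc(variant_freqs, activity):
    """Execution Frequency:case for one activity.

    variant_freqs: list of (trace, frequency) pairs, one per trace variant,
    where trace is a tuple of activity names and frequency is L^v(sigma_v).
    Only variants containing the activity contribute to either sum.
    """
    num = den = 0
    for trace, freq in variant_freqs:
        count = Counter(trace)[activity]
        if count > 0:
            num += count * freq
            den += freq
    return Fraction(num, den) if den else Fraction(0)

# The three trace variants of sublog EL10 (superscripts expanded), with frequencies.
AS, AP, AD, AC = "A_SUBMITTED", "A_PARTLYSUBMITTED", "A_DECLINED", "A_CANCELLED"
wal, wbf = "W_Afhandelen leads", "W_Beoordelen fraude"
s0 = (AS, AP, wal, wal, wbf, wal, wbf, wbf, wbf, AD, wbf)
s1 = (AS, AP, wal, wal, wal, wal, wbf, wal, wbf, wbf, wbf, AD, wbf)
s2 = (AS, AP, wal, wal, wal, wal, wal, wal, wal, wal, AC)
EL10 = [(s0, 4), (s1, 1), (s2, 1)]
print(efc(EL10, wbf))  # 5
print(efc(EL10, wal))  # 25/6
```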
A trace is a sequence of events, denoting for a case what activities were executed.
As per definition 2.2.2 in Chapter 2, transaction type is an attribute associated
with an event that refers to the life-cycle of activities. Examples are schedule, start,
complete and suspend [6]. From Figure 4.2, we see that for b only the completion
of the activity instance is recorded, while activity instance a has two events. On the
other hand, activity instance c is first scheduled for execution, then the activity is
started and finally it completes.
Definition 4.3.1 (Consistent trace [23]). A trace is consistent if and only if each
start event has a corresponding completion event and vice versa.
With E as the set of all possible event identifiers, as per definition 2.2.1 in
Chapter 2, it is partitioned into two sets, $E^S$ (start events) and $E^C$ (completion events) [75]:
Definition 4.3.2 (Service Time [75]). Given $E$, $E^S$ and $E^C$, we define the function $st \in
E \rightarrow T$ that maps events onto the duration of the corresponding activity, i.e., the service
time (the time a resource is busy with a task [64]). We assume that there is a one-
to-one correspondence between $E^S$ and $E^C$, i.e., any $e_s \in E^S$ corresponds to precisely
one event $e_c \in E^C$ and vice versa. The service times of these events are equal, i.e.,
$st(e_s) = st(e_c) = \bar{e}_c - \bar{e}_s$.
Execution Time ($ET(L^v)$) is the average duration of an activity $a_j$ belonging to the
same sublog $L^v \subseteq L$. We calculate it for each sublog by summing the service
time over all events belonging to an activity $a_j$ and dividing it by the cardinality of
the set of events belonging to that activity. To this extent, only consistent traces
will yield a value higher than 0.
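A minimal sketch of the service-time computation, assuming a flat list of lifecycle events and the one-to-one start/complete correspondence of Definition 4.3.2; the tuple-based log encoding is illustrative.

```python
from datetime import datetime

def execution_time(events):
    """Average duration (seconds) per activity from lifecycle events.

    events: list of (case_id, activity, transition, timestamp) tuples,
    assuming a one-to-one start/complete correspondence within a case.
    Activities recorded only with a 'complete' event get a 0 duration.
    """
    open_starts, durations = {}, {}
    for case, act, transition, ts in events:
        key = (case, act)
        if transition == "start":
            open_starts.setdefault(key, []).append(ts)
        elif transition == "complete":
            starts = open_starts.get(key, [])
            dur = (ts - starts.pop(0)).total_seconds() if starts else 0.0
            durations.setdefault(act, []).append(dur)
    return {a: sum(d) / len(d) for a, d in durations.items()}

t = lambda s: datetime.fromisoformat(s)
log = [
    ("c1", "W_Valideren aanvraag", "start", t("2012-01-01T10:00:00")),
    ("c1", "W_Valideren aanvraag", "complete", t("2012-01-01T10:10:00")),
    ("c1", "A_REGISTERED", "complete", t("2012-01-01T10:10:00")),
]
print(execution_time(log))
# {'W_Valideren aanvraag': 600.0, 'A_REGISTERED': 0.0}
```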
As per the Table 4.5, for sublog L8 and for activity W_Valideren aanvraag the
service time is 1454.816, while for A_REGISTERED it is 0. Then, the Execution
time of W_Valideren aanvraag for sublog L8 is obtained by calculating the average
duration.
Similar to Wanner et al., we introduce the indicator of Inverse Stability, based
on the variance of an activity's execution times, since our candidate tasks should be fixed and
standardized. A high Inverse Stability would signal non-deterministic outcomes and
low predictability. We measure the Inverse Stability ($ST^{-1}$) of an activity as the
sum of the squared differences between the execution times of the activity in each
execution and the average duration of that activity, normalized by
the number of activity executions.
Then, we can define the Direct Succession Vector similar to definition 4.2.1 as follows.
Definition 4.5.1 (Direct Succession Vector [26]). Let $L \in \mathcal{B}(A^*)$ be an event
log and $>_L$ be the multiset of all possible directly-follows relations in event log $L$. Let
$L^v \subseteq L$ be one sublog and $a_1, a_2 \in A$ be two activities. The direct succession vector
of the sublog $L^v$ is defined as a function $\vec{S}_{L^v} \in \mathbb{N}_0^{|A \times A|}$, given by

$$\vec{S}_{L^v}(a_1, a_2) = \sum_{\sigma \in L^v} |\{1 \leq i \leq |\sigma|-1 \mid \sigma(i) = a_1 \wedge \sigma(i+1) = a_2\}|, \quad \forall a_1, a_2 \in A$$
Let $L^v \subseteq L$ be one sublog and consider any activity $a_j \in A$. The directly-follows
relations of sublog $L^v$ projected onto activity $a_j$ are denoted by $>_{L^v}\{a_j\}$, and the
corresponding direct succession vector is $\vec{S}_{L^v\{a_j\}}$. Furthermore, activity $a_j$ can be
either the first or the second element of the tuple. We split the directly-follows
relations and the direct succession vector accordingly, and denote the two parts by
$\vec{S}^{\,s}_{L^v\{a_j\}}$ and $\vec{S}^{\,e}_{L^v\{a_j\}}$,

$$\forall a_j \in A,\ L^v \in \mathcal{P}(L)$$

with $\|\vec{v}\|$ as the length of a vector $\vec{v}$. The first component relates to the activities
following $a_j$ and the second relates to the activities prior to $a_j$; both range
over (0, 1) and consequently the sum ranges over (0, 2). The best candidates are
the activities with an indicator closer to 2.
To illustrate, assume we have two simple sublogs

EL1 = [⟨a, b⟩, ⟨a, b, c⟩, ⟨a, c⟩, ⟨b, a⟩], EL2 = [⟨a, b⟩, ⟨a, c⟩]

of a certain simple event log EL. Then, the directly-follows relations and the direct
succession vectors are

$>_{EL} = [(a, b)^3, (a, c)^2, (b, a), (b, c)]$, $\vec{S}_{EL_1} = (2, 1, 1, 1)$, $\vec{S}_{EL_2} = (1, 1, 0, 0)$

Consider activity a. Split into $\vec{S}^{\,s}_{EL_1\{a\}} = (2, 1, 0, 0)$ and $\vec{S}^{\,e}_{EL_1\{a\}} = (0, 0, 1, 0)$; then

$$PFv(EL_1) = \frac{1}{2} \times \frac{2}{3} + \frac{1}{1} \times \frac{1}{1} = \frac{4}{3}$$
Consider now the simple sublog EL10 introduced in section 4.1. We obtain the directly-follows
relations and the direct succession vector:

$>_{EL_{10}}$ = [(A_DECLINED, W_Beoordelen fraude)^5,
(A_PARTLYSUBMITTED, W_Afhandelen leads)^6,
(A_SUBMITTED, A_PARTLYSUBMITTED)^6,
(W_Afhandelen leads, A_CANCELLED),
(W_Afhandelen leads, W_Beoordelen fraude)^10,
(W_Beoordelen fraude, A_DECLINED)^5,
(W_Beoordelen fraude, W_Afhandelen leads)^5]

For activity W_Beoordelen fraude, denoted wbf, split into $\vec{S}^{\,s}_{EL_{10}\{wbf\}} = (0, 0, 0, 0, 0, 5, 5)$ and
$\vec{S}^{\,e}_{EL_{10}\{wbf\}} = (5, 0, 0, 0, 10, 0, 0)$; then

$$PFv(EL_{10}) = \frac{1}{2} \times \frac{5}{10} + \frac{1}{2} \times \frac{10}{15} = \frac{7}{12}$$
On the other hand, for activity W_Afhandelen leads, denoted wal, we obtain
a higher value. Split into $\vec{S}^{\,s}_{EL_{10}\{wal\}} = (0, 0, 0, 1, 10, 0, 0)$ and
$\vec{S}^{\,e}_{EL_{10}\{wal\}} = (0, 6, 0, 0, 0, 0, 5)$; then

$$PFv(EL_{10}) = \frac{1}{2} \times \frac{10}{11} + \frac{1}{2} \times \frac{6}{11} = \frac{8}{11}$$
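The arithmetic of these worked examples can be reproduced in a few lines of Python. This is only a sketch: the helper names are illustrative, the EL10 variants are written with consecutive repetitions collapsed (matching the directly-follows multiset of this section), and the score combines the two splits with the ½ · max/sum weighting used in the EL10 computations.

```python
from collections import Counter
from fractions import Fraction

def directly_follows(sublog):
    """Multiset of directly-follows relations of a sublog (list of traces)."""
    df = Counter()
    for trace in sublog:
        for x, y in zip(trace, trace[1:]):
            df[(x, y)] += 1
    return df

def pfv(sublog, activity):
    """Prior/follow variability sketch, reproducing the arithmetic of the
    worked EL10 examples: 1/2 * max/sum over the relations where the
    activity comes first, plus 1/2 * max/sum where it comes second."""
    df = directly_follows(sublog)
    s_start = [c for (x, _), c in df.items() if x == activity]
    s_end = [c for (_, y), c in df.items() if y == activity]
    term = lambda v: Fraction(max(v), sum(v)) if v else Fraction(0)
    return Fraction(1, 2) * term(s_start) + Fraction(1, 2) * term(s_end)

AS, AP, AD, AC = "A_SUBMITTED", "A_PARTLYSUBMITTED", "A_DECLINED", "A_CANCELLED"
wal, wbf = "W_Afhandelen leads", "W_Beoordelen fraude"
# EL10 variants with consecutive repetitions collapsed, as in >EL10.
EL10 = [(AS, AP, wal, wbf, wal, wbf, AD, wbf)] * 5 + [(AS, AP, wal, AC)]
print(pfv(EL10, wbf))  # 7/12
print(pfv(EL10, wal))  # 8/11
```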
By default, our code takes an event log in XES format as input, but it can also be
configured to accept an event log in CSV format. Trace clustering is then performed
using pm4py-clustering. The algorithm that has been implemented is basic:
• A one-hot encoding of the activities of the single events is obtained.
• A PCA is performed to reduce the number of components that are considered
by the clustering algorithm, using the scikit-learn implementation [76].
• The DBScan clustering algorithm is applied in order to split the traces into
groups, using the scikit-learn implementation [76].
The log is transformed to a log representation, using the string event attribute
concept:name by default and an optionally provided string trace attribute. As
trace clustering has not been the main focus of this work, static parameter values are used.
Specifically, clusters are obtained using DBScan, with ε equal to 0.3 and
MinPts equal to 5, applied to PCA-projected data using 3 components.
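The steps above can be sketched as follows; this is a simplified stand-in for the pm4py-based implementation, using a presence-based one-hot encoding and the stated static parameters (ε = 0.3, MinPts = 5, 3 components).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_traces(traces, n_components=3, eps=0.3, min_pts=5):
    """One-hot encode trace activities, project with PCA, cluster with DBSCAN.

    traces: list of traces, each a list of activity names.
    Returns one cluster label per trace (-1 marks noise).
    """
    activities = sorted({a for t in traces for a in t})
    # One-hot (presence) encoding of activities per trace.
    X = np.array([[1.0 if a in t else 0.0 for a in activities] for t in traces])
    n_components = min(n_components, X.shape[1])
    X_proj = PCA(n_components=n_components).fit_transform(X)
    return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_proj)

traces = [["a", "b", "c"]] * 6 + [["a", "x", "y"]] * 6
labels = cluster_traces(traces)
print(labels)  # two clusters of identical traces
```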
In an exploratory setting we could analyze PCA explained variance and consider
choosing number of components based on it. In addition, since the presence of
outliers could be detrimental to the obtained principal components, instead we could
consider following a weighted approach. To this extent, a possible approach would
be to apply weighted PCA based on the eigenvalue decomposition of the weighted
covariance matrix following Delchambre [30]. The weights would be obtained by
first fitting a Minimum Covariance Determinant (MCD) robust estimator and then
getting the Mahalanobis distance. By using a robust estimator of covariance, there is
an associated guarantee that the estimation is resistant to “erroneous” observations
in the data set and that the associated Mahalanobis distances accurately reflect the
true organisation of the observations. However, this method cannot handle the case
where the estimated covariance matrix of the support data is equal to 0 and therefore
its determinant is equal to 0.
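A sketch of how such weights could be obtained with scikit-learn; the 1/(1 + d²) weighting is one illustrative choice, not a prescribed scheme, and the resulting weights would subsequently enter the weighted covariance matrix of Delchambre's method.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_weights(X):
    """Per-observation weights for a weighted PCA, obtained by fitting a
    Minimum Covariance Determinant estimator and down-weighting points
    with large Mahalanobis distances (the exact weighting is a choice)."""
    mcd = MinCovDet(random_state=0).fit(X)
    d2 = mcd.mahalanobis(X)  # squared Mahalanobis distances
    return 1.0 / (1.0 + d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:5] += 10.0  # a few gross outliers
w = robust_weights(X)
# outliers receive much smaller weights than inliers
print(w[:5].mean(), w[5:].mean())
```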
On the other hand, DBScan identifies noisy objects and is therefore resistant to
outliers. It requires two parameters to be initialized:
• MinPts: determines the minimum number of points required to form a cluster.
• ε: represents the distance threshold between two points that can be considered
to be similar.
5 https://parquet.apache.org/
6 https://pandas.pydata.org/
7 https://github.com/mariagkotsopoulou/MIRI-thesis
Fundamentally, in this study we set out to answer the two main Research Questions
introduced in Section 1.2. On the one hand, in an effort to support the Analysis phase
of an RPA project, we seek to devise a methodology to determine the suitability of
a process as an automation candidate. To this extent, having as a starting point an
event log, we follow a trace clustering approach resulting in v sublogs, on which we
apply the set of the indicators defined in Chapter 4. Following the assumption that
each sublog could constitute a process, we would then need to apply an aggregating
function on each indicator and specify an objective function, to obtain the overall
score. We will explore this first Research Question using the BPI Challenge 2012
dataset [97].
On the other hand, we also put forward a more challenging, though essential in premise,
Research Question: that of establishing an objective and generalizable
methodology under which our decision support system should fall. In order to sustain
the usefulness of the contribution put forward through the definition of our indicators,
it is necessary to have an evaluation framework. Nevertheless, there are no quality
measures or baseline models to compare against, as there are in a Machine Learning
setting. Nor are there reference datasets, like, for example, an annotated image
dataset, to draw a parallel from the image classification context. This is indeed an
important drawback in terms of novel concept development. During the literature
review process of this study, we came across the work by Bosco et al. [18], in which
they generated nine artificial UI logs and, most importantly, made them publicly available.
We will explore this second Research Question using those UI logs.
The BPI Challenge 2012 dataset [97] was also used as a Running example in Chapter
4 to facilitate the concept development and presentation of the indicators, defined
with the objective of guiding the identification of tasks and/or process properties
that are eligible to be automated using RPA. In addition, a github repository was
created where we share this evaluation task in the form of a Jupyter Notebook 1.
The event log contains 13,087 traces and 262,200 events, recorded from October 1,
2011 to March 14, 2012, starting with a customer submitting an application and
ending with the eventual conclusion of that application in an Approval, Cancellation
or Rejection (Declined). The last case starts on February 29, 2012, which means
that 13,087 cases were received within a 152-day period, or on average 86 new
applications per day, including weekends and holidays. According to the description
provided on the challenge website [97], the event
log contains events from three intertwined subprocesses, which can be distinguished
by the first letter of each event name (A, O and W ). The A subprocess is concerned
with handling the applications themselves. The O subprocess handles offers sent
1 https://github.com/mariagkotsopoulou/MIRI-thesis
to customers for certain applications. The W process describes how work items,
belonging to the application, are processed.
Our approach begins with a log preprocessing step followed by trace clustering.
In the case of the BPI Challenge 2012 dataset [97], we transform the log to a log
representation, using the string event attribute concept:name and the string trace at-
tribute AMOUNT_REQ. This results in eleven sublogs $L^v$, $v \in \{0, \ldots, 10\}$, containing
2369, 3480, 1216, 1131, 2225, 494, 1266, 883, 12, 5 and 6 traces, respectively.
Then, we obtain the indicators for each sublog-activity pair.
Activities repeated often, i.e. those having high EF (Lv ) are candidates for RPA,
however, those having a high EFc (Lv ) should be investigated to understand the root
causes of the repeats.
In order to determine the candidate tasks for RPA, we first look at Execution Fre-
quency, by obtaining the top 3 most frequently executed activities per sublog. Then, to
obtain a global result we count the number of occurrences of the activities. To this
extent, W_Afhandelen leads is a good candidate for RPA, since it is found 6 times
in the top-3-per-cluster ranking of the Execution Frequency. Nevertheless, it has a
high value of Execution Frequency:case, as per the mean percentile value obtained,
so it should be investigated.
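The top-3 counting used here can be sketched as follows; the per-sublog Execution Frequency values are hypothetical, and the helper name is illustrative.

```python
from collections import Counter

def global_top_candidates(indicator_per_sublog, k=3):
    """Count how often each activity appears in the per-sublog top-k of an
    indicator, given a mapping {sublog: {activity: value}}."""
    votes = Counter()
    for scores in indicator_per_sublog.values():
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        votes.update(top_k)
    return votes.most_common()

# hypothetical per-sublog Execution Frequency values
ef = {
    "L0": {"W_Afhandelen leads": 50, "A_SUBMITTED": 30, "A_DECLINED": 20, "O_SENT": 5},
    "L1": {"W_Afhandelen leads": 40, "O_SENT": 35, "A_SUBMITTED": 10, "A_DECLINED": 2},
}
print(global_top_candidates(ef))
```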
As provided in the description, for subprocesses A and O only the event type com-
plete is present, indicating that a task is completed. For the W subprocess, however,
work items can be created in the queue (schedule event type), obtained by the re-
source (start) and released by the resource (complete). To this extent, values greater
than 0 are only obtained for the W subprocess. This is also the case for the Inverse
Stability indicator.
We further analyze Execution Time, by obtaining the top 3 longest-running activ-
ities per sublog. Then, to obtain a global result we count the number of occurrences
of the activities and calculate the mean of the Execution Time. The resulting po-
tential candidates coincide, to a great extent, considering both Execution Frequency
and Execution Time. So, W_Afhandelen leads, W_Completeren aanvraag and
W_Beoordelen fraude would be possible candidates to look into.
Nevertheless, since our candidate tasks should be fixed and standardized, the indi-
cator of Inverse Stability, which is based on the variance of the Execution Time,
should not be high. Consequently, taking these results into consideration, it seems
that W_Afhandelen leads is not a good candidate.
The proposed indicators in our study are calculated for each sublog activity pair,
and, we have presented a characterisation of these activities averaged across each
sublog. Crucially, in order to determine the suitability of a process, as an automation
candidate, we need to present an aggregated characterisation for each sublog, based
on the assumption that each sublog could constitute a process.
Then, the selection methodology of the candidate processes is based on the identi-
fication of processes with:
As presented in Chapter 3, Bosco et al. [18] propose a method for the discovery of
routines that are fully deterministic, and hence, automatable using RPA tools. Given
a UI log, the goal of their approach is to discover automatable routines, i.e. sets of
routine traces having the parameters’ values of each of their actions derivable from
previous actions parameters’ values [18]. They conducted an experiment to assess
the ability of their approach to correctly discover all the automatable (sub)routines
recorded in a set of UI logs. To this end, they generated nine artificial UI logs, each
log containing a different number of automatable (sub)routines of varying complex-
ity. These UI logs were generated by simulating nine CPNs designed with the CPN
Tool [51]. The first six CPNs are simple and represent common real-life scenarios,
capturing clear routines with a specific goal. The last three CPNs have the highest
complexity and the routines they represent are not easily interpretable. The only
log where their approach failed to discover two routines is log L3, which contains
loops. Consider, for example, the L1 UI log which contains 100 traces, just one trace
variant, each containing 14 events.
Following our presented methodology, we apply trace clustering and obtain only
one sublog which coincides with the fact that L1 contains one routine as presented
in Table 3 [18]. As presented in Table 3 [18] the L1 UI log contains 1400 actions
coinciding with the sum of the Execution Frequency indicator (see table 6.7), while
as presented in Table 4 [18] it contains 13 distinct automatable actions. As per [18],
all the actions in L1 are automatable except the opening file one, since the input
file is chosen randomly.
However, our indicators are defined on the activity level so in Table 6.7 there are 7
activities. An action, as per the definition 1 of [18], not only contains the activity
(referred to as action type (e.g. click button)) but also the action parameters, the
values assigned to the action parameters and the function matching each action
parameter to its value. Though, two actions that have the same set of parameters,
but different assigned values, are still considered equal [18]. To this extent, even if the
click button has an estimated low Prior/follow variability indicator, suggesting that
it might not be a good candidate for RPA, instead, the trace σ0 is deterministic and
constitutes an automatable routine. Indeed, our methodology takes only control-
flow into consideration and neglects the discovery of data conditions, since it does
not relate post-conditions with pre-conditions [65]. As put forward by Leno et al.
[66] in the context of RPA, the goal is to discover executable specifications that
capture the mapping between the outputs and the inputs of the actions performed
during a routine. To this end, our proposed methodology presents a shortcoming.
[Table header: UI log | Routines | AA % | Processes | min(PFv(Lᵛ)/EFc(Lᵛ)) | max(PFv(Lᵛ)/EFc(Lᵛ))]
Our presented methodology is repeated on each UI log with the steps of trace clus-
tering to obtain sublogs on which the indicators are calculated. When it comes to
establishing an objective and generalizable methodology to evaluate the proposed
quantifiable decision support framework of process task selection for RPA there are
some limitations. One issue lies in the fact that the indicators proposed herewith
We followed a two step evaluation procedure. First, the BPI Challenge 2012 dataset
[97] was used to calculate the indicators on each sublog activity pair and ideate an
overall score based on an aggregating function. However, the devised methodology
does not determine the suitability of a process as an automation candidate; it
merely provides a means of prioritizing efforts when undertaking an RPA project.
Another drawback identified is that we analyzed process tasks independently of each
other and neglected dependencies, where automation in one process might influence
other related process tasks. On the other hand, in
the second step of our evaluation procedure we used artificial UI logs included in
Bosco et al. [18] as a reference. In order to establish an objective and generalizable
methodology to evaluate the proposed quantifiable decision support framework of
process task selection for RPA, we build a metric using the indicators of Prior/follow
variability and Execution Frequency:case. The choice is made in an effort to build
an overall, and as much as possible, absolute score, that represents the basis of
the quantifiable decision support framework, in an attempt to refine the presented
methodology in the previous step of the evaluation procedure. However, comparing
the defined metric to the percentage of distinct automatable actions presented in
Bosco et al. [18], we conclude that this proposed metric fails to capture the automa-
tion potential of the processes. Moreover, our indicators are calculated for each
sublog activity while Bosco et al. [18] focuses on actions, thus making apparent the
shortcoming of our methodology, as it takes only control-flow into consideration and
neglects the discovery of data conditions, since it does not relate post-conditions
with pre-conditions [65]. As put forward by Leno et al. [66] in the context of RPA,
the goal is to discover executable specifications that capture the mapping between
the outputs and the inputs of the actions performed during a routine.
[1] W. M. P. van der Aalst, “Process mining - discovery, conformance and en-
hancement of business processes”, 2011.
[2] ——, “Decomposing petri nets for process mining: A generic approach”, Dis-
tributed and Parallel Databases, vol. 31, pp. 471–507, 2013.
[3] W. M. P. van der Aalst, M. Bichler, and A. Heinzl, “Robotic process automa-
tion”, Business & Information Systems Engineering, vol. 60, no. 4, pp. 269–
272, 2018.
[4] W. M. P. van der Aalst, A. Bolt, and S. J. van Zelst, “Rapidprom: Mine your
processes and not just your data”, ArXiv, vol. abs/1703.03740, 2017.
[5] W. M. P. van der Aalst, A. J. M. M. Weijters, and L. Maruster, “Workflow
mining: Discovering process models from event logs”, IEEE Transactions on
Knowledge and Data Engineering, vol. 16, pp. 1128–1142, 2004.
[6] W. van der Aalst, “Process mining: Data science in action”, 2016.
[7] S. Agostinelli, A. Marrella, and M. Mecella, “Towards intelligent robotic pro-
cess automation for bpmers”, ArXiv, vol. abs/2001.00804, 2020.
[8] S. Aguirre and A. Rodríguez, “Automation of a business process using robotic
process automation (rpa): A case study”, in WEA, 2017.
[9] S. Anagnoste et al., “Setting up a robotic process automation center of ex-
cellence”, Management Dynamics in the Knowledge Economy, vol. 6, no. 2,
pp. 307–332, 2018.
[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: Ordering
points to identify the clustering structure”, in SIGMOD ’99, 1999.
[11] A. Asatiani and E. Penttinen, “Turning robotic process automation into com-
mercial success – case opuscapita”, Journal of Information Technology Teach-
ing Cases, vol. 6, pp. 67–74, 2016.
[12] A. Augusto, R. Conforti, M. Dumas, M. L. Rosa, F. M. Maggi, A. Marrella,
M. Mecella, and A. Soo, “Automated discovery of process models from event
logs: Review and benchmark”, IEEE Transactions on Knowledge and Data
Engineering, vol. 31, pp. 686–705, 2017.
[13] A. Augusto, M. Dumas, and M. L. Rosa, “Dataset for testing the Discovery
of Automatable Routines from User Interaction Logs”, Mar. 2019. doi:
10.6084/m9.figshare.7850918.v1. [Online]. Available: https://figshare.com/articles/Dataset_for_testing_the_Discovery_of_Automatable_Routines_from_User_Interaction_Logs/7850918.
[14] R. Beetz and Y. Riedl, “Robotic process automation: Developing a multi-
criteria evaluation model for the selection of automatable business processes”,
in AMCIS, 2019.
[15] A. Berti, S. J. van Zelst, and W. M. P. van der Aalst, “Pm4py web services:
Easy development, integration and deployment of process mining features in
any application stack”, in BPM, 2019.
[16] ——, “Process mining for python (pm4py): Bridging the gap between process-
and data science”, ArXiv, vol. abs/1905.06169, 2019.
[17] M. Boltenhagen, T. Chatain, and J. Carmona, “Generalized alignment-based
trace clustering of process behavior”, in Petri Nets, 2019.
[56] M. Lacity and L. Willcocks, “What knowledge workers stand to gain from
automation”, Harvard Business Review, vol. 19, no. 6, 2015.
[57] M. C. Lacity and L. P. Willcocks, “A new approach to automating services”,
MIT Sloan Management Review, 2017.
[58] M. Lacity and L. P. Willcocks, “Innovating in service: The role and manage-
ment of automation”, in Dynamic Innovation in Outsourcing, Springer, 2018,
pp. 269–325.
[59] ——, “Robotic process automation at telefónica o2”, MIS Quarterly Execu-
tive, vol. 15, 2015.
[60] G. T. Lakshmanan, S. Rozsnyai, and F. Wang, “Investigating clinical care
pathways correlated with outcomes”, in BPM, Berlin, Heidelberg: Springer
Berlin Heidelberg, 2013, pp. 323–338.
[61] S. Lazarus, “Achieving a successful robotic process automation implementation:
A case study of vodafone and celonis”, https://spendmatters.com/2018/06/07/achieving-a-successful-robotic-process-automation-implementation-a-case-study-of-vodafone-and-ce-lonis/, 2018.
[62] M. Leemans, W. M. P. van der Aalst, and M. van den Brand, “Recursion
aware modeling and discovery for hierarchical software event log analysis”,
2018 IEEE 25th International Conference on Software Analysis, Evolution
and Reengineering (SANER), pp. 185–196, 2018.
[63] S. J. J. Leemans, D. Fahland, and W. M. P. van der Aalst, “Discovering
block-structured process models from event logs - a constructive approach”,
in Petri Nets, 2013.
[64] ——, “Using life cycle information in process discovery”, in Business Process
Management Workshops, 2015.
[65] V. Leno, M. Dumas, F. Maggi, and M. La Rosa, “Multi-perspective process
model discovery for robotic process automation”, in CEUR Workshop Pro-
ceedings, vol. 2114, 2018, pp. 37–45.
[66] V. Leno, M. Dumas, M. L. Rosa, F. M. Maggi, and A. Polyvyanyy, “Au-
tomated discovery of data transformations for robotic process automation”,
ArXiv, vol. abs/2001.01007, 2020.
[67] V. Leno, A. Polyvyanyy, M. L. Rosa, M. Dumas, and F. M. Maggi, “Action
logger: Enabling process mining for robotic process automation”, in BPM,
2019.
[68] M. de Leoni, W. M. P. van der Aalst, and M. Dees, “A general process mining
framework for correlating, predicting and clustering dynamic behavior based
on event logs”, Inf. Syst., vol. 56, pp. 235–257, 2016.
[69] H. Leopold, H. van der Aa, and H. A. Reijers, “Identifying candidate tasks
for robotic process automation in textual process descriptions”, in BPMDS /
EMMSAD@CAiSE, 2018.
[70] C. Li, M. Reichert, and A. Wombacher, “Mining process variants: Goals and
issues”, 2008 IEEE International Conference on Services Computing, vol. 2,
pp. 573–576, 2008.
[71] J. Lindström, P. Kyösti, and J. Delsing, European roadmap for industrial
process automation, 2018.
[72] P. Liu, D. Zhou, and N. Wu, “VDBSCAN: Varied density based spatial clus-
tering of applications with noise”, 2007 International Conference on Service
Systems and Service Management, pp. 1–4, 2007.
Indicators & Plots to support the identification of tasks and/or process properties
that are eligible to be automated using RPA.
Parameters
-----------
log
    Trace log
parameters
    Parameters of the log representation algorithm:
    str_ev_attr -> string event attributes to consider in the feature representation
                   (a single one is necessary)
    str_tr_attr -> string trace attributes to consider in the feature representation
                   (a single additional one is optional)
Returns
-----------
indicator plots
indicator
    A list containing, per activity, all indicators in each sublog
Folder structure
-----------
-home
--code: where this code is found
--data: where the XES files are found
--plots: where the resulting plots are saved
The code can be run from the command line, having navigated to the code folder, by executing:
`python pm4RPA.py financial_log.xes concept:name AMOUNT_REQ`
'''
def clusterlog(log, clusterParams):
    print("Apply PCA + DBSCAN clustering from log representation obtained using ", clusterParams)
    random.seed(42)
    clusters = clusterer.apply(log, parameters=clusterParams)
    for i in range(len(clusters)):
        print("sublog", i, " has ", len(clusters[i]), " traces")
    return clusters
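The clustering step above delegates to pm4py's clusterer module, so the PCA + DBSCAN pipeline itself stays hidden. The following standalone sketch illustrates the same idea with scikit-learn on a made-up one-hot trace feature matrix (both the library choice and the data are assumptions for illustration, not taken from the listing): project the feature vectors to a low-dimensional space and let DBSCAN group the traces into sublogs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Hypothetical one-hot feature matrix: one row per trace, one column per
# event/trace attribute value, as a log-representation step might produce.
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
])

# Reduce dimensionality first, then group the traces by density.
reduced = PCA(n_components=2).fit_transform(X)
labels = DBSCAN(eps=1.2, min_samples=2).fit_predict(reduced)

# Each DBSCAN cluster (label != -1 means not noise) becomes one sublog,
# here represented by the indices of its traces.
sublogs = [np.where(labels == c)[0] for c in sorted(set(labels)) if c != -1]
```

With this toy data, the first three and last three traces end up in separate sublogs, mirroring how `clusterlog` splits the event log into behaviorally homogeneous sublogs before computing indicators.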
def priorFollowVar(clusters):
    # Create an empty DataFrame with column names only
    PFv = pd.DataFrame(columns=['cluster', 'activity', 'PFv'])
    # paths4act_df (the directly-follows pairs of the current cluster), the
    # activityEndSum/activityEndMax aggregates and the symmetric activityStart
    # aggregates are computed in a part of the listing not shown in this excerpt
    activityEndMax = activityEndMax.rename(columns={'count': 'max'})
    activityEndNdist = paths4act_df[(paths4act_df['activityEnd'] != paths4act_df['activityStart'])]
    activityEndNdist = activityEndNdist.groupby(by=['activityEnd']).size().reset_index(name='counts')
    activityEndNdist = activityEndNdist.rename(columns={'counts': 'ndist'})
    activityEnddf = pd.merge(left=activityEndNdist, right=activityEndSum, on=['activityEnd'], how='inner')
    activityEnddf = pd.merge(left=activityEnddf, right=activityEndMax, on=['activityEnd'], how='inner')
    activityEnddf['PFvend'] = activityEnddf.apply(lambda x: (1.0 / x['ndist']) * (x['max'] / x['sum']), axis=1)
    # combine activityStartdf and activityEnddf:
    # outer join, since some activities may only be starting ones or ending ones
    PFv_df = pd.merge(left=activityStartdf.drop(['ndist', 'sum', 'max'], axis=1),
                      right=activityEnddf.drop(['ndist', 'sum', 'max'], axis=1),
                      right_on=['activityEnd'], left_on=['activityStart'], how='outer')
    PFv_df = PFv_df.fillna(0)
    PFv_df['PFv'] = PFv_df.apply(lambda x: x['PFvstart'] + x['PFvend'], axis=1)
    PFv_df = PFv_df.drop(['activityEnd', 'PFvstart', 'PFvend'], axis=1)
    PFv_df = PFv_df.rename(columns={'activityStart': 'activity'})
    PFv_df['cluster'] = clusteri
    # append the per-cluster result to the overall result DataFrame
    PFv = pd.concat([PFv, PFv_df], sort=False)
    return PFv
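To make the prior-follow computation concrete, here is a minimal self-contained sketch of the end-side score on a toy directly-follows table. The table values are invented for illustration; only the formula (1/ndist) * (max/sum), taken from `priorFollowVar`, is from the listing.

```python
import pandas as pd

# Toy directly-follows table: one row per distinct (predecessor, activity)
# pair with its observed frequency, mirroring paths4act_df in the listing.
pairs = pd.DataFrame({
    'activityStart': ['A', 'B', 'A'],
    'activityEnd':   ['B', 'C', 'C'],
    'count':         [10, 6, 4],
})

# For every target activity: number of distinct predecessors (ndist),
# total incoming frequency (total) and the most frequent predecessor (top).
stats = (pairs[pairs['activityStart'] != pairs['activityEnd']]
         .groupby('activityEnd')['count']
         .agg(ndist='size', total='sum', top='max')
         .reset_index())

# End-side prior-follow score: (1/ndist) * (max/sum).
# It equals 1.0 when one predecessor fully dominates (as for B below) and
# shrinks as incoming behavior spreads over more predecessors (as for C).
stats['PFvend'] = (1.0 / stats['ndist']) * (stats['top'] / stats['total'])
```

Here B is always preceded by A, so its score is 1.0, while C has two predecessors with a 6/4 split, giving (1/2) * (6/10) = 0.3.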
def generatePlots(indicators):
    plotTitles = ['Execution Frequency', 'Execution Frequency: case', 'Execution Time',
                  'Inverse Stability', 'Prior Follow Variability']
    plotFilenames = ['ExecutionFrequency', 'ExecutionFreqCase', 'ExecutionTime',
                     'InverseStability', 'PriorFollowVar']
    ############## plot ############
    for i, j in zip(range(2, indicators.shape[1]), range(len(plotTitles))):
        plt.style.use('seaborn-whitegrid')
        plt.figure()
        xindex = np.arange(len(indicators))
        colname = indicators.columns[i]
        indicators.sort_values(by=[colname], inplace=True)
        plt.bar(xindex.astype('U'), indicators[colname], color='indigo', width=1)
        plt.xticks(np.arange(0, len(indicators), 50))
        plt.title(plotTitles[j])
        plt.xlabel('Activities')
        plt.savefig('plots/' + plotFilenames[j] + '.png')
def main():
    ## input parameter data
    filexes = sys.argv[1:][0]
    ############## import ############
    path = ".."
    os.chdir(path)
    log = xes_import_factory.apply(os.path.join(os.getcwd(), "data", filexes))
    print("Event log loaded with number of traces:", len(log))
    ############## trace clustering ############
    # count the arguments excluding the filename; the first one is necessary
    activityKey = sys.argv[1:][1]
    if (len(sys.argv) - 2) > 1:
        # an optional string trace attribute was supplied
        # (the remainder of main is not part of this excerpt)
        ...

if __name__ == "__main__":
    main()
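As a design note, `main` reads its positional arguments directly from `sys.argv`. The same command-line interface could be expressed with the standard-library argparse module, which makes the optional trace attribute explicit and gives usage messages for free. This is a sketch of that alternative, not part of the thesis code; `parse_args` is a hypothetical helper.

```python
import argparse

def parse_args(argv=None):
    # Same CLI as `python pm4RPA.py financial_log.xes concept:name AMOUNT_REQ`,
    # expressed with argparse instead of raw sys.argv indexing.
    parser = argparse.ArgumentParser(
        description="Compute RPA candidate indicators from a XES event log")
    parser.add_argument("filexes",
                        help="XES event log file located in the data/ folder")
    parser.add_argument("activity_key",
                        help="event attribute used as activity name, e.g. concept:name")
    parser.add_argument("trace_attr", nargs="?", default=None,
                        help="optional string trace attribute, e.g. AMOUNT_REQ")
    return parser.parse_args(argv)
```

Passing `argv=None` makes `parse_args` fall back to `sys.argv[1:]`, while an explicit list keeps the function easy to test.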