FastFlow: Efficient Parallel Streaming
Applications on Multi-core
Marco Aldinucci∗ Massimo Torquati
Massimiliano Meneghin
September 2, 2009
Abstract
Shared memory multiprocessors have returned to popularity thanks to the rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a challenge of today. Few efforts have been made to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them on a set of micro-benchmarks and on a real-world application; the speedup edge of FastFlow over the other solutions can be substantial for fine-grain tasks, for example +35% over OpenMP, +226% over Cilk, and +96% over TBB for the alignment of protein P01111 against the UniProt DB using the Smith-Waterman algorithm.
1 Introduction
The recent trend to increase core count in commodity processors has led to
a renewed interest in the design of both methodologies and mechanisms for
the effective parallel programming of shared memory computer architectures.
Those methodologies are largely based on traditional approaches to parallel programming.
Typically, low-level approaches provide programmers only with primitives for flow-of-control management (creation, destruction), their synchronisation, and data sharing, which is usually accomplished through critical regions accessed in mutual exclusion (mutex). The POSIX thread library, for example, can be used for this purpose. Programming complex parallel applications in this
∗ Computer Science Department, University of Torino, Italy. Email: [email protected]
way is certainly hard; tuning them for performance is often even harder, due to the non-trivial effects induced by memory fences (used to implement mutexes) on data replicated in the cores' caches.
Indeed, memory fences are one of the key sources of performance degradation in communication-intensive (e.g. streaming) parallel applications. Avoiding memory fences means not only avoiding locks but also avoiding any kind of atomic operation in memory (e.g. Compare-And-Swap, Fetch-and-Add). While several established fence-free solutions exist for asynchronous symmetric communications, these results cannot be easily extended to asynchronous asymmetric communications, which are necessary to support arbitrary streaming networks.
A first way to ease the programmer's task and improve program efficiency consists in raising the level of abstraction of concurrency management primitives. As an example, threads might be abstracted out into higher-level entities that can be pooled and scheduled in user space, possibly according to specific strategies that minimise cache flushing or maximise the load balancing of cores. Synchronisation primitives can also be abstracted out and associated with semantically meaningful points of the code, such as function calls and returns, loops, etc. Intel Threading Building Blocks (TBB) [25], OpenMP [33], and Cilk [16] all provide this kind of abstraction (even if each of them in its own way).
This kind of abstraction significantly simplifies the hand-coding of applications, but it is still too low-level to effectively automatise the optimisation of the parallel code: the major weakness lies in the lack of information concerning the intent of the code (idiom recognition [35]); inter-procedural/component optimisation further exacerbates the problem. The generative approach focuses on synthesising implementations from higher-level specifications rather than transforming them. In this approach, the programmer's intent is captured by the specification. In addition, technologies for code generation are well developed (staging, partial evaluation, automatic programming, generative programming). Both TBB and OpenMP follow this approach. The programmer is required to explicitly define parallel behaviour by using proper constructs [5], which clearly delimit the interactions among flows of control, the read-only data, the associativity of accumulation operations, and the concurrent accesses to shared data structures.
However, the above-mentioned programming frameworks for multi-core architectures are not specifically designed to support streaming applications. The only pattern that fits this usage is TBB's pipeline construct, which can be used to describe only a linear chain of filters; none of them natively supports any kind of task farming on stream items (despite it being a quite common pattern).
The objective of this paper is threefold:
• To introduce FastFlow, a low-level programming framework based on lock-free queues that supports the development of efficient streaming applications on cache-coherent multi-core architectures, i.e. any streaming network, including cyclic graphs of threads.
• To study the implementation of the farm streaming network using Fast-
Flow and the most popular programming frameworks for multi-core ar-
chitectures (i.e. TBB, OpenMP, Cilk).
• To show that the FastFlow farm is generally faster than the other solutions on both a synthetic micro-benchmark and a real-world application, i.e. the Smith-Waterman local sequence alignment algorithm (SW). This latter comparison will be performed using the same "sequential" code in all implementations, i.e. the x86/SSE2 vectorised code derived from Farrar's high-performance implementation [22]. We will also show that the FastFlow implementation is faster than the state-of-the-art, hand-tuned parallel version of Farrar's code (SWPS3 [23]).
2 Related Work

The stream programming paradigm offers a promising approach for programming multi-core systems. Stream languages are motivated by the application style used in image processing, networking, and other media processing domains. Several languages and libraries are available for programming stream applications, but many of them are oriented towards coarse-grain computations. Examples are StreamIt [41], Brook [15], and CUDA [27]. Some other languages, such as TBB, provide explicit mechanisms for both streaming and other parallel paradigms, while others, such as OpenMP [33] and Cilk, mainly offer mechanisms for Data Parallel and Divide&Conquer computations. These mechanisms can also be exploited to implement streaming applications, as we shall show in Sec. 3, but this requires a greater programming effort with respect to the other cited languages.
StreamIt is an explicitly parallel programming language based on the Synchronous Data Flow (SDF) programming model. A StreamIt program is represented as a set of autonomous actors that communicate through first-in first-out (FIFO) data channels. StreamIt contains syntactic constructs for defining programs structured as task graphs, where each task contains Java-like sequential code. The interconnection types provided are: Pipeline for straight task combinations, SplitJoin for nested data parallelism, and FeedbackLoop for connections from consumers back to producers. The communications are implemented either as shared circular buffers or as message passing for small amounts of control information.
Brook [15] provides extensions to the C language with single program multiple data (SPMD) operations that work on streams. User-defined functions operating on stream elements are called kernels and can be executed in parallel. Brook kernels feature a blocking behaviour: the execution of a kernel must complete before the next kernel can execute. This is the same execution model that is available on graphics processing units (GPUs), which are indeed the main target of this programming framework. CUDA [27], an infrastructure from NVIDIA, can be enumerated in the same class. In addition, CUDA programmers are required to use low-level mechanisms to explicitly manage the various levels of the memory hierarchy.
Streaming applications are also targeted by TBB [25] through the pipeline construct. FastFlow is methodologically similar to TBB in its intent, since it aims to provide a library of explicitly parallel constructs (a.k.a. parallel programming paradigms or skeletons) that extends the base language (e.g. C, C++, Java). However, TBB does not support any kind of non-linear streaming network, which therefore has to be embedded in a pipeline. This has non-trivial programming and performance drawbacks, since pipeline stages must bypass data items they are not interested in.
OpenMP [33] and Cilk [14] are two other very popular thread-based frameworks for multi-core architectures (more in-depth descriptions are given in Sections 3.2 and 3.3). OpenMP and Cilk mostly target the Data Parallel and Divide&Conquer programming paradigms, respectively. OpenMP, for example, has only recently been extended (version 3.0) with a task construct to manage the execution of a set of independent tasks. The fact that the two languages do not provide first-class mechanisms for streaming applications is reflected in the fact that they perform well only with coarse- and medium-grained computations, as we show in Sec. 4.
At the level of communication and synchronisation mechanisms, Giacomoni et al. [24] highlight that traditional locking queues exhibit a high overhead on today's multi-cores. Revisiting Lamport's work [29], which proves the correctness of wait-free mechanisms for concurrent Single-Producer-Single-Consumer (SPSC) queues on systems with a sequentially consistent memory model, they propose a set of wait-free and cache-optimised protocols. They also prove the performance benefit of those mechanisms for pipeline applications on top of today's multi-core architectures. Wait-free protocols are a subclass of lock-free protocols exhibiting even stronger properties: roughly speaking, lock-free protocols are based on retries, while wait-free protocols guarantee termination in a finite number of steps.
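To make the mechanism concrete, the following is a minimal sketch of a Lamport-style wait-free SPSC queue of the kind discussed above (the class name SpscQueue and the bounded circular-buffer design are illustrative, not FastFlow's actual implementation, which is cache-optimised and unbounded); it uses C++11 atomics for portability, whereas on x86 the total-store-order memory model already provides the required ordering for plain loads and stores.

#include <atomic>
#include <cstddef>
#include <vector>

// Minimal wait-free Single-Producer-Single-Consumer ring buffer (sketch).
template <typename T>
class SpscQueue {
    std::vector<T> buf;              // fixed-size circular buffer
    std::atomic<size_t> head{0};     // next slot to read  (written by consumer only)
    std::atomic<size_t> tail{0};     // next slot to write (written by producer only)
public:
    explicit SpscQueue(size_t capacity) : buf(capacity + 1) {}

    // Producer side only: never blocks, never locks.
    bool push(const T& item) {
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % buf.size();
        if (next == head.load(std::memory_order_acquire))
            return false;            // queue full, caller may retry
        buf[t] = item;
        tail.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side only: never blocks, never locks.
    bool pop(T& item) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;            // queue empty
        item = buf[h];
        head.store((h + 1) % buf.size(), std::memory_order_release);
        return true;
    }
};

Since only one thread ever writes tail and only one ever writes head, no Compare-And-Swap or lock is needed: each operation completes in a bounded number of steps, which is exactly the wait-free property discussed above.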
Along with SPSC queues, MPMC (Multiple-Producer-Multiple-Consumer) queues are also required to provide complete support for streaming networks. This kind of data structure represents a more general problem than the SPSC one, and various solutions have been presented in the literature [28, 31, 36, 42]. Thanks to the structure of streaming applications, we avoid the problem of managing MPMC queues directly: we exploit multiple SPSC queues to implement MPSC, SPMC and MPMC ones. By exploiting a wait-free SPSC queue also to implement more complex shared queues, FastFlow widely extends the work of Giacomoni et al., from simple pipelines to arbitrary streaming networks. We show the effective benefits of our approach with respect to the other languages (TBB, OpenMP and Cilk).
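A minimal sketch of this composition idea follows (the names and the plain round-robin policy are illustrative assumptions, not FastFlow's actual scheduling): an SPMC channel is obtained from one SPSC queue per consumer plus a dispatching thread, so that every individual queue preserves the single-producer/single-consumer discipline; an MPSC channel is obtained symmetrically by letting a single consumer poll one SPSC queue per producer. The sketch reuses the SpscQueue class shown above.

#include <cstddef>
#include <vector>
// Assumes the SpscQueue<T> class sketched above (wait-free push/pop).

// Single-Producer-Multiple-Consumer channel built out of N SPSC queues:
// the producing thread owns all queues and dispatches each task to one
// consumer; each consumer pops only from its own queue, so every
// individual queue keeps the fence-free SPSC property.
template <typename T>
class SpmcChannel {
    std::vector<SpscQueue<T>*> queues;   // one SPSC queue per consumer
    size_t next = 0;                     // round-robin scheduling state
public:
    SpmcChannel(size_t consumers, size_t capacity) {
        for (size_t i = 0; i < consumers; ++i)
            queues.push_back(new SpscQueue<T>(capacity));
    }
    ~SpmcChannel() { for (auto* q : queues) delete q; }

    // Producer side: schedule the task to the next worker (round-robin).
    bool push(const T& item) {
        bool ok = queues[next]->push(item);
        if (ok) next = (next + 1) % queues.size();
        return ok;
    }
    // Consumer side: worker 'id' pops only from its private queue.
    bool pop(size_t id, T& item) { return queues[id]->pop(item); }
};

In a farm such as the one of Fig. 1, the dispatching and polling roles are played by the Emitter and Collector threads, respectively.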
3 Stream Parallel Paradigm: the Farm Case
Traditionally, parallelism is categorised in three main classes: Data Parallelism, Task Parallelism, and Stream Parallelism. Among the Stream Parallel skeletons, the farm paradigm consists of an Emitter that dispatches stream items towards a pool of Workers and an optional Collector that gathers the results (see Fig. 1).
The farm can be declined in many variants, for example with stateless workers or stateful workers (local or shared state, read-only or read/write), etc. The farm skeleton is quite useful since it can be exploited in many streaming applications. In particular, it can be used in any pipeline to boost the service time of slow stages, and thus to boost the whole pipeline [4].
As mentioned in the previous section, while several programming frameworks for multi-core offer Data Parallel and Task Parallel skeletons, only a few of them offer Stream Parallel skeletons (such as TBB's pipeline), and none of them offers the farm. In the following we study the implementation of the farm for multi-core architectures. In Sec. 3.1 we introduce a very efficient implementation of the farm construct in FastFlow, and we propose implementations of the same construct using other well-known frameworks, namely OpenMP, Cilk, and TBB. The performances are compared in Sec. 4.
[Figure 1: Task farm and order-preserving task farm schemas: the Emitter (E), Workers (W1..Wn) and Collector (C) are connected through lock-free SPSC, SPMC and MPSC queues; in the order-preserving variant, scheduling tags are used to collect the results in order.]
written: in many cases this can be derived from the semantics of the skeleton that has been implemented using MPMC queues (as an example, this is guaranteed in a stateless farm and in many other cases).
When using dynamically allocated memory, the memory allocator plays an important role in terms of performance. Dynamic memory allocators (malloc/free) rely on mutual exclusion locks to protect the consistency of their shared data structures under multi-threading. Therefore, the use of the memory allocator may subtly reintroduce locks into a lock-free application. For this reason, we decided to use our own custom memory allocator, which has been specifically optimised for the SPMC pattern. The basic assumption is that, in a streaming application, typically one thread allocates memory and one or many other threads free it. This assumption makes it possible to develop a multi-threaded memory allocator that uses SPSC channels between the allocating thread and the generic thread that performs the free, avoiding costly lock-based protocols for maintaining the memory consistency of the internal structures. Notice, however, that the FastFlow allocator is not a general-purpose allocator and it currently exhibits several limitations, such as a sub-optimal space usage. The further development of the FastFlow allocator is among our future works.
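The following is a minimal sketch of the idea behind such an allocator (class and method names are illustrative; this is not the FastFlow allocator, which is considerably more sophisticated): the allocating thread owns a private free list and refills it by draining per-consumer SPSC "return" channels, so a block freed by a consumer travels back to its allocating thread without any lock being taken. It again assumes the SpscQueue class sketched in Sec. 2.

#include <cstddef>
#include <cstdlib>
#include <vector>
// Assumes the SpscQueue<T> class sketched in Sec. 2.

// Illustrative streaming allocator: one allocating thread, many freeing
// threads. Freed blocks are sent back over per-thread SPSC channels and
// recycled by the allocator, so no lock protects the free list.
class StreamAllocator {
    std::vector<void*> freelist;              // private to the allocating thread
    std::vector<SpscQueue<void*>*> returns;   // one return channel per consumer
    size_t blockSize;
public:
    StreamAllocator(size_t consumers, size_t blockSize, size_t capacity)
        : blockSize(blockSize) {
        for (size_t i = 0; i < consumers; ++i)
            returns.push_back(new SpscQueue<void*>(capacity));
    }

    // Allocating thread only: recycle returned blocks, else fall back to malloc.
    void* allocate() {
        void* p;
        for (auto* q : returns)               // drain the return channels
            while (q->pop(p)) freelist.push_back(p);
        if (!freelist.empty()) { p = freelist.back(); freelist.pop_back(); return p; }
        return std::malloc(blockSize);
    }

    // Consumer thread 'id' only: give the block back without taking locks.
    void deallocate(size_t id, void* p) {
        if (!returns[id]->push(p)) std::free(p);  // fall back if the channel is full
    }
};

Resource release at shutdown and more refined space management are omitted for brevity; they are among the aspects in which the real allocator differs from this sketch.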
3.1.1 Pseudo-code

The structure of the farm paradigm in FastFlow is sketched in Fig. 2. The ff_TaskFarm is a C++ class interface that implements the parallel farm construct, composed of an Emitter and an optional Collector (see also Fig. 1). The number of workers has to be fixed at farm object creation time.
class Emitter: public ff::ff_node {
public:
    // svc() is invoked repeatedly by the FastFlow run-time: each call
    // returns the next stream item; NULL signals the End-Of-Stream.
    void * svc(void *) {
        if (∃ newtask) {
            newtask = create_task();
            return newtask;
        }
        return NULL; // EOS
    }
};
farm.run();
}
Task ordering can be preserved either by adopting the same deterministic policy for both scheduling and collection, or by dynamically tracking scheduling choices and performing the collection accordingly. This latter solution, schematised in Fig. 1 (order-preserving task farm), is actually derived from the tagged-token macro data-flow architecture [10, 9, 38].
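A possible realisation of the dynamic-tracking alternative is sketched below (the code illustrates the general reordering technique, not FastFlow's tagged-token implementation): the Emitter tags each task with a sequence number, the workers propagate the tag with their result, and the Collector buffers out-of-order results until the next expected tag becomes available.

#include <cstddef>
#include <map>

// Illustrative reordering Collector: results may arrive from the workers
// in any order, each carrying the sequence tag assigned by the Emitter;
// they are released downstream strictly in tag order.
template <typename Result>
class OrderingCollector {
    std::map<size_t, Result> pending;   // out-of-order results, keyed by tag
    size_t nextTag = 0;                 // tag of the next result to emit
public:
    // Called by the Collector thread for every incoming (tag, result) pair.
    template <typename Emit>
    void collect(size_t tag, const Result& r, Emit emit) {
        pending[tag] = r;
        // Flush the longest in-order prefix that is now available.
        for (auto it = pending.find(nextTag); it != pending.end();
             it = pending.find(nextTag)) {
            emit(it->second);
            pending.erase(it);
            ++nextTag;
        }
    }
};

In this sketch the memory held by the pending map is bounded by the number of tasks in flight, i.e. by the total capacity of the farm's queues.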
int main (int argc, char *argv[])
{
    void *newtask = NULL;                 /* current stream item */
#pragma omp parallel private(newtask)
    {
        /* EMITTER: executed by a single thread, produces the task stream */
#pragma omp single nowait
        {
            while(∃ newtask) {
                newtask = create_task();
                /* WORKER: the task is scheduled by the OpenMP run-time */
#pragma omp task untied
                {
                    compute_task(newtask);
                    /* COLLECTOR: workers cooperatively output the results */
#pragma omp critical
                    {
                        collect_task(newtask);
                    }
                }
            }
        }
    }
    return 0;
}
3.2.1 Pseudo-code

OpenMP does not natively include a farm skeleton, which therefore has to be realised using lower-level features, such as the task construct. Our OpenMP farm schema is shown in Fig. 3. The schema is quite simple; a single construct is exploited to express the Emitter behaviour. The new independent tasks, defined by the Emitter, are marked with the task directive in order to leave the scheduling of their computation to the OpenMP run-time support.

The Collector is implemented in a different way. Instead of implementing it with another single construct (as done for the Emitter), and therefore introducing an explicit locking mechanism for the synchronisation between workers and Collector, we realise the Collector functionality by means of the workers' cooperative behaviour: they simply output tasks from within an OpenMP critical section. This mechanism enables us to output tasks from the stream without introducing any global synchronisation (barrier).
consistency [13], which is a quite relaxed consistency model. Cilk threads synchronise according to DAG consistency at join points (the sync construct) and, optionally, atomically execute a sort of call-back function (the inlet procedure).

Cilk lock variables are provided to define atomic chunks of code, enabling programmers to address synchronisation patterns that cannot be expressed using DAG consistency. As a matter of fact, Cilk lock variables represent an escape hatch in a programming model that has been designed to avoid critical regions.
3.3.1 Pseudo-code

Our reference code structure for a farm implemented in Cilk is shown in Fig. 4. A thread is spawned at the beginning of the program to implement the Emitter behaviour and remains active until the end of the computation. The Emitter thread defines new tasks and spawns new threads for their computation.

To avoid explicit lock mechanisms we exploit the inlet construct. Ordinarily, a spawned Cilk thread can return its results only to the parent thread, by putting those results in a variable in the parent's frame. The alternative is to exploit an inlet, which is a function internal to a Cilk procedure that handles the results of a spawned thread call as it returns. One major reason to use inlets is that all the inlets of a procedure are guaranteed to operate atomically with regard to each other and to the parent procedure, thus avoiding the race conditions that could arise when multiple returning threads try to update the same variables in the parent frame.

The inlet, which can be compared to an OpenMP critical section, can be easily exploited to implement the Collector behaviour, as presented in the definition of the emitter function in Fig. 4.

Because inlets have the limitation that they can only be called from the Cilk procedure that hosts them, our emitter procedure and our worker procedure have to coincide in order to use the inlet. We differentiate the two behaviours by exploiting a tag parameter and switching on its value.
cilk int * emitter(int * newtask, int tag) {
    /* COLLECTOR: inlets of a procedure run atomically with respect to
       each other and to the parent procedure */
    inlet void collector(int * newtask) {
        collect_task(newtask);
    }
    switch(tag) {
    case WORKER: {       /* spawned activations: compute one task each */
        compute_task(newtask);
    } break;
    case EMITTER: {      /* first activation: produce the task stream  */
        while(∃ newtask) {
            newtask = create_task();
            collector(spawn emitter(newtask, WORKER));
        }
    } break;
    default: ;
    }
    return newtask;
}
3.4.1 Pseudo-code

The structure of the farm paradigm using the TBB library is sketched in Fig. 5. The implementation is based on the pipeline construct. The pipeline is composed of three stages: Emitter, Worker, and Collector. The corresponding three objects are registered with the pipeline object in order to instantiate the correct communication network. The Emitter stage produces a pointer to an array of basic tasks, referred to as Task in the pseudo-code, each one of length PIPE_GRAIN (for our experiments we set PIPE_GRAIN to 1024). The Worker stage is actually a filter that executes a parallel_for over the input tasks. The parallel_for is executed using the auto_partitioner algorithm provided by the TBB library; this way, the correct splitting of the Task array into chunks of basic tasks, which are assigned to the executor threads, is left to the run-time support.
class Emitter: public tbb::filter {
    int grain;                                   // tasks per pipeline item (PIPE_GRAIN)
public:
    Emitter(const int grain):
        tbb::filter(tbb::filter::serial_in_order), grain(grain) {}
    void * operator()(void*) {
        if (∃ newtask) {                         // more input available?
            task_t ** Task = new task_t*[grain]; // pack PIPE_GRAIN basic tasks
            for(int i = 0; i < grain; ++i)
                Task[i] = create_task();
            return Task;
        }
        return NULL; // EOS
    }
};

class Compute {
    task_t ** Task;
public:
    Compute(task_t ** Task): Task(Task) {}
    // parallel_for body: each chunk of the range computes its basic tasks
    void operator() (const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i < r.end(); ++i)
            compute_task(Task[i]);
    }
};
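For completeness, the Worker stage described in the text, which is not shown in Fig. 5, could look like the following sketch (the class name Worker and the grain field are assumptions consistent with the Emitter/Compute pseudo-code above; the required headers are tbb/pipeline.h, tbb/parallel_for.h, tbb/blocked_range.h and tbb/partitioner.h): it is a parallel filter that runs a parallel_for over the whole Task array and delegates the chunking to TBB's auto_partitioner.

// Sketch of the Worker filter described in the text (assumed names):
// a parallel pipeline stage applying Compute to the whole Task array.
class Worker: public tbb::filter {
    int grain;                                   // length of each Task array
public:
    Worker(const int grain): tbb::filter(tbb::filter::parallel), grain(grain) {}
    void * operator()(void * item) {
        task_t ** Task = static_cast<task_t **>(item);
        tbb::parallel_for(tbb::blocked_range<int>(0, grain),
                          Compute(Task),
                          tbb::auto_partitioner());
        return item;                             // forward to the Collector stage
    }
};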
4 Experiments

All experiments have been executed on a shared-memory platform with 2 quad-core Xeon E5420 Harpertown @2.5GHz, with 6MB L2 cache and 8 GBytes of main memory.
[Figure 6: Speedup of FastFlow, TBB, OpenMP and Cilk against the ideal speedup on the synthetic micro-benchmark, for fine-grain (Tc = 0.5 µs), medium-grain (Tc = 5 µs) and coarse-grain (Tc = 50 µs) tasks, varying the number of cores from 1 to 8.]
Table 1: Stream task times (Min/Max/Avg, in µs) for different query sequences.

Query     Query len   Min (µs)   Max (µs)    Avg (µs)
P02232        144       0.333      2264.9       25.0
P10635        497       0.573     15257.6      108.0
P27895       1000       0.645     16011.9      197.0
P04775       2005       0.690     21837.1      375.0
P04775       5478       3.891    117725.0      938.5
computing their optimal local alignments using the Smith-Waterman (SW) algorithm [44]. SW is a dynamic programming algorithm that is guaranteed to find the optimal local alignment with respect to the scoring system being used. Instead of looking at the total sequence, it compares segments of all possible lengths and optimises the similarity measure. This approach is expensive in terms of computing time and memory space, which is exacerbated by the rapid growth of biological sequence databases (the UniProtKB/Swiss-Prot database Release 57.5 of 07-Jul-09 contains 471,472 sequence entries, comprising 167,326,533 amino acids) [43].
The recent emergence of multi- and many-core architectures provides the opportunity to significantly reduce the computation time of many costly algorithms like Smith-Waterman. Recent works in this area focus on the implementation of the SW algorithm on many-core architectures like GPUs [30] and the Cell/BE [22], and on multi-core architectures exploiting the SSE2 instruction set [23, 39]. Among these implementations, we selected SWPS3 [40], an optimised extension of Farrar's striped Smith-Waterman work [23] for the Cell/BE and for x86/64 CPUs with SSE2 instructions. The original SWPS3 version is designed as a master-worker computation in which the master process distributes the workload to a set of worker processes. The master process reads the query sequence, initialises the data structures needed for the SSE2 computation, and then forks all the worker processes so that each worker has its own copy of the data. All the sequences in the reference database are read and sent to the worker processes over POSIX pipes. Each worker computes the alignment score of the query against the database sequence provided by the master process, and sends the resulting score back over a pipe.
The computational time is sensitive to the query length used for the matching, the scoring matrix (in our case BLOSUM50) and the gap penalty. As can be seen from Table 1, very short sequences require a much smaller service time than the longest ones. Notice the high variance of the task service times reported in the table; this is due to the very different lengths of the subject sequences in the reference database (the average sequence length in UniProtKB/Swiss-Prot is 352 amino acids; the shortest sequence comprises 2 amino acids, whereas the longest one comprises 35,213). Furthermore, the higher the gap open and gap extension penalties, the fewer iterations are needed for the calculation of a single cell of the similarity score matrix. In our tests we used the scoring matrix BLOSUM50 with two gap penalty ranges: 10-2k and 5-2k.

[Figure 7: GCUPS achieved by FastFlow, OpenMP, Cilk, TBB and SWPS3 for each of the 19 query sequences (proteins), with the 5-2k and 10-2k gap penalties.]
We rewrote the original SWPS3 code in OpenMP, Cilk, TBB and FastFlow following the schemata presented before. In doing this, we did not modify the sequential code at all, in order to achieve a fair comparison. For performance reasons, it is important to provide each worker thread with its own copy of the data structures needed for the SSE2 computation. This is a critical aspect especially for the Cilk and TBB implementations, which do not natively support any kind of Thread-Specific Storage (TSS). Notwithstanding that this data is read-only, the third-party SSE code somehow seems to trigger cache invalidations when accessing it, which seriously affects performance. To overcome this problem we exploit a workaround: we use TSS at a level lower than the programming model. In OpenMP this is not a problem, because the worker thread can be identified with the library call omp_get_thread_num(). The same possibility to identify a thread is offered by the FastFlow framework, as each parallel entity is mapped onto one thread.
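As an illustration of this lower-level TSS workaround, the sketch below uses POSIX thread-specific data to give each worker a private copy of the SSE2 data structures (the sse2_ctx_t type and the alloc_sse2_ctx() helper are hypothetical placeholders; the assumption is that the Cilk/TBB workers are ordinary POSIX threads underneath): the copy is created lazily on first use and retrieved with no locking on the hot path.

#include <cstdlib>
#include <pthread.h>

// Hypothetical per-worker SSE2 working buffers (placeholder names).
struct sse2_ctx_t { /* striped query profile, score buffers, ... */ };
extern sse2_ctx_t *alloc_sse2_ctx();      // assumed to malloc() its result

static pthread_key_t  ctx_key;
static pthread_once_t ctx_once = PTHREAD_ONCE_INIT;

static void ctx_destroy(void *p) { std::free(p); }      // runs at thread exit
static void ctx_key_create()     { pthread_key_create(&ctx_key, ctx_destroy); }

// Called from the worker code (under Cilk or TBB): returns this thread's
// private SSE2 context, creating it lazily on first use.
static sse2_ctx_t *get_sse2_ctx() {
    pthread_once(&ctx_once, ctx_key_create);
    sse2_ctx_t *ctx = static_cast<sse2_ctx_t*>(pthread_getspecific(ctx_key));
    if (ctx == NULL) {
        ctx = alloc_sse2_ctx();
        pthread_setspecific(ctx_key, ctx);
    }
    return ctx;
}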
The Emitter entity reads the sequence database and produces a stream of pairs ⟨query sequence, subject sequence⟩. The query sequence remains the same for all the subject sequences contained in the database. The Worker entity computes the striped Smith-Waterman algorithm on the input pairs using the SSE2 instruction set. The Collector entity gets the resulting score and produces the output string containing the score and the sequence name.
To remove the dependency of the measure on the query sequences and the databases used for the tests, performance is expressed in Cell Updates Per Second (CUPS), a commonly used measure in bioinformatics. A CUPS represents the time for a complete computation of one cell of the similarity score matrix, including all memory operations. Given a query sequence Q and a database D, the GCUPS (billion Cell Updates Per Second) value is computed as
\[ \mathrm{GCUPS} = \frac{|Q| \cdot |D|}{T \cdot 10^{9}}, \]
where T is the total execution time in seconds. The performance of the different SW algorithm implementations has been benchmarked and analysed by searching 19 query sequences, with lengths ranging from 144 (the P02232 sequence) to 22,142 (the Q8WXI7 sequence), against the Swiss-Prot release 57.5 database. The tests have been carried out on a dual quad-core Intel Xeon @2.50GHz running the Linux OS (kernel 2.6.x).

Figure 7 reports the performance comparison between the FastFlow, OpenMP, Cilk, TBB and SWPS3 versions of the SW algorithm for x86/SSE2, executed on the test platform described above.
As can be seen from the figure, the FastFlow implementation outperforms the other implementations for short query sequences: the shorter the query sequences, the bigger the performance gain. This is mainly due to the lower overhead of the FastFlow communication channels with respect to the other implementations; short sequences require a smaller service time.

Cilk obtains lower performance than the original SWPS3 version on small sequences, while it performs very well on longer ones. OpenMP offers the best performance after FastFlow. Quite surprisingly, TBB does not obtain the same good speedup that it achieved on the micro-benchmark. The reasons are still not clear; further investigation is required to find the source of the overhead in the TBB version.
5 Conclusions

In this work we have introduced FastFlow, a low-level template library based on lock-free communication channels, explicitly designed to support low-overhead, high-throughput streaming applications on commodity cache-coherent multi-core architectures. We have shown that FastFlow can be directly used to implement complex streaming applications exhibiting cutting-edge performance on a commodity multi-core.

Also, we have demonstrated that FastFlow makes possible the efficient parallelisation of third-party legacy code, such as the x86/SSE vectorised Smith-Waterman code. In the short term, we envision FastFlow as the middleware tier of a "skeletal" high-level programming framework that will discipline the usage of efficient network patterns, possibly extending an existing programming framework (e.g. TBB) with stream-specific constructs. To this end, we have studied how a streaming farm can be realised using several state-of-the-art programming frameworks for multi-core, and we have experimentally demonstrated that the FastFlow farm is faster than the other farm implementations on both a synthetic benchmark and the Smith-Waterman application.

As expected, the performance edge of FastFlow over the other frameworks is most pronounced for fine-grained computations. This makes FastFlow suitable for implementing a fast macro data-flow executor (actually wrapping the order-preserving farm), and thus for achieving the automatic parallelisation of many classes of algorithms, including dynamic programming [6]. FastFlow will be released as an open source library.

A preliminary version of this work has been presented at the ParCo conference [8].
6 Acknowledgments
We thank Marco Danelutto, Marco Vanneschi and Peter Kilpatrick for the many
insightful discussions.
This work was partially funded by the project BioBITs (“Developing White
and Green Biotechnologies by Converging Platforms from Biology and Informa-
tion Technology towards Metagenomic”) of Regione Piemonte, by the project
FRINP of the “Fondazione della Cassa di Risparmio di Pisa”, and by the WG
Ercim CoreGrid topic “Advanced Programming Models”.
References
[1] Marco Aldinucci. eskimo: experimenting with skeletons in the shared ad-
dress model. Parallel Processing Letters, 13(3):449–460, September 2003.
[2] Marco Aldinucci, Sonia Campa, Pierpaolo Ciullo, Massimo Coppola, Sil-
via Magini, Paolo Pesciullesi, Laura Potiti, Roberto Ravazzolo, Massimo
Torquati, Marco Vanneschi, and Corrado Zoccolo. The implementation
of ASSIST, an environment for parallel and distributed programming. In
Proc. of 9th Intl Euro-Par 2003 Parallel Processing, volume 2790 of LNCS,
pages 712–721, Klagenfurt, Austria, August 2003. Springer.
[3] Marco Aldinucci, Massimo Coppola, Marco Danelutto, Marco Vanneschi,
and Corrado Zoccolo. ASSIST as a research framework for high-
performance grid programming environments. In Grid Computing: Soft-
ware environments and Tools, chapter 10, pages 230–256. Springer, January
2006.
[4] Marco Aldinucci and Marco Danelutto. Stream parallel skeleton opti-
mization. In Proc. of PDCS: Intl. Conference on Parallel and Distributed
Computing and Systems, pages 955–962, Cambridge, Massachusetts, USA,
November 1999. IASTED, ACTA press.
[5] Marco Aldinucci and Marco Danelutto. Skeleton based parallel program-
ming: functional and parallel semantic in a single shot. Computer Lan-
guages, Systems and Structures, 33(3-4):179–192, October 2007.
[6] Marco Aldinucci, Marco Danelutto, Jan Dünnweber, and Sergei Gorlatch.
Optimization techniques for skeletons on grid. In Grid Computing and
New Frontiers of High Performance Processing, volume 14 of Advances in
Parallel Computing, chapter 2, pages 255–273. Elsevier, October 2005.
[7] Marco Aldinucci, Marco Danelutto, and Peter Kilpatrick. Towards hierar-
chical management of autonomic components: a case study. In Proc. of Intl.
Euromicro PDP 2009: Parallel Distributed and network-based Processing,
pages 3–10, Weimar, Germany, February 2009. IEEE.
[8] Marco Aldinucci, Marco Danelutto, Massimiliano Meneghin, Peter Kil-
patrick, and Massimo Torquati. Fastflow: Fast macro data flow execu-
tion on multi-core. In Intl. Parallel Computing (PARCO), Lyon, France,
September 2009.
[9] Marco Aldinucci, Marco Danelutto, and Paolo Teti. An advanced environ-
ment supporting structured parallel programming in Java. Future Gener-
ation Computer Systems, 19(5):611–626, July 2003.
[10] K. Arvind and Rishiyur S. Nikhil. Executing a program on the mit tagged-
token dataflow architecture. IEEE Trans. Comput., 39(3):300–318, 1990.
[11] Eduard Ayguadé, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan
Lin, Federico Massaioli, Xavier Teruel, Priya Unnikrishnan, and Guansong
Zhang. The design of openmp tasks. IEEE Trans. Parallel Distrib. Syst.,
20(3):404–418, 2009.
[12] Holger Bischof, Sergei Gorlatch, and Roman Leshchinskiy. DatTel: A data-
parallel C++ template library. Parallel Processing Letters, 13(3):461–472,
2003.
[13] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leis-
erson, and Keith H. Randall. Dag-consistent distributed shared memory.
In Proc. of the 10th Intl. Parallel Processing Symposium, pages 132–141,
Honolulu, Hawaii, USA, April 1996.
[14] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E.
Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multi-
threaded runtime system. Journal of Parallel and Distributed Computing,
37(1):55–69, August 1996.
[15] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian,
Mike Houston, and Pat Hanrahan. Brook for gpus: stream computing on
graphics hardware. In ACM SIGGRAPH ’04 Papers, pages 777–786, New
York, NY, USA, 2004. ACM Press.
[16] Christopher Cole and Maurice Herlihy. Snapshots and software transac-
tional memory. Sci. Comput. Program., 58(3):310–324, 2005.
[17] Murray Cole. Algorithmic Skeletons: Structured Management of Parallel
Computations. Research Monographs in Parallel and Distributed Comput-
ing. Pitman, 1989.
[18] Murray Cole. Skeletal Parallelism home page, 2009 (last accessed). http://homepages.inf.ed.ac.uk/mic/Skeletons/.
[19] Marco Danelutto, Roberto Di Meglio, Salvatore Orlando, Susanna Pela-
gatti, and Marco Vanneschi. A methodology for the development and the
support of massively parallel programs. Future Generation Compututer
Systems, 8(1-3):205–220, 1992.
[20] J. Darlington, A. J. Field, P.G. Harrison, P. H. J. Kelly, D. W. N. Sharp,
R. L. While, and Q. Wu. Parallel programming using skeleton functions. In
Proc. of Parallel Architectures and Langauges Europe (PARLE’93), volume
694 of LNCS, pages 146–160, Munich, Germany, June 1993. Springer.
[21] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data pro-
cessing on large clusters. In Usenix OSDI ’04, pages 137–150, December
2004.
[22] Michael Farrar. Smith-Waterman for the cell broadband engine.
[23] Michael Farrar. Striped Smith-Waterman speeds database searches six
times over other simd implementations. Bioinformatics, 23(2):156–161,
2007.
[24] John Giacomoni, Tipp Moseley, and Manish Vachharajani. Fastforward for
efficient pipeline parallelism: a cache-optimized concurrent lock-free queue.
In Proc. of the 13th ACM SIGPLAN Symposium on Principles and practice
of parallel programming (PPoPP), pages 43–52, New York, NY, USA, 2008.
ACM.
[25] Intel Corp. Threading Building Blocks, 2009. http://www.threadingbuildingblocks.org/.
[26] Intel Corp. Intel Threading Building Blocks, July 2009 (last accessed).
http://software.intel.com/en-us/intel-tbb/.
[27] David Kirk. Nvidia cuda software and gpu parallel computing architecture.
In Proc. of the 6th Intl. symposium on Memory management (ISM), pages
103–104, New York, NY, USA, 2007. ACM.
[28] Edya Ladan-mozes and Nir Shavit. An optimistic approach to lock-free fifo
queues. In In Proc. of the 18th Intl. Symposium on Distributed Computing,
LNCS 3274, pages 117–131. Springer, 2004.
[29] Leslie Lamport. Specifying concurrent program modules. ACM Trans.
Program. Lang. Syst., 5(2):190–222, 1983.
[30] Yongchao Liu, Douglas Maskell, and Bertil Schmidt. CUDASW++: op-
timizing Smith-Waterman sequence database searches for CUDA-enabled
graphics processing units. BMC Research Notes, 2(1):73, 2009.
[31] H. Massalin and C. Pu. Threads and input/output in the synthesis kernal.
SIGOPS Oper. Syst. Rev., 23(5):191–201, 1989.
[32] Maged M. Michael and Michael L. Scott. Nonblocking algorithms and
preemption-safe locking on multiprogrammed shared memory multiproces-
sors. Journal of Parallel and Distributed Computing, 51(1):1–26, 1998.
[33] Insung Park, Michael J. Voss, Seon Wook Kim, and Rudolf Eigenmann.
Parallel programming environment for openmp. Scientific Programming,
9:143–161, 2001.
[34] M. Poldner and H. Kuchen. Scalable farms. In Proc. of Intl. PARCO 2005:
Parallel Computing, Malaga, Spain, September 2005.
[35] Bill Pottenger and Rudolf Eigenmann. Idiom recognition in the Polaris par-
allelizing compiler. In Proc. of the 9th Intl. Conference on Supercomputing
(ICS ’95), pages 444–448, New York, NY, USA, 1995. ACM Press.
[36] S. Prakash, Yann Hang Lee, and T. Johnson. A nonblocking algorithm for
shared queues using compare-and-swap. IEEE Trans. Comput., 43(5):548–
559, 1994.
[37] James Reinders. Intel Threading Building Blocks: Outfitting C++ for
Multi-core Processor Parallelism. O’Reilly, 2007.