CSC580
Parallel Processing
LECTURE 4:
Parallel Computing Design (PART 1)
PREPARED BY: SALIZA RAMLY
Topic Overview
This topic introduces students to:
Algorithms and Concurrency
o Introduction to Parallel Algorithms
o Tasks and Decomposition
o Processes and Mapping
o Processes Versus Processors
o Decomposition Techniques
o Recursive Decomposition
o Data Decomposition
o Exploratory Decomposition
o Speculative Decomposition
SALIZA RAMLY - CSC580
Introduction to
Parallel Algorithms
Introduction
Parallel algorithm
o It tells us how to solve a given problem using multiple processors.
o It involves more than just specifying the steps.
o It has the added dimension of concurrency.
o The algorithm designer must specify sets of steps that can be executed simultaneously.
o This chapter methodically discusses the process of designing and implementing parallel algorithms.
Preliminaries
Preliminaries: Decomposition, Tasks, and
Dependency Graphs
o The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
o A given problem may be decomposed into tasks in many different ways.
o Tasks may be of the same size or of different sizes.
o A decomposition can be illustrated in the form of a directed graph with nodes
corresponding to tasks and edges indicating that the result of one task is
required for processing the next. Such a graph is called a task dependency
graph.
Example: Multiplying a Dense Matrix
with a Vector
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and vector accessed by Task 1.

Observations: While tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations.

Is this the maximum number of tasks we could decompose this problem into?
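As an illustration (a Python sketch, not part of the original slides), each output element y[i] can be packaged as an independent task. The tasks share the vector b but have no control dependencies, so they may run in any order or simultaneously:

```python
# Sketch: decomposing a dense matrix-vector product y = A*b into n
# independent tasks, one per output element. Each task reads row i of A
# and the shared vector b; no task depends on any other.

def make_tasks(A, b):
    """Return one zero-argument task per output element of y = A*b."""
    n = len(A)
    def compute(i):
        # Task i computes y[i] = sum_j A[i][j] * b[j].
        return sum(A[i][j] * b[j] for j in range(len(b)))
    return [lambda i=i: compute(i) for i in range(n)]

A = [[1, 2], [3, 4]]
b = [10, 20]
y = [t() for t in make_tasks(A, b)]  # independent tasks; any order works
```

Because the tasks are independent, replacing the list comprehension with a thread or process pool would execute them concurrently without further changes.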
Example: Database Query Processing
Consider the execution of the query:
MODEL = ``CIVIC'' AND YEAR = 2001 AND (COLOR = ``GREEN'' OR COLOR = ``WHITE'')
on the following database:
ID#  Model   Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
A database storing information about used vehicles
Example: Database Query Processing
The execution of the query can be
divided into subtasks in various
ways. Each task can be thought of
as generating an intermediate
table of entries that satisfy a
particular clause.
Task: create the set of elements that satisfy one (or several) criteria.
Edge: the output of one task serves as input to the next.
Example: Database Query Processing
Note that the same problem
can be decomposed into
subtasks in other ways as well.
Different task decompositions lead to different degrees of parallelism.
An alternate decomposition of the given problem into
subtasks, along with their data dependencies.
Preliminaries: Granularity, Concurrency,
and Task-Interaction
The number of tasks into which a problem is decomposed determines its
granularity.
Fine-grained decomposition: decomposition into a large number of tasks.
Coarse-grained decomposition: decomposition into a small number of tasks.
Each task in this example corresponds to the computation of three elements of the result vector.
Degree of Concurrency
Degree of concurrency: the number of tasks that can execute in parallel.
Maximum degree of concurrency: the largest number of concurrent tasks at any point of the execution.
Average degree of concurrency: the average number of tasks that can be executed concurrently over the execution of the program.
Degree of concurrency vs. task granularity: inverse relation. The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa.
Critical Path of Task Graph
Directed path: a sequence of tasks that must be processed one after the other.
Critical path: the longest directed path between any pair of start node (node with no incoming edges) and finish node (node with no outgoing edges).
Critical path length: the sum of the weights of the nodes along the critical path.
Average degree of concurrency = total amount of work / critical path length.
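The two quantities above can be computed mechanically. A Python sketch (the `{task: predecessors}` dictionary representation is an assumption, not from the slides) for a small diamond-shaped graph with unit node weights:

```python
# Sketch: critical-path length and average degree of concurrency for a
# task dependency graph given as {task: [predecessor tasks]} plus a
# weight per task (here, unit weights).

def critical_path_length(preds, weight):
    memo = {}
    def longest(t):
        # Longest weighted path through the graph that ends at task t.
        if t not in memo:
            memo[t] = weight[t] + max((longest(p) for p in preds[t]), default=0)
        return memo[t]
    return max(longest(t) for t in preds)

# Diamond graph: t1 -> {t2, t3} -> t4, each task taking unit time.
preds = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
weight = {t: 1 for t in preds}
cp = critical_path_length(preds, weight)   # longest chain, e.g. t1 -> t2 -> t4
avg = sum(weight.values()) / cp            # total work / critical path length
```

Here the total work is 4 units and the critical path has length 3, giving an average degree of concurrency of 4/3.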
Q: What is the average degree of concurrency in each decomposition?
Consider the task dependency graphs of the two database query decompositions:
Task dependency graphs
Graph (a):                          Graph (b):
Maximum degree of concurrency = ?   Maximum degree of concurrency = ?
Critical path length = ?            Critical path length = ?
Average degree of concurrency = ?   Average degree of concurrency = ?
Q: What is the upper bound of the
number of concurrent tasks?
Multiplying a dense matrix with a vector: there can be no more than n² concurrent tasks.
Limits on Parallelization
Q: The larger the number of concurrent tasks, the better?
o It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity.
o There is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks.
o Concurrent tasks may also have to exchange data with other tasks. This results in communication overhead. The tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds.
Task Interaction Graphs
Subtasks exchange data with others in a decomposition.
For example, even in the trivial
decomposition of the dense matrix-
vector product, if the vector is not
replicated across all tasks, they will have
to communicate elements of the vector.
The graph of tasks (nodes) and their interactions/data exchanges (edges) is referred to as a task interaction graph.
Note: Task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies.
Q: Can you explain this task interaction
graph?
Multiplying a sparse matrix A with a vector b:
• The computation of each element of the result vector is a task.
• Only the non-zero elements of matrix A participate in the computation.
• If we partition b across tasks, then the task interaction graph of the computation is identical to the graph of the matrix A.
Task interaction graph
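The interaction graph can be derived directly from the nonzero structure of A. A Python sketch (the set-of-index-pairs representation of A and the convention that task j owns b[j] are assumptions for illustration):

```python
# Sketch: deriving the task interaction graph for sparse y = A*b, where
# task i computes y[i] and task j owns b[j]. Task i must fetch b[j] from
# task j whenever A[i][j] is nonzero, so the interaction graph mirrors
# the nonzero structure of A.

def interaction_graph(nonzeros):
    """nonzeros: set of (i, j) index pairs with A[i][j] != 0."""
    edges = set()
    for i, j in nonzeros:
        if i != j:                 # b[i] is owned locally: no interaction
            edges.add((i, j))      # task i needs b[j] from task j
    return edges

# A 3x3 sparse matrix with nonzeros on the diagonal and at (0,2), (2,0).
nz = {(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)}
edges = sorted(interaction_graph(nz))
```

Only the off-diagonal nonzeros produce interactions, so tasks 0 and 2 communicate while task 1 works entirely on local data.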
Task Interaction Graphs, Granularity, and
Communication
If the granularity of a decomposition is finer, the associated
overhead (as a ratio of useful work associated with a task)
increases.
Task Interaction Graphs, Granularity, and
Communication
Example: Each node takes unit time to process and each interaction (edge) causes
an overhead of a unit time.
Viewing node 0 as an independent task involves a useful computation of one time unit and a communication overhead of three time units. How can this problem be solved?
Processes and Mapping
o In general, the number of tasks in a decomposition exceeds the number of
processing elements available.
o For this reason, a parallel algorithm must also provide a mapping of tasks to
processes.
Processes and Mapping
Note:
o We refer to the mapping as being from tasks to processes, as opposed to
processors.
o This is because typical programming APIs, as we shall see, do not allow easy
binding of tasks to physical processors.
o Rather, we aggregate tasks into processes and rely on the system to map these
processes to physical processors.
o We use the term process (not in the UNIX sense of a process) simply to mean a collection of tasks and associated data.
Processes and Mapping
Appropriate mapping of tasks to processes is critical to the parallel performance of
an algorithm.
Mappings are determined by both the task dependency and task interaction graphs.
Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance).
Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).
Processes and Mapping
An appropriate mapping must minimize parallel execution time by:
o Mapping independent tasks to different processes.
o Assigning tasks on critical path to processes as soon as they become available.
o Minimizing interaction between processes by mapping tasks with dense interactions
to the same process.
Note: These criteria often conflict with each other. For example, a decomposition
into one task (or no decomposition at all) minimizes interaction but does not
result in a speedup at all! Can you think of other such conflicting cases?
Processes and Mapping: Example
Mapping tasks in the database query decomposition to processes. These
mappings were arrived at by viewing the dependency graph in terms of levels
(no two nodes in a level have dependencies). Tasks within a single level are then
assigned to different processes.
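The level-by-level mapping described above can be sketched in Python (the helper names and the round-robin assignment within a level are illustrative assumptions, not the slides' exact procedure):

```python
# Sketch: map tasks to processes level by level. No two tasks in the
# same dependency level depend on each other, so tasks within a level
# are dealt round-robin to different processes.

def levelize(preds):
    """Assign each task its dependency level (roots are level 0)."""
    level = {}
    def lv(t):
        if t not in level:
            level[t] = 1 + max((lv(p) for p in preds[t]), default=-1)
        return level[t]
    for t in preds:
        lv(t)
    return level

def map_to_processes(preds, nprocs):
    level = levelize(preds)
    by_level = {}
    for t in sorted(level):
        by_level.setdefault(level[t], []).append(t)
    mapping = {}
    for tasks in by_level.values():
        for k, t in enumerate(tasks):
            mapping[t] = k % nprocs    # spread each level across processes
    return mapping

# t1 and t2 are independent; t3 depends on both.
preds = {"t1": [], "t2": [], "t3": ["t1", "t2"]}
m = map_to_processes(preds, 2)
```

The two independent tasks t1 and t2 land on different processes, while t3, alone in its level, can go anywhere.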
Decomposition
Techniques
Decomposition Techniques
So how does one decompose a task into various subtasks?
While there is no single recipe that works for all problems, we present a set of commonly
used techniques that apply to broad classes of problems.
Decomposition techniques:
o recursive decomposition
o data decomposition
o exploratory decomposition
o speculative decomposition
Recursive Decomposition
o Generally suited to problems that are solved using the divide-and-conquer
strategy.
o A given problem is first decomposed into a set of sub-problems.
o These sub-problems are recursively decomposed further until a desired
granularity is reached.
Recursive Decomposition Example
A classic example of a divide-and-
conquer algorithm on which we can
apply recursive decomposition is
Quicksort.
In this example, once the list has
been partitioned around the pivot,
each sublist can be processed
concurrently (i.e., each sublist
represents an independent subtask).
This can be repeated recursively.
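A minimal Python sketch of this recursive decomposition (written sequentially for clarity; in an actual parallel implementation the two recursive calls would be launched as separate tasks):

```python
# Sketch: recursive decomposition of quicksort. After partitioning
# around a pivot, the two sublists are independent subtasks that could
# be sorted concurrently; here they are simply recursed on in turn.

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    left = [x for x in xs[1:] if x < pivot]     # independent subtask 1
    right = [x for x in xs[1:] if x >= pivot]   # independent subtask 2
    return quicksort(left) + [pivot] + quicksort(right)

result = quicksort([3, 1, 4, 1, 5, 9, 2, 6])
```

Each recursion level roughly doubles the number of available subtasks, which is exactly how this decomposition generates concurrency.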
Recursive Decomposition Example
The problem of finding the minimum number in a given list (or indeed any other
associative operation such as sum, AND, etc.) can be fashioned as a divide-and-
conquer algorithm. The following algorithm illustrates this.
We first start with a simple serial loop for computing the minimum entry in a
given list:
procedure SERIAL_MIN (A, n)
begin
  min := A[0];
  for i := 1 to n − 1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end SERIAL_MIN
Recursive Decomposition Example
We can rewrite the loop as follows:

procedure RECURSIVE_MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN (A, n/2);
    rmin := RECURSIVE_MIN (&(A[n/2]), n − n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE_MIN
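A direct Python rendering of RECURSIVE_MIN (half-open index bounds are used instead of pointer arithmetic). The two recursive calls operate on disjoint halves of the list and have no dependency on each other, so they could run as concurrent tasks:

```python
# Sketch: divide-and-conquer minimum. The left-half and right-half
# calls are independent subtasks; only the final comparison depends on
# both results.

def recursive_min(A, lo, hi):
    """Minimum of A[lo:hi] by recursive decomposition."""
    if hi - lo == 1:
        return A[lo]
    mid = (lo + hi) // 2
    lmin = recursive_min(A, lo, mid)   # left half:  independent task
    rmin = recursive_min(A, mid, hi)   # right half: independent task
    return lmin if lmin < rmin else rmin

smallest = recursive_min([4, 9, 1, 7, 8, 11, 2, 12], 0, 8)
```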
Recursive Decomposition Example
The code in the previous foil can be decomposed naturally using a recursive
decomposition strategy. We illustrate this with the following example of finding
the minimum number in the set {4, 9, 1, 7, 8, 11, 2, 12}. The task dependency
graph associated with this computation is as follows:
Data Decomposition
o Identify the data on which computations are performed.
o Partition this data across various tasks.
o This partitioning induces a decomposition of the problem.
o Data can be partitioned in various ways - this critically impacts performance of
a parallel algorithm.
Data Decomposition: Output Data
Decomposition
o Often, each element of the output can be computed independently of others
(but simply as a function of the input).
o A partition of the output across tasks decomposes the problem naturally.
Output Data Decomposition: Example
Consider the problem of multiplying two n x n matrices A and B to yield matrix C.
The output matrix C can be partitioned into four tasks as follows:
Output Data Decomposition: Example
A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I                      Decomposition II
Task 1: C1,1 = A1,1 B1,1             Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1      Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2             Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,2 B2,2      Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,1 B1,1             Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,2 B2,1      Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2             Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2      Task 8: C2,2 = C2,2 + A2,2 B2,2
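A Python sketch of Decomposition I (scalars stand in for the matrix blocks A_{i,k} and B_{k,j}, an illustrative simplification). Tasks writing the same output block form a dependency chain; tasks writing different blocks are independent of one another:

```python
# Sketch: the eight tasks of Decomposition I for 2x2 block matrix
# multiplication. Even-numbered tasks accumulate into a block already
# written by the preceding task, so Task 2 depends on Task 1, Task 4 on
# Task 3, and so on; the four chains are mutually independent.

def matmul_decomposition_one(A, B):
    C = [[0, 0], [0, 0]]
    C[0][0] = A[0][0] * B[0][0]        # Task 1
    C[0][0] += A[0][1] * B[1][0]       # Task 2 (depends on Task 1)
    C[0][1] = A[0][0] * B[0][1]        # Task 3
    C[0][1] += A[0][1] * B[1][1]       # Task 4 (depends on Task 3)
    C[1][0] = A[1][0] * B[0][0]        # Task 5
    C[1][0] += A[1][1] * B[1][0]       # Task 6 (depends on Task 5)
    C[1][1] = A[1][0] * B[0][1]        # Task 7
    C[1][1] += A[1][1] * B[1][1]       # Task 8 (depends on Task 7)
    return C

C = matmul_decomposition_one([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Both decompositions compute the same C; they differ only in which products each task performs, and hence in which data each task must access.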
Output Data Decomposition: Example
Consider the problem of counting the instances of given itemsets in a database
of transactions. In this case, the output (itemset frequencies) can be partitioned
across tasks.
Output Data Decomposition: Example
From the previous example, the following observations can be made:
If the database of transactions is replicated across the processes, each task can be accomplished independently with no communication.
If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.
Input Data Partitioning
o Generally applicable if each output can be naturally computed as a function of
the input.
o In many cases, this is the only natural decomposition because the output is not
clearly known a-priori (e.g., the problem of finding the minimum in a list,
sorting a given list, etc.).
o A task is associated with each input data partition. Each task performs as much of the computation as it can using its own part of the data. Subsequent processing combines these partial results.
Input Data Partitioning: Example
In the database counting example, the input (i.e., the transaction set) can be
partitioned. This induces a task decomposition in which each task generates
partial counts for all itemsets. These are combined subsequently for aggregate
counts.
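A Python sketch of this input partitioning (the tuple-based data layout and helper name are illustrative assumptions). Each task counts all itemsets over its own partition of the transactions, and the partial counts are then summed:

```python
# Sketch: input-data partitioning for itemset counting. Each task scans
# only its own partition of the transaction database and produces
# partial counts for every itemset; a final step aggregates them.

from collections import Counter

def partial_counts(transactions, itemsets):
    """One task: count itemset occurrences in one input partition."""
    c = Counter()
    for t in transactions:
        for s in itemsets:
            if set(s) <= set(t):   # itemset s appears in transaction t
                c[s] += 1
    return c

itemsets = [("a", "b"), ("b", "c")]
db = [("a", "b", "c"), ("a", "b"), ("b", "c"), ("a", "c")]
parts = [db[:2], db[2:]]                 # input partitioned across 2 tasks
totals = sum((partial_counts(p, itemsets) for p in parts), Counter())
```

Note that, unlike output partitioning, every task here touches every itemset, which is why the aggregation step is unavoidable.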
Partitioning Input and Output Data
Often input and output data decomposition can be combined for a higher degree of concurrency. For the itemset counting example, the transaction set (input) and itemset counts (output) can both be decomposed as follows:
Intermediate Data Partitioning
o Computation can often be viewed as a sequence of transformations from the input to the output data.
o In these cases, it is often beneficial to use one of the intermediate stages as a
basis for decomposition.
Intermediate Data Partitioning: Example
Let us revisit the example of dense matrix multiplication. We first show how we can visualize this computation in terms of intermediate matrices D.
Intermediate Data Partitioning: Example
A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:
Intermediate Data Partitioning: Example
The task dependency graph for the decomposition (shown in previous foil) into
12 tasks is as follows:
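A Python sketch of this two-stage decomposition (scalars again stand in for the 2x2 blocks, an illustrative simplification). Stage 1 computes the intermediate products D_{k,i,j} = A_{i,k} B_{k,j} as eight independent tasks; stage 2 sums them as four independent tasks:

```python
# Sketch: intermediate data partitioning of 2x2 block matrix
# multiplication into 8 + 4 tasks.

def matmul_intermediate(A, B):
    # Stage 1: 8 independent multiplication tasks, one per D[k][i][j].
    D = [[[A[i][k] * B[k][j] for j in range(2)] for i in range(2)]
         for k in range(2)]
    # Stage 2: 4 independent addition tasks, C[i][j] = sum over k.
    C = [[D[0][i][j] + D[1][i][j] for j in range(2)]
         for i in range(2)]
    return C

C = matmul_intermediate([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Compared with the pure output decomposition, the intermediate matrices D remove the dependency chains between multiplication tasks, at the cost of extra memory for D.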
The Owner Computes Rule
o The Owner Computes Rule generally states that the process assigned a
particular data item is responsible for all computation associated with it.
In the case of input data decomposition, the owner computes rule implies that all computations that use an input data item are performed by its owning process.
In the case of output data decomposition, the owner computes rule implies that each output is computed by the process to which that output data is assigned.
NEXT!
LECTURE 5:
PARALLEL ALGORITHM DESIGN (PART 2)