Parallel Computing
MPI Collective communication
Thorsten Grahs, 18. May 2015
Table of contents
Collective Communication
Communicator
Intercommunicator
Collective Communication
Communication involving a group of processes
Selection of the collective group by a suitable
communicator
All members of the communicator issue an identical call.
No tags
Collective communication does not necessarily involve all
processes (i.e. it need not be global communication)
Collective Communication
Amount of data sent must exactly match the amount of
data received
Collective routines are collective across an entire
communicator and must be called in the same order by
all processes within the communicator
Collective routines are all blocking
Buffer can be reused upon return
Collective routines may return as soon as the calling
process's participation is complete
No mixing of collective and point-to-point communication
Collective Communication functions
Barrier operation
MPI_Barrier()
All tasks wait for each other
Broadcast operation
MPI_Bcast()
One task sends to all
Accumulation operation
MPI_Reduce()
One task accumulates/combines distributed data
Gather operation
MPI_Gather()
One task collects/gathers data
Scatter operation
MPI_Scatter()
One task scatters data (e.g. a vector)
Multi-Task functions
Multi-Broadcast operation
MPI_Allgather()
All participating tasks make the data available to other
participating tasks
Multi-Accumulation operation
MPI_Allreduce()
All participating tasks get the result of the operation (see the sketch after this list)
Total exchange
MPI_Alltoall()
Each involved task sends data to and receives data from all others
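A minimal sketch of the multi-accumulation operation (illustrative, not part of the lecture code): every task contributes one value, and every task receives the global sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Every task contributes its rank; every task receives the sum of all ranks */
  int local = rank, global_sum = 0;
  MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  printf("Rank %d sees the global sum %d\n", rank, global_sum); /* = size*(size-1)/2 */

  MPI_Finalize();
  return 0;
}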
Synchronisation
Barrier operation
MPI_Barrier(comm)
All tasks in comm wait for each other at the barrier.
The only collective routine that provides explicit
synchronization
Returns at any processor only after all processes have
entered the call
Barrier can be used to ensure all processes have reached
a certain point in the computation
Mostly used to synchronize a sequence of tasks
(e.g. for debugging or timing; see the sketch below)
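A minimal sketch of such a synchronization point, here used to obtain comparable timings (illustrative; the sleep call stands in for real local work):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Make sure all processes start the timed phase together */
  MPI_Barrier(MPI_COMM_WORLD);
  double t_start = MPI_Wtime();

  sleep(rank % 2);                 /* stand-in for real local work */

  /* Wait until every process has finished before stopping the clock */
  MPI_Barrier(MPI_COMM_WORLD);
  double t_end = MPI_Wtime();

  if (rank == 0)
    printf("Elapsed time: %f s\n", t_end - t_start);

  MPI_Finalize();
  return 0;
}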
Example: MPI_Barrier
Tasks are waiting for each other
The MPI_Isend is not yet completed,
so its data cannot be accessed.
Broadcast operation
MPI_Bcast(buffer,count,datatype,root,communicator)
All processes in the communicator use the same function call.
Data from the process with rank root is distributed to all
processes in the communicator
The call is blocking, but does not imply synchronization (see the sketch below)
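A minimal broadcast sketch (illustrative): only the root holds the value initially, every process issues the identical call, and afterwards all processes hold it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, data = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
    data = 42;                  /* only the root has the value initially */

  /* Identical call on every process; root = 0 */
  MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("Rank %d now has data = %d\n", rank, data);

  MPI_Finalize();
  return 0;
}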
Accumulation operation
MPI_Reduce(sendbf,recvbf,count,type,op,master,comm)
The process with rank master is the root and receives the result
Combining operation op (e.g. summation)
Processes involved put their local data into sendbf
master collects results into recvbf
Reduce operation
Pre-defined operations
MPI_MAX: maximum
MPI_MAXLOC: maximum and index of maximum
MPI_MIN: minimum
MPI_SUM: summation
MPI_PROD: product
MPI_LXOR: logical exclusive OR
MPI_BXOR: bitwise exclusive OR
...
Example: Reduce Summation
MPI_Reduce(teil,s,1,MPI_DOUBLE,MPI_SUM,0,comm)
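A self-contained version of this call, as a sketch: the variable names teil and s are kept from the slide, comm is replaced by MPI_COMM_WORLD, and each process simply contributes its rank as partial result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank;
  double teil, s = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  teil = (double)rank;          /* local partial result of this process */

  /* Rank 0 (the master) receives the sum of all partial results in s */
  MPI_Reduce(&teil, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("Global sum = %f\n", s);

  MPI_Finalize();
  return 0;
}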
Gather operation
MPI_Gather(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
sbf: local send-buffer
rbf: receive-buffer at the master ma
Each process sends scount elements of data type stype to
the master ma, which receives rcount elements per process
The order of the data in rbf corresponds to the rank order
in the communicator comm
Scatter operation
MPI_Scatter(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
Master ma distributes/scatters the data from sbf
Each process receives its sub-block of sbf in the local
receive buffer rbf
Master ma also sends to itself
The i-th block of sbf goes to the process with rank i
in the communicator comm
Example: Scatter
Three processes involved in comm
Send-buffer: int sbuf[6]={3,14,15,92,65,35};
Receive-buffer:
int rbuf[2];
Function call
MPI_Scatter(sbuf,2,MPI_INT,rbuf,2,MPI_INT,0,comm);
leads to the following distribution:
Process 0: rbuf = { 3, 14}
Process 1: rbuf = {15, 92}
Process 2: rbuf = {65, 35}
Example Scatter-Gather: Averaging
if (world_rank == 0)
  rand_nums = create_rand_nums(elements_per_proc * world_size);

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0)
  sub_avgs = malloc(sizeof(float) * world_size);
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the total average of all numbers (only meaningful on the root)
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}
Multi-broadcast operation
MPI_Allgather(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Data from the local sbuf is sent to all processes and collected in rbuf
Specifying a master is redundant, since all processes receive
the same data
MPI_Allgather corresponds to MPI_Gather followed by a
MPI_Bcast
Example Allgather: Averaging
// Gather all partial averages down to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);
Output
/home/th/: mpirun -n 4 ./average 100
Avg of all elements from proc 1 is 0.479736
Avg of all elements from proc 3 is 0.479736
Avg of all elements from proc 0 is 0.479736
Avg of all elements from proc 2 is 0.479736
Total exchange
MPI_Alltoall(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Matrix view
Before MPI_Alltoall process k has row k of the matrix
After MPI_Alltoall process k has column k of the matrix
MPI_Alltoall corresponds to an MPI_Gather followed by
an MPI_Scatter (see the sketch below)
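A minimal sketch of the total exchange (illustrative): each of the p processes sends one int to every process; afterwards rbuf[k] holds the element that process k sent, i.e. process k now owns "column k".

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *sbuf = malloc(size * sizeof(int));
  int *rbuf = malloc(size * sizeof(int));

  /* Process rank fills "row rank": element k is destined for process k */
  for (int k = 0; k < size; k++)
    sbuf[k] = 100 * rank + k;

  /* Afterwards rbuf[k] is the element that process k sent to this process */
  MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);

  free(sbuf);
  free(rbuf);
  MPI_Finalize();
  return 0;
}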
Variable exchange operations
Variable scatter & Gather variants
MPI_Scatterv & MPI_Gatherv
The variable quantities are:
Number of data elements that will be distributed to individual
processes
Their position in the send-buffer sbuf
Variable Scatter & Gather
Variable scatter
MPI_Scatterv(sbf,scount,displs,styp,
rbf,rcount,rtyp,ma,comm)
scount[i] contains the number of data elements that have
to be sent to process i.
displs[i] defines the start of the data block for process i
relative to sbuf (counted in elements of type styp).
Variable gather
MPI_Gatherv(sbuf,scount,styp,
rbuf,rcount,displs,rtyp,ma,comm)
Also variable function for Allgather, Allscatter & Alltoall
Example MPI_Scatterv
/* Initialising */
if (myrank == root) init(sbuf, N);

/* Splitting work and data */
MPI_Comm_size(comm, &size);
Nopt = N / size;
Rest = N - Nopt * size;
displs[0] = 0;
for (i = 0; i < size; i++) {            /* one entry per process, not per element */
  scount[i] = Nopt;
  if (i > 0) displs[i] = displs[i-1] + scount[i-1];  /* displacements in elements, not bytes */
  if (Rest > 0) { scount[i]++; Rest--; }
}

/* Distributing data */
MPI_Scatterv(sbuf, scount, displs, MPI_DOUBLE, rbuf,
             scount[myrank], MPI_DOUBLE, root, comm);
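For completeness, a hedged sketch of the matching collection step (not part of the original example): it continues the fragment above and reuses scount and displs to gather the distributed blocks back at root.

/* Collecting the blocks back at root (hypothetical continuation) */
MPI_Gatherv(rbuf, scount[myrank], MPI_DOUBLE,
            sbuf, scount, displs, MPI_DOUBLE, root, comm);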
Comparison between BLAS & Reduce
Matrix-vector multiplication
Example comparison
Compare different approaches for y = Ax with A ∈ R^(N×M) (N rows, M columns):
Row-wise distribution: BLAS routine
Column-wise distribution: reduction operation
Example row-wise
Row-wise distribution
Result vector y distributed
Example row-wise BLAS
Building block: matrix-vector multiplication
BLAS (Basic Linear Algebra Subprograms) routine dgemv
void local_mv(int N, int M, double y[], const double A[], int lda, const double x[])
{
  double s;
  /* partial sums -- purely local operation */
  for (int i = 0; i < M; i++) {
    s = 0.0;
    for (int j = 0; j < N; j++)
      s += A[i*lda + j] * x[j];
    y[i] = s;
  }
}
Timing
arith.: 2*N*M*Ta
mem. access:
x: M*Tm(N,1)
y: Tm(M,1)
A: M*Tm(N,1)
Example row-wise vector
Task
Initial distribution:
All data at process 0
Result vector y expected at process 0
Example row-wise matrix
Operations
Distribute x to all processes: MPI_Bcast (p-1)*Tk(N)
Distribute rows of A: MPI_Scatter (p-1)*Tk(M*N)
Example row-wise results
Operations
Arithmetic: 2*N*M*Ta
Communication: (p-1)*[Tk(N) + Tk(M*N) + Tk(M)]
Memory access: 2*M*Tm(N,1) + Tm(M,1)
Example column-wise
Task
Distribution column-wise
Solution vector assembled by reduction operation
Example column-wise vector
Distributing vector x: MPI_Scatter, cost (p-1)*Tk(M)
Example column-wise matrix
Distributing matrix A:
Pack blocks into a buffer
Memory: N*Tm(M,1) + M*Tm(N,1)
Sending: (p-1)*Tk(M*N)
Example column-wise result
Assemble vector y with MPI_Reduce
Cost for the reduction of y: log2(p)*(Tk(N) + N*Ta + 2*Tm(N,1))
Arithmetic: 2*N*M*Ta
Communication: (p-1)*[Tk(M) + Tk(M*N)] + log2(p)*Tk(N)
Memory access: N*Tm(M,1) + M*Tm(N,1) + 2*log2(p)*Tm(N,1)
The column-wise algorithm is slightly faster
Parallelization is only useful if the corresponding data
distribution is already available before the algorithm starts
Communicator
Communicators
Motivation
Communicator: Distinguish different contexts
Conflict-free organization of groups
Integration of third party software
Example: Distinction between library functions and the application
Predefined communicators
MPI_COMM_WORLD
MPI_COMM_SELF
MPI_COMM_NULL
Duplicate communicators
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm);
Creates a copy newcomm of comm
Identical process group
Allows a clear delineation and characterisation of process groups
Example
MPI_Comm myworld;
...
MPI_Comm_dup(MPI_COMM_WORLD, &myworld);
Splitting communicators
MPI_Comm_split(MPI_Comm comm, int color, int key,
MPI_Comm *newcomm);
Divides communicator comm into multiple communicators
with disjoint processor groups
MPI_Comm_split has to be called by all processes in comm
Processes with the same value of color form a new
communicator; key determines their rank order within it
Example Splitting communicator
MPI_Comm comm1, comm2, newcomm;
int i, j, size, rank;

MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);
i = rank % 3;
j = size - rank;
if (i == 0)
  MPI_Comm_split(comm, MPI_UNDEFINED, 0, &newcomm);
else if (i == 1)
  MPI_Comm_split(comm, i, j, &comm1);
else
  MPI_Comm_split(comm, i, j, &comm2);
With color MPI_UNDEFINED the call returns the null handle MPI_COMM_NULL.
Example Splitting communicator
MPI_COMM_WORLD (color = rank % 3, color 0 mapped to MPI_UNDEFINED):

Process: P0  P1  P2  P3  P4  P5  P6  P7  P8
color:    -   1   2   -   1   2   -   1   2
key:      -   7   6   5   4   3   2   1   0

Resulting communicators (rank order given by key):
comm1: P7 (rank 0), P4 (rank 1), P1 (rank 2)
comm2: P8 (rank 0), P5 (rank 1), P2 (rank 2)
P0, P3, P6 (color MPI_UNDEFINED) obtain MPI_COMM_NULL
Free communicator group
Clean up
MPI_Comm_free(MPI_Comm *comm);
Deletes the communicator comm
Resources occupied by comm are released by MPI.
After the function call, the communicator has the value of
the null-handle MPI_COMM_NULL
MPI_Comm_free has to be called by all processes that
belong to comm
Grouping communicators
MPI_Comm_group(MPI_Comm comm, MPI_Group *grp)
Creates a process group from a communicator
More group constructors
MPI_Comm_create
Generates a communicator from a group (see the sketch after this list)
MPI_Group_incl
Include processes into a group
MPI_Group_excl
Exclude processes from a group
MPI_Group_range_incl
Forms a group from a simple pattern
MPI_Group_range_excl
Excludes processes from a group by simple pattern
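A hedged sketch of the typical flow (illustrative names; run with at least three processes): extract the group of MPI_COMM_WORLD, keep ranks 0-2, and build a communicator from the resulting group.

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Group world_grp, sub_grp;
  MPI_Comm sub_comm;
  int ranks[3] = {0, 1, 2};       /* processes to keep */

  MPI_Init(&argc, &argv);

  /* Group underlying MPI_COMM_WORLD */
  MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

  /* New group containing only ranks 0, 1, 2 */
  MPI_Group_incl(world_grp, 3, ranks, &sub_grp);

  /* Communicator for that group; processes outside it obtain MPI_COMM_NULL */
  MPI_Comm_create(MPI_COMM_WORLD, sub_grp, &sub_comm);

  if (sub_comm != MPI_COMM_NULL)
    MPI_Comm_free(&sub_comm);
  MPI_Group_free(&sub_grp);
  MPI_Group_free(&world_grp);

  MPI_Finalize();
  return 0;
}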
Example: create a group
Group
grp=(a,b,c,d,e,f,g),
n=3,
rank=[5,0,2]
MPI_Group_incl(grp, n, rank, &newgrp)
Include in new group newgrp
n=3 processes
defined by pattern rank=[5,0,2]
newgrp=(f,a,c)
MPI_Group_excl(grp, n, rank, &newgrp)
Excludes from grp the n=3 processes
defined by the pattern rank=[5,0,2]
The remaining processes form newgrp
newgrp=(b,d,e,g)
Example: create a group II
Group
grp=(a,b,c,d,e,f,g,h,i,j),
n=3,
ranges=[[6,7,1],[1,6,2],[0,9,4]]
Each range forms a triple [start, end, stride]
MPI_Group_range_incl(grp, 3, ranges, &newgrp)
Include in new group newgrp
n=3 range triples defined by [[6,7,1],[1,6,2],[0,9,4]]
newgrp=(g,h,b,d,f,a,e,i)
MPI_Group_range_excl(grp, 3, ranges, &newgrp)
Excludes from grp the processes covered by the
n=3 range triples [[6,7,1],[1,6,2],[0,9,4]]; the rest form newgrp
newgrp=(c,j)
Operations on communicator groups
More grouping functions
Merging groups: MPI_Group_union
Intersection of groups: MPI_Group_intersection
Difference of groups: MPI_Group_difference
Comparing groups: MPI_Group_compare
Delete/free a group: MPI_Group_free
Size of a group: MPI_Group_size
Rank of the calling process in a group: MPI_Group_rank
...
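A hedged fragment illustrating some of the operations listed above (grp1 and grp2 are assumed to be existing groups):

MPI_Group grp_union, grp_common;
int cmp, usize;

/* All processes that are in grp1 or grp2 */
MPI_Group_union(grp1, grp2, &grp_union);

/* Processes that are in both groups */
MPI_Group_intersection(grp1, grp2, &grp_common);

MPI_Group_size(grp_union, &usize);
MPI_Group_compare(grp1, grp2, &cmp);   /* MPI_IDENT, MPI_SIMILAR or MPI_UNEQUAL */

MPI_Group_free(&grp_union);
MPI_Group_free(&grp_common);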
Intercommunicator
Intercommunicator
Intracommunicator
Until now we have only handled communication inside one
contiguous group.
This communication was inside (intra/internal) a
communicator.
Intercommunicator
A communicator that establishes a communication context between two groups
Intercommunicators are associated with 2 groups of
disjoint processes
Intercommunicators are associated with a remote group
and a local group
The target process (destination for a send, source for a
receive) is addressed by its rank in the remote group.
A communicator is either intra or inter, never both
Create intercommunicator
MPI_Intercomm_create(local_comm, local_bridge,
bridge_comm, remote_bridge, tag, &newcomm )
local_comm
local Intracommunicator (handle)
local_bridge
Rank of a distinguished process in local_comm (integer)
bridge_comm
Bridge ("peer") intracommunicator through which the two
bridge-head processes can reach each other; it connects
local_comm to the remote group via the newly built
intercommunicator newcomm
remote_bridge
Rank of the bridge-head process of the remote group within bridge_comm (integer)
Communication between groups
The function uses point-to-point communication with the specified
tag between the two processes defined as bridge heads (see the fragment below).
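Once an intercommunicator exists, ordinary point-to-point calls address the partner by its rank in the remote group. A hedged fragment, assuming the communicators and memberKey built in the example on the following slides:

int value = 0, local_rank;
MPI_Comm_rank(myComm, &local_rank);        /* rank within the local group */

if (memberKey == 0 && local_rank == 0) {
  value = 17;
  /* destination rank 0 refers to the remote group of myFirstComm (group 1) */
  MPI_Send(&value, 1, MPI_INT, 0, 99, myFirstComm);
}
else if (memberKey == 1 && local_rank == 0) {
  /* source rank 0 likewise refers to the remote group (group 0) */
  MPI_Recv(&value, 1, MPI_INT, 0, 99, myFirstComm, MPI_STATUS_IGNORE);
}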
Example
int main(int argc, char **argv)
{
  MPI_Comm myComm;       /* intra-communicator of local sub-group */
  MPI_Comm myFirstComm;  /* inter-communicator */
  MPI_Comm mySecondComm; /* second inter-communicator (group 1 only) */
  int memberKey, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* User code must generate memberKey in the range [0, 1, 2] */
  memberKey = rank % 3;

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, memberKey, rank, &myComm);
Example
  /* Build inter-communicators. Tags are hard-coded. */
  if (memberKey == 0)
  {
    /* Group 0 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         01, &myFirstComm);
  }
  else if (memberKey == 1)
  {
    /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 0,
                         01, &myFirstComm);
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 2,
                         12, &mySecondComm);
  }
  else if (memberKey == 2)
  {
    /* Group 2 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         12, &mySecondComm);
  }
Example
  /* Do work ... */

  switch (memberKey)   /* free communicators appropriately */
  {
  case 0:
    MPI_Comm_free(&myFirstComm);
    break;
  case 1:
    MPI_Comm_free(&myFirstComm);
    MPI_Comm_free(&mySecondComm);
    break;
  case 2:
    MPI_Comm_free(&mySecondComm);
    break;
  }

  MPI_Finalize();
  return 0;
}
Motivation Intercommunicator
Used for
Meta-computing
Cloud computing
Low bandwidth between the components, e.g. cluster <-> PC
The bridge head controls the communication with the remote computer