High Performance Computing

1. A sequential code of Monte Carlo Pi calculation is given in the PDF file of MPI programming. Write its MPI version and run it on a parallel computer to discuss the performance improvement and scalability by increasing the number of MPI processes up to 8. In the MPI code, the execution time of the Pi calculation must be measured by properly using MPI_Wtime and be printed out (printf) at the end of the execution. What is the limiting factor of its performance? Will it scale to a large-scale system of a million nodes?

The MPI version of the program is shown on the next page, titled "pi_mpi.c". After compiling the code, it was executed on a parallel computer while varying the number of processes up to 8. The execution time, speedup, and efficiency are summarized in Table 1.

Table 1. Summary of the execution time, speedup, and efficiency for several numbers of processes of the Monte Carlo MPI program.

Number of processes    Execution time (secs)    Speedup    Efficiency
1 (sequential)         10.91626                 1.00000    1.00000
2                      5.458785                 1.99976    0.99988
3                      3.639033                 2.99977    0.99992
4                      2.735953                 3.98993    0.99748
5                      2.183623                 4.99915    0.99983
6                      1.819824                 5.99853    0.99975
7                      1.563699                 6.98105    0.99729
8                      1.368491                 7.97686    0.99711
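
The speedup and efficiency columns are obtained in the usual way as S(p) = T(1)/T(p) and E(p) = S(p)/p, where T(p) is the execution time on p processes. As a check on the last row: S(8) = 10.91626/1.368491 ≈ 7.977 and E(8) = 7.977/8 ≈ 0.997, matching the table.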

Figure 1. Efficiency vs. number of processes of Pi_MPI.


For this Monte Carlo simulation, N was fixed to 100000000. The only communication required by the program is the gathering of all partial results at the very end, where one value per process is sent. The general trend, as can be seen from Fig. 1, is that the efficiency decreases as the number of processes increases. However, since a very large N is used in this particular case, the communication cost is minor compared with the cost of generating the random numbers. As the number of processes increases, each process generates fewer random numbers while the total amount of data gathered at the end grows. The communication overhead therefore becomes the limiting factor, although its effect is only noticeable when the number of processes approaches the
order of N. Nevertheless, since our program to find pi is basically embarrassingly parallel, more processes can be added, up to a million nodes, as long as the number of processes stays below N, and a speedup can be obtained in theory.
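
To make the claim about the limiting factor measurable rather than qualitative, the single reduction in pi_mpi.c can be timed separately. The fragment below is only a sketch of such a check (comm_time is a new variable introduced here; total, hometotal, taskid and rc are the names already used in the attached code):

/* time the only communication of the program separately */
double comm_time = -MPI_Wtime();
rc = MPI_Reduce(&total, &hometotal, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
comm_time += MPI_Wtime();
if (taskid == 0)
    printf("reduction time: %lf [sec]\n", comm_time);

For N = 100000000 the reduction involves only a single integer per process, so this time is expected to stay negligible compared with the random-number generation until the process count approaches the order of N.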

2. Download a sequential code of final.c available at the class web page. It is a simplified version of a 3-dimensional heat conduction simulation. The code is also shown on the next page. The computation results on one slice of the 3-dimensional space are printed out as an image file in the PGM format.

I. Parallelize the code with OpenMP, considering which variables should be private. You can check the correctness by comparing the output data with those of the original code.

The OpenMP code is shown on the next page, titled "heat_omp.c".

There are actually several ways to apply OpenMP to the current problem. First, we can divide the k-loop among a number of threads. By doing so, the variables "j" and "i" must be made private. We can also divide the j-loop (thus requiring "i" to be private), as well as the i-loop (no additional private variables are required). Comparing the two extreme cases, i.e. dividing the k-loop and the i-loop, the k-loop partition performs faster than the i-loop partition. This is because the i-loop version makes the OpenMP runtime redistribute the work among the threads for every iteration of the enclosing j- and k-loops, while the k-loop version distributes the work only once per time step. The minor disadvantage of the k-loop partition is that it stores private copies of "i" and "j" for every thread. For a comparison with 4 threads, the k-loop partition requires 194.84 secs while the i-loop partition requires 341.22 secs. The program attached in this file is the k-loop partition with OpenMP; a sketch of the two variants is shown below.
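
The following sketch shows only the placement of the directives for the two extreme variants discussed above; the array macro T, the buffers rb/wb, and the loop bounds are the ones defined in the attached heat_omp.c.

/* variant 1: parallelize the outermost k-loop.
   The threads are scheduled once per time step; j and i must be private
   because they are declared outside the parallel region. */
#pragma omp parallel for private(j,i)
for(k=1;k<KMAX-1;k++){
    for(j=1;j<JMAX-1;j++){
        for(i=1;i<IMAX-1;i++){
            T(i,j,k,wb) = (T(i,j,k,rb) + T(i,j,k-1,rb) + T(i,j,k+1,rb)
                         + T(i,j-1,k,rb) + T(i,j+1,k,rb)
                         + T(i-1,j,k,rb) + T(i+1,j,k,rb))/7;
        }
    }
}

/* variant 2: parallelize the innermost i-loop.
   No extra private variables are needed, but a parallel loop is started
   for every (k,j) pair, which is why this version is measured to be slower. */
for(k=1;k<KMAX-1;k++){
    for(j=1;j<JMAX-1;j++){
        #pragma omp parallel for
        for(i=1;i<IMAX-1;i++){
            T(i,j,k,wb) = (T(i,j,k,rb) + T(i,j,k-1,rb) + T(i,j,k+1,rb)
                         + T(i,j-1,k,rb) + T(i,j+1,k,rb)
                         + T(i-1,j,k,rb) + T(i+1,j,k,rb))/7;
        }
    }
}

On compilers supporting OpenMP 3.0 or later, a collapse(2) clause on the k-loop variant would be another option to expose more parallelism without paying the per-iteration scheduling cost of variant 2.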

The comparison of output.pgm and output_omp.pgm (Fig. 2 and Fig. 3) verifies that the OpenMP version of the program produces the correct results.
Figure 2. Sequential result. Figure 3. OpenMP result.

II. Under the assumption of a specific domain decomposition method, describe necessary communications if the code is parallelized with MPI.

The domain decomposition was performed on the highest-level loop, i.e. "k". Thus, every process only calculates T from k_initial to k_finish. However, for every time step, the temperature values at k_initial-1 and k_finish+1 from the previous time step are required. This is where communication becomes necessary. The schematic of this communication is depicted in Fig. 4. This communication uses point-to-point messages between neighbouring processes (blocking MPI_Send/MPI_Recv in the attached code).

Figure 4. Schematic of the communication.


After reaching time TMAX, each process stores the current temperature values from k_initial to k_finish. Thus, to complete the program, a gather operation could be performed to collect all of the temperature values from 0 to KMAX. However, when writing this MPI version of the heat transfer, since the print() function only prints a particular slice of the data, a technique that determines which process should perform the print() function was used instead of a gather operation, to reduce time. A sketch of the per-time-step halo exchange is shown below.
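
As an illustration of the communications described above, the sketch below exchanges one whole k-plane with each neighbour per time step using MPI_Sendrecv. It is only a sketch under the data layout of the attached code (a fixed-k plane of IMAX*JMAX doubles is contiguous in temp); the variables ini, end, id, p, wb and the macro T are the ones used in heat_mpi.c, while up, down and plane are names introduced here for the example.

/* neighbours in the k-direction; MPI_PROC_NULL turns the calls into no-ops at the ends */
int up   = (id == p-1) ? MPI_PROC_NULL : id+1;  /* owner of the planes above (k >= end) */
int down = (id == 0)   ? MPI_PROC_NULL : id-1;  /* owner of the planes below (k <  ini) */
int plane = IMAX*JMAX;                          /* one k-plane is contiguous in memory  */
MPI_Status status;

/* send my lowest interior plane down, receive the halo plane above my sub-domain */
MPI_Sendrecv(&T(0,0,ini,wb),   plane, MPI_DOUBLE, down, 1,
             &T(0,0,end,wb),   plane, MPI_DOUBLE, up,   1,
             MPI_COMM_WORLD, &status);
/* send my highest interior plane up, receive the halo plane below my sub-domain */
MPI_Sendrecv(&T(0,0,end-1,wb), plane, MPI_DOUBLE, up,   2,
             &T(0,0,ini-1,wb), plane, MPI_DOUBLE, down, 2,
             MPI_COMM_WORLD, &status);

Exchanging whole planes in two calls also avoids the per-element messages of the attached implementation and the ordering issues of blocking sends, since MPI_Sendrecv pairs each send with the matching receive.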

III. Write an MPI version of the code, and discuss the scalability and parallel efficiency of the program considering the communication overhead.

The MPI version of this code is shown on the next page, titled "heat_mpi.c", and the result is shown in Fig. 5. As can be seen from the attached code, the current MPI code can only be run on 3 or more processes. The execution time, speedup, and efficiency for 1, 3, and 4 processes are summarized in Table 2, and the trend is illustrated in Fig. 6.

Figure 5. MPI result

Table 2. Summary of the execution time, speedup, and efficiency for several numbers of processes of the heat transfer MPI program.

Number of processes    Execution time (secs)    Speedup    Efficiency
1 (sequential)         734.36                   1.000      1.000
3                      305.93                   2.400      0.800
4                      255.33                   2.876      0.719

Figure 6. Efficiency vs. number of processes of Heat_MPI.

From the trend, we can see that the efficiency decreases considerably with the number of processes. This is because every additional process adds communication of about 2*JMAX*IMAX boundary values per time step. The more processes there are, the more communication takes place, which increases the overhead. Eventually a point is reached where adding processes yields no further speedup. In practice, finding the optimum number of processes for a specific problem is usually done by trial and error.
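
To put a rough number on this overhead (assuming the sizes of the attached code, IMAX = JMAX = 512, and 8-byte doubles): each internal interface exchanges about 2 × 512 × 512 × 8 bytes ≈ 4 MiB per time step, i.e. roughly 2 GiB over the TMAX = 500 steps, and every additional process introduces one more such interface while the computation per process shrinks.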

IV. Is there any other idea to improve the parallel efficiency of this program?

To improve the parallel efficiency of this particular heat problem, the communication overhead should be reduced. One possibility is to reduce the surface area of the interfaces between the partitions. A particular example for the case of 4 processes is depicted in Fig. 7. On the left is the decomposition used when writing the program, i.e. the k-loop partition. On the right is the proposed idea to improve the parallel efficiency. As can be seen, the total interface area between the partitions in the left figure is 3*IMAX*JMAX, while on the right the interface area is IMAX*JMAX + KMAX*JMAX. Since IMAX, JMAX, and KMAX are equal, the communication required for the proposed idea is 2/3 of the original one. By reducing the required communication, we can improve the efficiency for the same number of processes. For more processes, the idea can be extended straightforwardly.
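
With the concrete sizes of the attached code (IMAX = JMAX = KMAX = 512), the slab decomposition on the left exposes 3 × 512 × 512 = 786,432 interface cells, whereas the proposed decomposition on the right exposes 512 × 512 + 512 × 512 = 524,288 cells, which is exactly the 2/3 ratio mentioned above.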
Figure 7. (a) Original domain decomposition, (b) proposed domain decomposition.
pi_mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <time.h>

#define MASTER 0

int main(int argc, char* argv[])
{
    int numtasks, taskid, loop;
    int rc;
    int N = 100000000;
    int i, total = 0;
    double x, y, pi;
    double etime;

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    printf("MPI task %d has started...\n", taskid);
    MPI_Barrier(MPI_COMM_WORLD);
    etime = -MPI_Wtime();
    loop = N/numtasks;          /* problem distribution (assumes N is divisible by numtasks) */

    srand(time(NULL) + taskid); /* offset the seed by the rank; multiplying would give rank 0 a constant seed of zero */
    for(i = 0; i < loop; i++){
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        if (x*x + y*y < 1){
            total = total + 1;
        }
    }
    printf("task id %d has total %d from %d loops\n", taskid, total, loop);
    MPI_Barrier(MPI_COMM_WORLD);

    int hometotal;
    rc = MPI_Reduce(&total, &hometotal, 1, MPI_INT, MPI_SUM, 0,
                    MPI_COMM_WORLD); /* sum all points landing inside the circle */

    /* final pi calculation */
    if(taskid == 0){
        pi = 4*(double)hometotal/N;
        printf("pi is: %f\n", pi);
        etime += MPI_Wtime();
        printf("elapsed time: %lf [sec]\n", etime);
    }

    MPI_Finalize();
    return 0;
}
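
For reference, the code above can be built and launched in the usual MPI way, for example with mpicc pi_mpi.c -o pi_mpi and mpirun -np 8 ./pi_mpi; the exact commands depend on the MPI installation used.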

heat_omp.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <omp.h>

#define IMAX (512)


#define JMAX (512)
#define KMAX (512)
#define TMAX (500)

#define T(i,j,k,b) \
temp[(b)*IMAX*JMAX*KMAX+(k)*JMAX*IMAX+(j)*IMAX+(i)]

static void initmat(); /* initialize array */


static void print(); /* print one slice of 3d array */
static double gettime(); /* time measurement */

double* temp = NULL; /* temperature */

int main(int argc, char* argv[])
{
int i,j,k,t; /* loop indices */
int rb = 0, wb = 1; /* for double-buffering */
double stime; /* start */
double etime; /* end */

initmat(); /* setup the array */

stime = gettime();
/* kernel loop */
for(t=0;t<TMAX;t++){
/*omp definition*/
#pragma omp parallel private(j,i)
#pragma omp for
for(k=1;k<KMAX-1;k++){
for(j=1;j<JMAX-1;j++){
for(i=1;i<IMAX-1;i++){
T(i,j,k,wb) = (T(i,j,k,rb)
+ T(i,j,k-1,rb)
+ T(i,j,k+1,rb)
+ T(i,j-1,k,rb)
+ T(i,j+1,k,rb)
+ T(i-1,j,k,rb)
+ T(i+1,j,k,rb))/7;
}
}
}
rb = (rb==0?1:0);
wb = (wb==0?1:0); /* wb != rb */
fprintf(stderr,"%d\n",t);
}
etime = gettime();
fprintf(stderr,"elapsed: %lf [sec]\n",etime-stime);

print(); /* print out the result */

return 0;
}

void initmat(void)
{
int i,j,k;
temp = (double*)malloc(IMAX*JMAX*KMAX*2*sizeof(double));

for(k=0;k<KMAX;k++){
for(j=0;j<JMAX;j++){
for(i=0;i<IMAX;i++){
if(i==0 || i==IMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else if(j==0 || j==JMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else if(k==0 || k==KMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else {
T(i,j,k,0) = 0;
T(i,j,k,1) = 0;
}
}
}
}
}

void print(void)
{
FILE *out_file = fopen("output_omp.pgm","w"); /* the output is a PGM image */
int i,j;

fprintf(out_file,"P2\n%d %d 255\n",IMAX,JMAX);
for(j=0;j<JMAX;j++)
for(i=0;i<IMAX;i++)
fprintf(out_file,"%d ",(unsigned char)T(i,j,KMAX/3,0));
fclose(out_file);
}

double gettime()
{
struct timeval tv;
gettimeofday(&tv,NULL);
return tv.tv_sec + tv.tv_usec/1000000.0;
}
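
For reference, the OpenMP version can be compiled, for example, with gcc -fopenmp heat_omp.c -o heat_omp, and the thread count used for a run can be selected through the OMP_NUM_THREADS environment variable.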

heat_mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <mpi.h>
#include <math.h>
#include <time.h>

#define IMAX (512)


#define JMAX (512)
#define KMAX (512)
#define TMAX (500)

#define T(i,j,k,b) \
temp[(b)*IMAX*JMAX*KMAX+(k)*JMAX*IMAX+(j)*IMAX+(i)]

static void initmat(); /* initialize array */


static void print(); /* print one slice of 3d array */
static double gettime(); /* time measurement */

double* temp = NULL; /* temperature */

int main(int argc, char** argv)
{
int i,j,k,t; /* loop indices */
int rb = 0, wb = 1; /* for double-buffering */
double stime; /* start */
double etime; /* end */

/* domain distribution variable */


int id, p, mod, flr, asg, ini, end, rc, master;
MPI_Status status;

/*MPI initialization*/
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);

initmat(); /* set up the array - every process stores the whole array */

/*domain distribution*/
mod=(KMAX-2)%p;
flr = ((KMAX-2)-mod)/p;
asg = (id<mod)?flr+1:flr;
ini = (id<=mod)?1+id*(flr+1):1+id*flr+mod;
end = ini+asg;
master = (ini<=KMAX/3 && KMAX/3<end)?id:-1; /* master is the process whose sub-domain contains the printed slice k = KMAX/3 */

fprintf(stderr,"master:%d process %d start: %d\t end:%d\n",master,id,ini,end);
MPI_Barrier(MPI_COMM_WORLD);
if(id==master){
stime = MPI_Wtime();}

/* kernel loop - each process only updates its respective k range */


for(t=0;t<TMAX;t++){
for(k=ini;k<end;k++){
for(j=1;j<JMAX-1;j++){
for(i=1;i<IMAX-1;i++){
T(i,j,k,wb) = (T(i,j,k,rb)
+ T(i,j,k-1,rb)
+ T(i,j,k+1,rb)
+ T(i,j-1,k,rb)
+ T(i,j+1,k,rb)
+ T(i-1,j,k,rb)
+ T(i+1,j,k,rb))/7;
}
}
}
MPI_Barrier(MPI_COMM_WORLD); /*wait until all processes are done*/

/* communication to send the values at the sub-domain interfaces (halo exchange) */
/* note: blocking MPI_Send is used with one MPI_DOUBLE per message; this relies on
   eager buffering, since neighbouring processes both send before they receive */
for(j=1;j<JMAX-1;j++){
for(i=1;i<IMAX-1;i++){
if(id!=0 && id!=p-1){
rc = MPI_Send(&T(i,j,ini,wb),1,MPI_DOUBLE,id-1,1,MPI_COMM_WORLD);
rc = MPI_Send(&T(i,j,end-1,wb),1,MPI_DOUBLE,id+1,2,MPI_COMM_WORLD);
rc = MPI_Recv(&T(i,j,ini-1,wb),1,MPI_DOUBLE,id-1,2,MPI_COMM_WORLD,&status);
rc = MPI_Recv(&T(i,j,end,wb),1,MPI_DOUBLE,id+1,1,MPI_COMM_WORLD,&status);
}
else if(id==0){
rc = MPI_Send(&T(i,j,end-1,wb),1,MPI_DOUBLE,id+1,2,MPI_COMM_WORLD);
rc = MPI_Recv(&T(i,j,end,wb),1,MPI_DOUBLE,id+1,1,MPI_COMM_WORLD,&status);
}
else if(id==p-1){
rc = MPI_Send(&T(i,j,ini,wb),1,MPI_DOUBLE,id-1,1,MPI_COMM_WORLD);
rc = MPI_Recv(&T(i,j,ini-1,wb),1,MPI_DOUBLE,id-1,2,MPI_COMM_WORLD,&status);
}
}
}

MPI_Barrier(MPI_COMM_WORLD);

rb = (rb==0?1:0);
wb = (wb==0?1:0); /* wb != rb */
if(id==master){
fprintf(stderr,"%d\n",t);}

MPI_Barrier(MPI_COMM_WORLD);

if(id==master){
etime = MPI_Wtime();
fprintf(stderr,"elapsed: %lf [sec]\n",etime-stime);
print(); /* print out the result by master*/
}
MPI_Finalize();
return 0;
}

void initmat(void)
{
int i,j,k;
temp = (double*)malloc(IMAX*JMAX*KMAX*2*sizeof(double));

for(k=0;k<KMAX;k++){
for(j=0;j<JMAX;j++){
for(i=0;i<IMAX;i++){
if(i==0 || i==IMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else if(j==0 || j==JMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else if(k==0 || k==KMAX-1){
T(i,j,k,0) = 255;
T(i,j,k,1) = 255;
}
else {
T(i,j,k,0) = 0;
T(i,j,k,1) = 0;
}
}
}
}
}
void print(void)
{
FILE *out_file = fopen("output_mpi.pgm","w"); /* the output is a PGM image */
int i,j;

fprintf(out_file,"P2\n%d %d 255\n",IMAX,JMAX);
for(j=0;j<JMAX;j++)
for(i=0;i<IMAX;i++)
fprintf(out_file,"%d ",(unsigned char)T(i,j,KMAX/3,0));
fclose(out_file);
}

double gettime()
{
struct timeval tv;
gettimeofday(&tv,NULL);
return tv.tv_sec + tv.tv_usec/1000000.0;
}
