Report Homework 1
Eko Rudiawan Jamzuri, 60775041H
March 15, 2019
1 OpenMP Experiment
This section describes the OpenMP experiments in the exercise.
1.1 OpenMP Pragma Completion 1
1.1.1 Source Code
The experiment uses different pragma directives and evaluates the total execution time of the program. Three directives are compared: for, sections, and single. The experiment also evaluates how the execution time changes with the array size. Each configuration is run five times; the total execution time of each run is recorded and the average is calculated manually. The pragma variants are listed below; P1, P2, and P3 in Table 1 refer to these listings, and a sketch of the timing measurement follows them.
1. P1 : Use for directive
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        CC[i] = A[i] + B[i];
    }
}
2. P2 : Use sections directive
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < N; i++) {
            CC[i] = A[i] + B[i];
        }
    }
}
3. P3 : Use single directive
#pragma omp parallel
{
    #pragma omp single
    for (i = 0; i < N; i++) {
        CC[i] = A[i] + B[i];
    }
}
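The timing code itself is not shown in the report. The sketch below is one possible way the execution time of a single run could be measured, assuming the arrays A, B, and CC are already allocated and N is the array size; the function name and the use of omp_get_wtime() are assumptions, not the original harness.

#include <omp.h>

/* Hypothetical timing sketch: measures the wall-clock time of one run of
   the parallel vector addition (P1 variant). */
double time_vector_add(const double *A, const double *B, double *CC, int N)
{
    int i;
    double start = omp_get_wtime();   /* wall-clock time before the region */
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            CC[i] = A[i] + B[i];
        }
    }
    double end = omp_get_wtime();     /* wall-clock time after the region */
    return end - start;               /* elapsed seconds for this run */
}

Each configuration is measured this way five times and the five values are averaged by hand, as described above.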
1.1.2 Result
The comparison of the three pragma directives is listed in Table 1, Table 2, Table 3, and Table 4 below. Table 1 uses an array of 100 elements, Table 2 uses 10,000 elements, Table 3 uses 1,000,000 elements, and Table 4 uses 100,000,000 elements.
Table 1: Result of execution time with N=100
Num Seq (us) P1 (ms) P2 (ms) P3 (ms)
1 0.91 12.91 4.90 7.21
2 0.93 5.02 6.73 5.22
3 0.45 7.81 6.43 5.63
4 0.96 7.17 6.75 4.70
5 0.94 6.04 8.90 5.93
Avg 0.84 7.79 6.74 5.74
Table 2: Result of execution time with N=10000
Num Seq (us) P1 (ms) P2 (ms) P3 (ms)
1 99.24 5.98 7.98 3.04
2 48.03 6.41 7.53 5.18
3 80.12 4.05 6.38 8.39
4 99.80 8.28 6.94 7.09
5 102.40 5.42 4.65 6.91
Avg 85.92 6.03 6.70 6.12
Table 3: Result of execution time with N=1000000
Num Seq (ms) P1 (ms) P2 (ms) P3 (ms)
1 5.15 6.58 16.67 14.95
2 5.15 5.95 15.41 11.54
3 5.17 8.33 14.30 13.60
4 5.18 9.19 17.52 15.45
5 5.17 7.31 16.11 16.22
Avg 5.16 7.47 16.00 14.35
Based on this experiment, the pragma with the for directive gives a significant speed-up only when it is used on a very large array (100,000,000 elements, Table 4).
Table 4: Result of execution time with N=100000000
Num Seq (ms) P1 (ms) P2 (ms) P3 (ms)
1 510.64 149.16 781.36 772.94
2 511.37 160.01 773.92 776.86
3 511.69 164.86 782.42 786.86
4 511.15 155.05 777.22 776.66
5 512.97 159.05 775.20 778.52
Avg 511.56 157.63 778.02 778.37
In that case the parallel version is clearly faster than the sequential program, but when the for directive is used on a small array the performance is worse than the sequential program. Both the sections and single directives have no optimizing effect on this code: the sections directive contains only one section block, so the loop is executed by a single thread, exactly as with the single directive. That is why the performance with those directives is no better than that of the sequential program.
1.2 OpenMP Pragma Completion 2
1.2.1 Source Code
The method is the same as in the first experiment: the pragma directive and the array size are changed, and the execution time is measured.
1. P1 : Use for directive
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        CC[i] = A[i] + B[i];
    }
    #pragma omp for   /* note: a reduction(+:parallelSum) clause is needed here to avoid a data race on parallelSum */
    for (i = 0; i < N; i++) {
        parallelSum += CC[i];
    }
}
2. P2 : Use sections directive
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < N; i++) {
            CC[i] = A[i] + B[i];
        }
        #pragma omp section
        for (i = 0; i < N; i++) {
            parallelSum += CC[i];
        }
    }
}
3. P3 : Use single directive
#pragma omp parallel
{
    #pragma omp single
    for (i = 0; i < N; i++) {
        CC[i] = A[i] + B[i];
    }
    #pragma omp single
    for (i = 0; i < N; i++) {
        parallelSum += CC[i];
    }
}
1.2.2 Result
The experiment resulting similar result with the first experiment. For directive will have significant
impact when it use to calculate a big array. For directive has no better performance when it use to
calculate small size array compare to sequential program. Overview of the experiment can be seen
in Table 5, Table 6, and Table 7.
Table 5: Result of execution time with N=1000
Num Seq (us) P1 (ms) P2 (ms) P3 (ms)
1 5.39 7.91 4.58 7.81
2 11.20 10.17 3.94 6.70
3 5.36 9.30 6.88 10.87
4 11.17 8.84 5.47 5.43
5 5.39 4.05 8.28 6.18
Avg 7.70 8.05 5.83 7.40
Table 6: Result of execution time with N=1000000
Num Seq (ms) P1 (ms) P2 (ms) P3 (ms)
1 6.77 10.31 28.85 22.55
2 6.79 7.48 28.82 19.65
3 6.82 6.35 27.78 21.09
4 6.80 10.67 10.51 15.11
5 6.79 9.88 12.68 18.02
Avg 6.79 8.94 21.73 19.28
1.3 Bug Finding 1
The source code has a bug because, by default, all variables inside the parallel region are shared between threads. In this case the temp variable is accessed by all threads, so its value can easily be overwritten by another thread in the middle of a swap. To fix the bug, the pragma must be modified so that temp becomes a private variable, by adding a private clause at the end of the pragma. The code below shows how to make temp private.
Table 7: Result of execution time with N=10000000
Num Seq (ms) P1 (ms) P2 (ms) P3 (ms)
1 67.40 44.59 285.56 124.78
2 67.36 27.64 291.63 124.02
3 67.50 35.52 254.09 136.22
4 67.42 39.22 270.00 120.61
5 67.45 31.18 254.15 126.13
Avg 67.43 35.63 271.09 126.35
#pragma omp parallel for private(temp)
for (i = 0; i < N; i++) {
    temp = AA[i];
    AA[i] = BB[i];
    BB[i] = temp;
}
1.4 Bug Finding 2
The problem with this code is that, when it is executed in parallel, the variable x is shared between the threads, so its final value cannot be guaranteed; it depends on how the threads interleave their updates. To fix the bug, a critical pragma must be placed above the increment. The critical pragma marks a section of code that may be executed by only one thread at a time. With 8 threads, the final value of x will then be 8, because each thread executes the increment exactly once.
#pragma omp parallel shared(x)
{
    #pragma omp critical
    x = x + 1;
}
1.5 OpenMP Pragma Completion 3
To execute two blocks of code in parallel we declare a sections pragma, followed by one section block per piece of work. Each section is executed by a single thread, so with two section blocks OpenMP can use two threads, one per block.
#pragma omp parallel   /* note: making i private (private(i)) would avoid the two concurrent sections sharing the loop counter */
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
        #pragma omp section
        for (i = 0; i < N; i++)
            d[i] = a[i] * b[i];
    } /* end of sections */
} /* end of parallel region */
2 Exercise 1: Matrix Multiplication Optimization
This section describes the matrix multiplication experiments.
2.1 Serial Matrix Multiplication
The first experiment implements the standard nested for-loop multiplication over every row and column. The algorithm is evaluated by changing the matrix size and measuring the execution time of the multiplication. The resulting data are listed in Table 8: the larger the matrix, the longer the execution time. A sketch of the serial kernel is shown below.
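The multiplication code is not listed in the report; the following is a minimal sketch of the standard triple-loop kernel described above, assuming square NxN matrices stored as row-major arrays (the names A, B, C and the storage layout are assumptions).

/* Standard serial matrix multiplication sketch (assumed form): C = A * B
   for square N x N matrices stored row-major. */
void matmul(const double *A, const double *B, double *C, int N)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];  /* B is read with stride N (column access) */
            C[i * N + j] = sum;
        }
    }
}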
Table 8: Serial Matrix Multiplication Performance
Num N=100 (ms) N=300 (ms) N=700 (ms)
1 6.80 130.32 1851.77
2 7.01 130.97 1870.68
3 11.52 134.08 1852.36
4 11.91 130.13 1852.32
5 11.15 130.75 1855.69
Avg 9.68 131.25 1856.56
2.2 Serial Matrix Multiplication with Transpose
The second experiment implements the transpose method, which typically transposes one operand first so that the inner loop reads both matrices row by row and therefore uses the cache better. This implementation performs better than the standard matrix multiplication; Table 9 shows the results, and a sketch of the method is given below.
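A minimal sketch of the transpose variant, assuming B is the operand that is transposed into a scratch matrix BT (the report does not show this code, so the exact form is an assumption):

/* Transpose-method sketch (assumed form): BT holds the transpose of B, so
   both A and BT are traversed row by row in the inner product. */
void matmul_transpose(const double *A, const double *B, double *C,
                      double *BT, int N)
{
    for (int i = 0; i < N; i++)          /* build BT = transpose of B */
        for (int j = 0; j < N; j++)
            BT[j * N + i] = B[i * N + j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * BT[j * N + k];  /* contiguous reads of both rows */
            C[i * N + j] = sum;
        }
    }
}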
Table 9: Serial Matrix Multiplication with Transpose Performance
Num N=100 (ms) N=300 (ms) N=700 (ms)
1 7.06 120.99 1529.96
2 6.58 121.08 1530.14
3 7.14 120.97 1529.81
4 6.33 121.09 1530.13
5 8.12 121.02 1531.94
Avg 7.07 121.03 1530.40
2.3 Parallel Matrix Multiplication
The third experiment parallelizes the nested for-loop of the first experiment to speed up its execution. The outer loop is marked with the for directive so that its iterations are distributed over threads; a sketch is given after Table 11. After this optimization the execution time drops significantly compared to the original code. The results are shown in Table 10. The speed-up factor is calculated by dividing the non-optimized execution time by the parallel execution time.
Table 10: Parallel Matrix Multiplication Performance
Num N=100 (ms) N=300 (ms) N=700 (ms)
1 8.52 61.03 713.06
2 9.92 59.65 710.79
3 10.36 65.45 726.01
4 6.14 58.17 736.89
5 5.77 53.07 704.14
Avg 8.14 59.47 718.18
Based on this comparison, the speed-up factor reaches 2.58 when the code multiplies 700x700 matrices. The detailed speed-up values are given in Table 11.
Table 11: Speed-Up Factor of Standard Matrix Multiplication Using Parallel Computation
Matrix Size Non Optimize (ms) Optimize (ms) Speed-Up Factor (X)
100x100 9.68 8.14 1.19
300x300 131.25 59.47 2.21
700x700 1856.56 718.18 2.58
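A minimal sketch of the parallelized kernel, assuming the same triple-loop form as in Section 2.1 and compilation with an OpenMP flag such as gcc -fopenmp (the exact code is not shown in the report):

/* Parallel matrix multiplication sketch (assumed form): the outer loop is
   distributed across threads; i, j, k and sum are private to each thread
   because they are declared inside the parallel loop. */
void matmul_parallel(const double *A, const double *B, double *C, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }
}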
2.4 Parallel Matrix Multiplication with Transpose
The fourth experiment applies the parallel optimization to the transpose method. As in the third experiment, the outer for-loop is parallelized with the for directive. For the largest matrix tested (700x700), this method gives the best execution time of the four, beating the standard multiplication, the serial transpose method, and the parallelized standard multiplication. The results are listed in Table 12, and a sketch is given below.
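A minimal sketch combining the transpose and the parallel loop, again under the same assumptions as the earlier kernels:

/* Parallel transpose-method sketch (assumed form): B is transposed into BT,
   then the outer loop of the multiplication is distributed across threads. */
void matmul_transpose_parallel(const double *A, const double *B, double *C,
                               double *BT, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)          /* the transpose itself can also run in parallel */
        for (int j = 0; j < N; j++)
            BT[j * N + i] = B[i * N + j];

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * BT[j * N + k];
            C[i * N + j] = sum;
        }
    }
}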
Table 12: Parallel Matrix Multiplication With Transpose Performance
Num N=100 (ms) N=300 (ms) N=700 (ms)
1 12.81 120.60 678.88
2 11.78 126.33 679.04
3 13.89 123.17 679.55
4 15.09 63.89 681.10
5 13.34 122.50 680.53
Avg 13.38 111.20 679.82
The speed-up factors of this method are shown in Table 13.
2.5 Overall Matrix Multiplication Performance
The overall performance of the matrix multiplication methods is shown in Table 14. Based on this evaluation, parallel matrix multiplication with the transpose method is the best method for large matrices: multiplying 700x700 matrices takes only 679.82 ms.
Table 13: Speed-Up Factor of Transpose Matrix Multiplication Using Parallel Computation
Matrix Size Non Optimize (ms) Optimize (ms) Speed-Up Factor (X)
100x100 7.07 13.38 0.53
300x300 121.03 111.20 1.09
700x700 1530.40 679.82 2.25
On the other hand, when multiplying a small matrix (100x100), the non-optimized transpose method gives the best performance of all the methods: it needs only 7.07 ms. The experiment therefore shows that parallel optimization gives a significant benefit when it is applied to large matrices or to for-loops with many iterations.
Table 14: Overall Matrix Multiplication Performance
Size MatMul (ms) MatMul Trans (ms) Parallel MatMul (ms) Parallel MatMul Trans (ms)
100 9.68 7.07 8.14 13.38
300 131.25 121.03 59.47 111.20
700 1856.56 1530.40 718.18 679.82