OpenMP Tutorial
Seung-Jai Min
([email protected])
School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN
Parallel Programming Standards
• Thread Libraries
- Win32 API / Posix threads
• Compiler Directives (our focus)
- OpenMP (Shared memory programming)
• Message Passing Libraries
- MPI (Distributed memory programming)
Shared Memory Parallel
Programming in the Multi-Core Era
• Desktop and Laptop
– 2, 4, 8 cores and … ?
• A single node in distributed memory clusters
– Steele cluster node: 2 × 8 (16) cores
• Shared memory hardware accelerators
– Cell processor: 1 PPE and 8 SPEs
– NVIDIA Quadro GPUs: 128 processing units
OpenMP:
Some syntax details to get us started
• Most of the constructs in OpenMP are compiler
directives or pragmas.
– For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
– For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
• Include files
#include "omp.h"
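For example, a parallel loop construct with two clauses might look like the following sketch (the variable names are illustrative, not from the slides):

int i, n = 100;
double a[100], b[100], tmp, sum = 0.0;
#pragma omp parallel for private(tmp) reduction(+:sum)
for (i = 0; i < n; i++) {
    tmp = a[i] * b[i];   /* tmp: each thread gets its own copy */
    sum += tmp;          /* sum: per-thread partials combined at the end */
}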
How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
• Find your most time-consuming loops.
• Split them up between threads.
Sequential Program:

void main()
{
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=0; i<N; i++) {
        A[i] = B[i] + k*C[i];
    }
}

Parallel Program:

#include "omp.h"
void main()
{
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
        A[i] = B[i] + k*C[i];
    }
}
How is OpenMP typically used?
(Cont.)
• Single Program Multiple Data (SPMD)
Parallel Program
#include "omp.h"
void main()
{
int i, k, N=1000;
double A[N], B[N], C[N];
#pragma omp parallel for
for (i=0; i<N; i++) {
A[i] = B[i] + k*C[i];
}
}
How is OpenMP typically used?
(Cont.)
• Single Program Multiple Data (SPMD)
Each of the four threads runs the same program on its own quarter of the iteration space:

void main()
{
    int i, k, N=1000;
    double A[N], B[N], C[N];
    lb = ...;  ub = ...;   /* per-thread loop bounds, listed below */
    for (i=lb; i<ub; i++) {
        A[i] = B[i] + k*C[i];
    }
}

Thread 0: lb = 0,   ub = 250
Thread 1: lb = 250, ub = 500
Thread 2: lb = 500, ub = 750
Thread 3: lb = 750, ub = 1000
OpenMP Fork-and-Join model
printf("program begin\n");       /* serial */
N = 1000;
#pragma omp parallel for
for (i=0; i<N; i++)              /* parallel */
    A[i] = B[i] + C[i];
M = 500;                         /* serial */
#pragma omp parallel for
for (j=0; j<M; j++)              /* parallel */
    p[j] = q[j] - r[j];
printf("program done\n");        /* serial */
OpenMP Constructs
1. Parallel Regions
– #pragma omp parallel
2. Worksharing
– #pragma omp for, #pragma omp sections
3. Data Environment
– #pragma omp parallel shared/private (…)
4. Synchronization
– #pragma omp barrier
5. Runtime functions/environment variables
– int my_thread_id = omp_get_thread_num();
– omp_set_num_threads(8);
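A minimal sketch of how these runtime calls fit together (the calls are the standard OpenMP API; the program itself is illustrative):

#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_num_threads(8);                   /* request 8 threads */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();        /* this thread's id: 0..team-1 */
        int nthr = omp_get_num_threads();     /* team size inside the region */
        printf("thread %d of %d\n", id, nthr);
    }
    return 0;
}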
OpenMP: Structured blocks
• Most OpenMP constructs apply to structured
blocks.
– Structured block: one point of entry at the top and
one point of exit at the bottom.
– The only “branches” allowed are STOP statements in
Fortran and exit() in C/C++.
OpenMP: Structured blocks
A Structured Block:

#pragma omp parallel
{
    more: do_big_job(id);
    if (++count > 1) goto more;
}
printf(" All done \n");

Not A Structured Block:

if (count == 1) goto more;
#pragma omp parallel
{
    more: do_big_job(id);
    if (++count > 1) goto done;
}
done: if (!really_done()) goto more;
Structured Block Boundaries
• In C/C++: a block is a single statement or a group of
statements between braces {}

#pragma omp parallel
{
    id = omp_get_thread_num();
    A[id] = big_compute(id);
}

#pragma omp for
for (I=0; I<N; I++) {
    res[I] = big_calc(I);
    A[I] = B[I] + res[I];
}
Structured Block Boundaries
• In Fortran: a block is a single statement or a group of
statements between directive/end-directive pairs.
C$OMP PARALLEL
10    W(id) = garbage(id)
      res(id) = W(id)**2
      if (res(id) .gt. 1.0) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do I=1,N
          res(I) = bigComp(I)
      end do
C$OMP END PARALLEL DO
OpenMP Parallel Regions
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
printf("all done\n");

• omp_set_num_threads(4) forks a team of four threads; each calls pooh(ID, A)
with its own ID: pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A).
• A single copy of "A" is shared between all threads.
• Implicit barrier: threads wait at the end of the parallel region for all
threads to finish before proceeding to printf("all done\n").
The OpenMP API
Combined parallel work-share
• OpenMP shortcut: Put the “parallel” and the
work-share on the same line
int i;
double res[MAX];
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<MAX; i++) {
        res[i] = huge();
    }
}

int i;
double res[MAX];
#pragma omp parallel for
for (i=0; i<MAX; i++) {
    res[i] = huge();
}

These two codes are the same OpenMP program.
Shared Memory Model
[Figure: five threads, each with its own private memory, all connected to a
single shared memory.]

• Data can be shared or private
• Shared data is accessible by all threads
• Private data can be accessed only by the thread that owns it
• Data transfer is transparent to the programmer
Data Environment:
Default storage attributes
• Shared Memory programming model
– Variables are shared by default
• Distributed Memory Programming Model
– All variables are private
Data Environment:
Default storage attributes
• Global variables are SHARED among threads
– Fortran: COMMON blocks, SAVE variables, MODULE variables
– C: File scope variables, static
• But not everything is shared...
– Stack variables in sub-programs called from parallel regions
are PRIVATE
– Automatic variables within a statement block are PRIVATE.
Data Environment
int A[100][100];            /* global: SHARED */

int foo(int x)
{
    int count = 0;          /* stack variable in a called routine: PRIVATE */
    return x*count;
}

int main()
{
    int ii, jj;             /* loop indices: PRIVATE */
    int B[100][100];        /* SHARED */
    #pragma omp parallel private(jj)
    {
        int kk = 1;         /* declared inside the region: PRIVATE */
        #pragma omp for
        for (ii=0; ii<N; ii++)
            for (jj=0; jj<N; jj++)
                A[ii][jj] = foo(B[ii][jj]);
    }
}
Work Sharing Construct
Loop Construct
#pragma omp for [clause[[,] clause] …] new-line
for-loops
Where clause is one of the following (a combined example follows the list):
private / firstprivate / lastprivate(list)
reduction(operator: list)
schedule(kind[, chunk_size])
collapse(n)
ordered
nowait
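Several of these clauses can be combined on one loop construct; a sketch (the arrays a, b, c and their bounds are illustrative, not from the slides):

int i, j;
double a[100][100], b[100][100], c[100][100];
#pragma omp parallel
{
    /* dynamic chunks of 4 iterations over the two collapsed loops; nowait
       removes the barrier at the end of this worksharing loop (not of the
       enclosing parallel region) */
    #pragma omp for schedule(dynamic, 4) collapse(2) nowait
    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            c[i][j] = a[i][j] + b[i][j];
}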
Schedule
for (i=0; i<1100; i++)
    A[i] = ... ;

#pragma omp parallel for schedule (static, 250)
    Chunks of 250, 250, 250, 250, and 100 iterations go to p0, p1, p2, p3,
    and p0 again in round-robin order.

#pragma omp parallel for schedule (static)
    One chunk of 275 iterations per thread: p0, p1, p2, p3.

#pragma omp parallel for schedule (dynamic, 200)
    Chunks of 200 (the last one 100) are handed out as threads become free,
    e.g. p3, p0, p2, p3, p1, p0.

#pragma omp parallel for schedule (guided, 100)
    Chunk sizes shrink toward the 100-iteration minimum, with a final
    remainder chunk: 137, 120, 105, 100, 100, 100, 100, 100, 100, 100, 38,
    assigned e.g. p0, p3, p0, p1, p2, p3, p0, p1, p2, p3, p0.

#pragma omp parallel for schedule (auto)
    The choice of schedule is left to the compiler/runtime.
Critical Construct
sum = 0;
#pragma omp parallel private (lsum)
{
lsum = 0;
#pragma omp for
for (i=0; i<N; i++) {
lsum = lsum + A[i];
}
#pragma omp critical
{ sum += lsum; }
}

Threads wait their turn; only one thread at a time executes the critical
section.
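For a single scalar update like this one, a lighter-weight alternative (not covered on this slide) is the atomic construct:

#pragma omp atomic
sum += lsum;    /* atomic read-modify-write instead of a full critical section */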
Reduction Clause
sum = 0;    /* shared variable */
#pragma omp parallel for reduction (+:sum)
for (i=0; i<N; i++)
{
sum = sum + A[i];
}
Performance Evaluation
• How do we measure performance? (or
how do we remove noise?)
#define N 24000
for (k=0; k<10; k++)
{
#pragma omp parallel for private(i, j)
for (i=1; i<N-1; i++)
for (j=1; j<N-1; j++)
a[i][j] = (b[i][j-1]+b[i][j+1])/2.0;
}
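One common approach (a sketch; timing each repetition and reporting the minimum are assumptions, not from the slide) is to wrap the kernel in omp_get_wtime() and let the repetitions filter out noise:

double t, best = 1.0e30;
for (k=0; k<10; k++) {
    t = omp_get_wtime();
    #pragma omp parallel for private(i, j)
    for (i=1; i<N-1; i++)
        for (j=1; j<N-1; j++)
            a[i][j] = (b[i][j-1]+b[i][j+1])/2.0;
    t = omp_get_wtime() - t;
    if (t < best) best = t;   /* minimum over runs discards warm-up and noise */
}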
Performance Issues

• What if you see a speedup saturation?

[Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); the curve flattens as
CPUs are added.]
#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
for (j=1; j<N-1; j++)
a[i][j] = (b[i][j-1]+b[i][j]+b[i][j+1]
          +b[i-1][j]+b[i+1][j])/5.0;
Performance Issues

• What if you see a speedup saturation?

[Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); again the curve
flattens as CPUs are added.]
#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
for (j=1; j<N-1; j++)
a[i][j] = b[i][j];
Loop Scheduling
• Any guideline for a chunk size?
#define N <big-number>
chunk = ???;
#pragma omp parallel for schedule (static, chunk)
for (i=1; i<N-1; i++)
a[i] = ( b[i-2] + b[i-1] + b[i]
       + b[i+1] + b[i+2] )/5.0;
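One rule of thumb (an assumption, not stated on the slide): pick a chunk that is a multiple of the number of elements per cache line, so threads working on adjacent chunks do not false-share the boundary elements of a; with 64-byte lines and 8-byte doubles, that means a multiple of 8, e.g.:

chunk = 1024;   /* hypothetical: a multiple of the 8 doubles per 64-byte cache
                   line, and large enough to amortize scheduling overhead */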
Performance Issues
• Load imbalance: triangular access pattern
#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
for (j=i; j<N-1; j++)
a[i][j] = (b[i][j-1]+b[i][j]+b[i][j+1]
          +b[i-1][j]+b[i+1][j])/5.0;
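Because the inner loop starts at j=i, early rows carry far more work than late ones, so equal static blocks leave the last threads idle. One hedged fix (not shown on the slide) is an interleaved or dynamic schedule:

#pragma omp parallel for private(j) schedule(static, 1)  /* or schedule(dynamic, 16) */
for (i=1; i<N-1; i++)
    for (j=i; j<N-1; j++)
        a[i][j] = (b[i][j-1]+b[i][j]+b[i][j+1]
                  +b[i-1][j]+b[i+1][j])/5.0;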
Summary
• OpenMP has advantages
– Incremental parallelization
– Compared to MPI:
• No data partitioning
• No communication scheduling
Resources
http://www.openmp.org
http://openmp.org/wp/resources