OpenMP for Intranode Programming
Synchronization, Data-Sharing Environment,
and Runtime Library and Environment Variables
These slides were originally written by Dr. Barbara Chapman, University of Houston
OpenMP Memory Model
OpenMP assumes a shared memory
Threads communicate by sharing variables.
Synchronization protects against data conflicts.
Synchronization is expensive.
Change how data is accessed to minimize the need for synchronization.
OpenMP Syntax
Most OpenMP constructs are compiler directives
For C and C++, they are pragmas with the form:
#pragma omp construct [clause [clause]…]
For Fortran, the directives may have fixed or free form:
*$OMP construct [clause [clause]…]
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
Include file (C/C++) and the OpenMP library module (Fortran):
#include <omp.h>
use omp_lib
Most OpenMP constructs apply to a “structured block”.
A block of one or more statements with one point of entry at the top
and one point of exit at the bottom.
It’s OK to have an exit() within the structured block.
OpenMP sentinel forms: #pragma omp (C/C++) and !$OMP (Fortran)
OpenMP schedule Clause
The schedule clause affects how loop iterations are mapped onto threads
schedule ( static | dynamic | guided [, chunk] )
schedule ( auto | runtime )
static    Distribute iterations in blocks of size "chunk" over the
          threads in round-robin fashion
dynamic   Fixed-size portions of work; the size is controlled by the value
          of chunk. When a thread finishes one portion, it starts on the
          next portion of work
guided    Same dynamic behavior as "dynamic", but the size of the portion
          of work decreases exponentially
auto      The compiler (or runtime system) decides what is best to use;
          the choice could be implementation dependent
runtime   The iteration scheduling scheme is set at run time through the
          environment variable OMP_SCHEDULE
Example of a Static Schedule
A loop of length 16 using 4 threads:

              Thread 0    Thread 1     Thread 2    Thread 3
no chunk *    1-4         5-8          9-12        13-16
chunk = 2     1-2, 9-10   3-4, 11-12   5-6, 13-14  7-8, 15-16

*) The precise distribution is implementation defined
The Schedule Clause

Schedule   When To Use
STATIC     Work per iteration is pre-determined and predictable by the
           programmer. Least work at runtime: scheduling is done at
           compile time.
DYNAMIC    Unpredictable, highly variable work per iteration. Most work
           at runtime: complex scheduling logic is used at run time.
GUIDED     Special case of dynamic, to reduce scheduling overhead.
OpenMP Synchronization
Synchronization enables the user to
Control the ordering of executions in different threads
Ensure that at most one thread executes an operation or
region of code at any given time (mutual exclusion)
High level synchronization:
critical section
atomic
barrier
ordered
Low level synchronization:
flush
locks (both simple and nested)
Barrier
We need to update all of a[ ] before using a[ ] *

    for (i=0; i < N; i++)
        a[i] = b[i] + c[i];

    /* wait! -- barrier */

    for (i=0; i < N; i++)
        d[i] = a[i] + b[i];

All threads wait at the barrier point and only continue when all
threads have reached the barrier point.
*) If the mapping of iterations onto threads is guaranteed to be
identical for both loops, we do not need to wait in this case
Barrier
(figure: in a barrier region, threads that arrive early sit idle until the
last thread arrives)
Barrier syntax in OpenMP:
#pragma omp barrier (C/C++)        !$omp barrier (Fortran)
Barrier
Each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
#pragma omp barrier
#pragma omp for
    for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
    /* implicit barrier at the end of a for worksharing construct */
#pragma omp for nowait
    for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
    /* no implicit barrier due to nowait */
    A[id] = big_calc4(id);
}   /* implicit barrier at the end of a parallel region */
The Nowait Clause
Barriers are implied at the end of parallel regions and of
for/do, sections and single constructs.
The barrier can be suppressed by using the
optional nowait clause.
If present, threads do not synchronize/wait at the
end of that particular construct.

C/C++:                        Fortran:
#pragma omp for nowait        !$omp do
{                                 :
    :                             :
}                             !$omp end do nowait
Critical Section
Mutual exclusion: code may only be executed by
at most one thread at any given time.
This could lead to long wait times for the other threads.
Use atomic updates for individual operations, and
critical regions or locks for structured regions of code.
(figure: threads take turns passing through a critical region over time)
Critical Region (Section)
Only one thread at a time can enter a critical region.

float RES;
#pragma omp parallel
{
    float B; int i;
#pragma omp for
    for (i = 0; i < niters; i++) {
        B = big_job(i);
#pragma omp critical
        consume(B, RES);   /* threads wait their turn; only one at a time calls consume() */
    }
}

Use e.g. when all threads update a variable; if the order in which they do so is
unimportant, we need to ensure that they do not do it at the same time.
Atomic
Atomic is a special case of mutual exclusion: it applies only to the
update of a memory location x.

The statement inside the atomic must be one of:
    x binop= expr
    x = x binop expr
    x = expr binop x
    x++    ++x    x--    --x

x is an lvalue of scalar type and binop is a non-overloaded built-in
operator.

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      TMP = BIG_UGLY()
C$OMP ATOMIC
      X = X + TMP
C$OMP END PARALLEL

OpenMP 3.1 describes the behavior in more detail via these clauses:
read, write, update, capture. The pre-3.1 atomic construct is
equivalent to
#pragma omp atomic update
Ordered
The ordered construct enforces sequential order for a block:
the code is executed in the order in which the iterations would be
performed sequentially.
The worksharing construct has to carry the ordered clause.

#pragma omp parallel private(tmp)
#pragma omp for ordered
for (i = 0; i < N; i++) {
    tmp = NEAT_STUFF(i);
#pragma omp ordered
    res += consum(tmp);
}
Updates to Shared Data
Blocks of data are fetched into cache lines.
Values may temporarily differ from other copies of the
data within a parallel region.
(figure: a shared variable a in shared memory, with a copy of a in each of
cache1 ... cacheN, attached to proc1 ... procN)
The Flush Directive
The flush construct denotes a sequence point where
a thread tries to create a consistent view of memory
for specified variables.
All memory operations (both reads and writes) defined
prior to the sequence point must complete.
All memory operations (both reads and writes) defined
after the sequence point must follow the flush.
Variables in registers or write buffers must be updated
in memory.
Arguments to flush specify which variables are
flushed.
If no arguments are specified, all thread-visible
variables are flushed.
What Else Does Flush Influence?
The flush operation does not
actually synchronize different
threads. It just ensures that a
thread’s values are made
consistent with main memory.
Something to note:
Compilers reorder instructions to better exploit the functional
units and keep the machine busy
Flush prevents the compiler from doing the following:
Reorder read/writes of variables in a flush set relative to a flush.
Reorder flush constructs when flush sets overlap.
A compiler CAN do the following:
Reorder instructions NOT involving variables in the flush set
relative to the flush.
Reorder flush constructs that don’t have overlapping flush sets.
A Flush Example
Pairwise synchronization:

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1      ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)        ! make sure other threads can see my write
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)        ! make sure the read picks up a good copy from memory
      END DO
C$OMP END PARALLEL

Note: flush is analogous to a fence in other shared
memory APIs.
Implied Flush
Flushes are implicitly performed during execution:
In a barrier region
At exit from worksharing regions, unless a nowait is present
At entry to and exit from parallel, critical, ordered and parallel
worksharing regions
During omp_set_lock and omp_unset_lock regions
During omp_test_lock, omp_set_nest_lock, omp_unset_nest_lock
and omp_test_nest_lock regions, if the region
causes the lock to be set or unset
Immediately before and after every task scheduling point
At entry to and exit from atomic regions, where the list
contains only the variable updated in the atomic construct
But not on entry to a worksharing region, or on entry to/exit from
a master region
Managing the data environment
Data-Sharing Attributes
In OpenMP code, data needs to be “labeled”
There are two basic types:
Shared – there is only one instance of the data
Threads can read and write the data simultaneously
unless protected through a specific construct
All changes made are visible to all threads
– But not necessarily immediately, unless this is enforced (e.g. with flush)
Private - Each thread has a copy of the data
No other thread can access this data
Changes only visible to the thread owning the data
OpenMP Data Environment
Most variables are shared by default
Global variables are SHARED among threads
Fortran: COMMON blocks, SAVE variables, MODULE
variables
C: File scope variables, static
But not everything is shared by default...
Stack variables in sub-programs called from parallel
regions are PRIVATE
Automatic variables defined inside the parallel region are
PRIVATE.
The default status can be modified with:
DEFAULT (PRIVATE | SHARED | NONE)
All data clauses apply to parallel regions and worksharing constructs
except "shared", which only applies to parallel regions.
About Storage Association
Private variables are undefined on entry to and
exit from the parallel region
A private variable within a parallel region has
no storage association with the same variable
outside of the region
Use the firstprivate and lastprivate clauses
to override this behavior
We illustrate these concepts with an example
OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private(b)
{ .... }

Shared data: a[size][size] — one instance, visible to all threads.
Private data: each of the threads T0 ... T3 has its own copy of b
(e.g. b = 6 in T0, b = 8 in T1); b becomes undefined on exit from
the region.
OpenMP Data Environment

      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      ...

A and index are shared by all threads.
temp is local to each thread.
OpenMP Private Clause
private(var) creates a local copy of var for each thread.
The value is uninitialized.
The private copy is not storage-associated with the original.
The original is undefined at the end.

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
         IS = IS + J        ! wrong: IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS           ! IS is undefined here, regardless of initialization
(In)Visibility of Private Data

#pragma omp parallel private(x) shared(p0, p1)

Thread 0:              Thread 1:
x = ...;               x = ...;
p0 = &x;               p1 = &x;
/* references in the following line are not allowed */
... *p1 ...            ... *p0 ...

You cannot reference another thread's private variables, even if you have a
shared pointer between the two threads.
The Firstprivate And Lastprivate Clauses
firstprivate (list)
All variables in the list are initialized with the
value the original object had before entering
the parallel construct
lastprivate (list)
The thread that executes the sequentially last
iteration or section updates the value of the
objects in the list
Firstprivate Clause
firstprivate is a special case of private: it initializes each
private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
         IS = IS + J
 20   CONTINUE
C$OMP END PARALLEL DO
      print *, IS

Each thread gets its own IS with an initial value of 0.
Regardless of initialization, IS is undefined at the print statement.
Lastprivate Clause
lastprivate passes the value of a private variable
from the sequentially last iteration to the variable of the master
thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP&            LASTPRIVATE(IS)
      DO 20 J=1,1000
         IS = IS + J        ! are you sure this gives the sum?
 20   CONTINUE
C$OMP END PARALLEL DO
      print *, IS

Each thread gets its own IS with an initial value of 0.
IS is defined as its value at the sequentially last
iteration (i.e. for J=1000).
A Data Environment Checkup
Consider this example of PRIVATE and FIRSTPRIVATE:

C     variables A, B and C = 1
C$OMP PARALLEL PRIVATE(B)
C$OMP&         FIRSTPRIVATE(C)

Are A, B and C local to each thread, or shared inside the parallel region?
What are their initial values inside, and their values after, the parallel region?
A Data Environment Checkup
Consider this example of PRIVATE and FIRSTPRIVATE:

C     variables A, B and C = 1
C$OMP PARALLEL PRIVATE(B)
C$OMP&         FIRSTPRIVATE(C)

Are A, B and C local to each thread, or shared inside the parallel region?
What are their initial values inside, and their values after, the parallel region?

Inside this parallel region ...
"A" is shared by all threads; it equals 1.
"B" and "C" are local to each thread.
– B's initial value is undefined
– C's initial value equals 1
Outside this parallel region ...
The values of "B" and "C" are undefined.
OpenMP Reduction
We have already seen how each thread can get its own IS. If it's the
sum of all J values that you need, there is a way to do that too:

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS

The result variable is shared by default.
OpenMP Reduction
Combines an accumulation operation across threads:
reduction (op : list)
Inside a parallel or work-sharing construct:
A local copy of each list variable is made and initialized
depending on the “op” (e.g. 0 for “+”).
Compiler finds standard reduction expressions containing “op”
and uses them to update the local copy.
Local copies are reduced into a single value and combined with
the original global value.
The variables in “list” must be shared in the enclosing
parallel region.
The Reduction Clause

reduction ( operator : list )                  C/C++
reduction ( {operator | intrinsic} : list )    Fortran

Reduction variable(s) must be shared variables.
A reduction is defined as (check the specs for details):

Fortran                            C/C++
x = x operator expr                x = x operator expr
x = expr operator x                x = expr operator x
x = intrinsic (x, expr_list)       x++, ++x, x--, --x
x = intrinsic (expr_list, x)       x <binop>= expr
"min" and "max" intrinsics

Note that the value of a reduction variable is undefined
from the moment the first thread reaches the clause until
the operation has completed.
The reduction can be hidden in a function call.
Reduction Example
Remember the code we used to demo private,
firstprivate and lastprivate:

      program closer
      IS = 0
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS

      program closer
      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS
Example - The Reduction Clause

      sum = 0.0
!$omp parallel default(none) &
!$omp          shared(n,x) private(i)
!$omp do reduction (+:sum)
      do i = 1, n
         sum = sum + x(i)
      end do
!$omp end do
!$omp end parallel
      print *, sum

Variable SUM is a shared variable. Care needs to be taken when
updating shared variable SUM. With the reduction clause, the
OpenMP compiler generates code that avoids a race condition.
Reduction Operands/Initial Values
Associative operands can be used with reduction.
Initial values are the ones that make sense mathematically:

Operand         Initial value         Operand         Initial value
+               0                     .OR. / ||       .FALSE. / 0
*               1                     MAX             smallest number
-               0                     MIN             largest number
.AND. / &&      .TRUE. / 1            & (bit-and)     all bits 1
The Default Clause

default ( none | shared )                               C/C++
default ( none | shared | private | firstprivate )      Fortran

none
    No implicit defaults; you have to scope all variables explicitly
shared
    All variables are shared;
    the default in the absence of an explicit "default" clause
private
    All variables are private to the thread;
    includes common block data, unless THREADPRIVATE
firstprivate
    All variables are private to the thread; pre-initialized
Default Clause Example
Are these two codes equivalent?

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ...
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ...
C$OMP END PARALLEL
Default Clause Example
Are these two codes equivalent?

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ...
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ...
C$OMP END PARALLEL

Yes, they are equivalent.
OpenMP Threadprivate
Makes global data private to a thread and persistent,
thus crossing parallel region boundaries
Fortran: COMMON blocks
C: File scope and static variables
Different from making them PRIVATE
With PRIVATE, global variables are masked.
THREADPRIVATE preserves global scope within each thread
Threadprivate variables can be initialized using COPYIN
or by using DATA statements.
Some limitations on use of threadprivate
Consult specification before using
A Threadprivate Example
Consider two different routines called within a parallel region:

      subroutine poo                     subroutine bar
      parameter (N=1000)                 parameter (N=1000)
      common/buf/A(N),B(N)               common/buf/A(N),B(N)
!$OMP THREADPRIVATE(/buf/)         !$OMP THREADPRIVATE(/buf/)
      do i=1, N                          do i=1, N
         B(i) = const * A(i)                A(i) = sqrt(B(i))
      end do                             end do
      return                             return
      end                                end

Because of the threadprivate construct, each thread executing these routines has
its own copy of the common block /buf/.
Threadprivate/Copyin
You initialize threadprivate data using a copyin clause:

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

C$OMP PARALLEL COPYIN(A)
      ... ! each thread sees threadprivate array A initialized
      ... ! to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ...
C$OMP PARALLEL
      ... ! values of threadprivate data are persistent across parallel regions
C$OMP END PARALLEL
The Copyin Clause
copyin (list)
Applies to THREADPRIVATE common blocks only.
At the start of the parallel region, the data of the master thread is
copied to the thread-private copies.

Example:
      common /cblock/velocity
      common /fields/xfield, yfield, zfield

! create thread private common blocks
!$omp threadprivate (/cblock/, /fields/)

!$omp parallel &
!$omp default (private) &
!$omp copyin ( /cblock/, zfield )
! the copied-in data is now available to all threads
Copyprivate
Used with a single region to broadcast values of private
variables from one member of a team to the rest of the team.

#include <omp.h>
void input_parameters (int*, int*); /* fetch values of input parameters */
void do_work(int, int);

int main()
{
    int Nsize, choice;
#pragma omp parallel private (Nsize, choice)
    {
        .....
#pragma omp single copyprivate (Nsize, choice)
        input_parameters (&Nsize, &choice);

        do_work(Nsize, choice);
    }
}
Fortran - Allocatable Arrays
Fortran allocatable arrays whose status is
“currently allocated” are allowed to be specified as
private, lastprivate, firstprivate, reduction, or copyprivate
integer, allocatable,dimension (:) :: A
integer i
allocate (A(n))
!$omp parallel private (A)
do i = 1, n
A(i) = i
end do
...
!$omp end parallel
C++ and Threadprivate
OpenMP 3.0 clarified where/how threadprivate objects are
constructed and destructed.
It also allows C++ static class members to be threadprivate:

class T {
public:
    static int i;
#pragma omp threadprivate(i)
    ...
};
The runtime library and environment
variables
OpenMP Runtime Functions
OpenMP provides a set of runtime functions
They all start with “omp_”
These functions can be used to:
Query for a specific feature
E.g. what is my thread ID?
Change a setting
E.g. to change the number of threads in next parallel
region
A special category consists of the locking
functions
C/C++ : Need to include file <omp.h>
Fortran : Add “use omp_lib” or include file “omp_lib.h”
OpenMP Library Routines
Modify/Check the number of threads
omp_set_num_threads(), omp_get_num_threads(),
omp_get_thread_num(), omp_get_max_threads()
Are we in a parallel region?
omp_in_parallel()
How many processors in the system?
omp_get_num_procs()
OpenMP Library Routines
To use a known, fixed number of threads in a program:
(1) tell the system that you don't want dynamic adjustment of the
number of threads, (2) set the number of threads, then (3) save the
number you got.

#include <omp.h>
int main()
{
    int num_threads;

    /* disable dynamic adjustment of the number of threads */
    omp_set_dynamic( 0 );
    /* request as many threads as you have processors */
    omp_set_num_threads( omp_get_num_procs() );

#pragma omp parallel
    {
        int id = omp_get_thread_num();
        /* protect this op, since memory stores are not atomic */
#pragma omp single
        num_threads = omp_get_num_threads();
        do_lots_of_stuff(id);
    }
}

Even in this case, the system may give you fewer threads
than requested. If the precise # of threads matters, test for
it and respond accordingly.
OpenMP Runtime Functions

Name                   Functionality
omp_set_num_threads    Set number of threads
omp_get_num_threads    Number of threads in team
omp_get_max_threads    Max num of threads for parallel region
omp_get_thread_num     Get thread ID
omp_get_num_procs      Number of processors available
omp_in_parallel        Check whether in parallel region
omp_set_dynamic        Activate dynamic thread adjustment
                       (but implementation is free to ignore this)
omp_get_dynamic        Check for dynamic thread adjustment
omp_set_nested         Activate nested parallelism
                       (but implementation is free to ignore this)
omp_get_nested         Check for nested parallelism
omp_get_wtime          Returns wall clock time
omp_get_wtick          Number of seconds between clock ticks
OpenMP Runtime Functions

Name                          Functionality
omp_set_schedule              Set schedule (if "runtime" is used)
omp_get_schedule              Returns the schedule in use
omp_get_thread_limit          Max number of threads for program
omp_set_max_active_levels     Set number of active parallel regions
omp_get_max_active_levels     Number of active parallel regions
omp_get_level                 Number of nested parallel regions
omp_get_active_level          Number of nested active parallel regions
omp_get_ancestor_thread_num   Thread id of ancestor thread
omp_get_team_size (level)     Size of the thread team at this level
omp_in_final                  Check whether in final task or not
OpenMP Environment Variables
Set the default number of threads to use.
OMP_NUM_THREADS int_literal
Control how “omp for schedule(RUNTIME)”
loop iterations are scheduled.
OMP_SCHEDULE “schedule[, chunk_size]”
OpenMP Environment Variables/1

OpenMP Environment Variable           Default (Oracle Solaris Studio)
OMP_NUM_THREADS int                   2
OMP_SCHEDULE "schedule[,chunk]"       static, "N/P"
OMP_DYNAMIC {TRUE | FALSE}            TRUE
OMP_NESTED {TRUE | FALSE}             FALSE
OMP_STACKSIZE "size [B|K|M|G]"        4 MB (32-bit) / 8 MB (64-bit)
OMP_WAIT_POLICY [ACTIVE | PASSIVE]    PASSIVE
OMP_MAX_ACTIVE_LEVELS                 4

The names are in uppercase; the values are case insensitive.
Be careful when relying on defaults, because they are
compiler dependent.
OpenMP Environment Variables/2

OpenMP Environment Variable           Default (Oracle Solaris Studio)
OMP_THREAD_LIMIT                      1024
OMP_PROC_BIND {TRUE | FALSE}          FALSE