MPSoC Architectures
OpenMP
Alberto Bosio, Associate Professor – UM
Microelectronic Department
[email protected]
Introduction to OpenMP
• What is OpenMP?
• Open specification for Multi-Processing
• “Standard” API for defining multi-threaded shared-memory programs
– www.openmp.org – Talks, examples, forums, etc.
• High-level API
• Preprocessor (compiler) directives ( ~ 80% )
• Library calls ( ~ 19% )
• Environment variables ( ~ 1% )
A Programmer’s View of OpenMP
• OpenMP is a portable, threaded, shared-memory programming specification with “light” syntax
• Exact behavior depends on the OpenMP implementation!
• Requires compiler support (C or Fortran)
• OpenMP will:
– Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrently-executing threads
– Hide stack management
– Provide synchronization constructs
• OpenMP will not:
– Parallelize (or detect!) dependencies
– Guarantee speedup
– Provide freedom from data races
Outline
• Introduction
• Motivating example
– Parallel Programming is Hard
• OpenMP Programming Model
– Easier than PThreads
• Microbenchmark Performance Comparison
– vs. PThreads
• Discussion
– specOMP
Current Parallel Programming
1. Start with a parallel algorithm
2. Implement, keeping in mind:
• Data races
• Synchronization
• Threading Syntax
3. Test & Debug
4. Debug
5. Debug
Motivation – Threading Library
#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) {
  printf( "Hello, world!\n" );
  return NULL;
}

int main() {
  pthread_attr_t attr;
  pthread_t threads[16];
  int tn;
  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  for(tn=0; tn<16; tn++) {
    pthread_create(&threads[tn], &attr, SayHello, NULL);
  }
  for(tn=0; tn<16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}
Motivation
• Thread libraries are hard to use
– P-Threads/Solaris threads have many library calls
for initialization, synchronization, thread creation,
condition variables, etc.
– Programmer must code with multiple threads in
mind
• Synchronization between threads introduces a
new dimension of program correctness
Motivation
• Wouldn’t it be nice to write serial programs and somehow parallelize them “automatically”?
• OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
• OpenMP is a small API that hides cumbersome threading calls with simpler directives
Better Parallel Programming
1. Start with some algorithm
• Embarrassing parallelism is helpful, but not
necessary
2. Implement serially, ignoring:
• Data Races
• Synchronization
• Threading Syntax
3. Test and Debug
4. Automatically (magically?) parallelize
• Expect linear speedup
Motivation – OpenMP
int main() {
// Do this part in parallel
printf( "Hello, World!\n" );
return 0;
}
Motivation – OpenMP
#include <stdio.h>
#include <omp.h>

int main() {
  omp_set_num_threads(16);
  // Do this part in parallel
  #pragma omp parallel
  {
    printf( "Hello, World!\n" );
  }
  return 0;
}
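To build and run this example, OpenMP support must be enabled at compile time; a minimal sketch with GCC (the file name is illustrative, other compilers use different flags):
gcc -fopenmp hello_omp.c -o hello_omp
./hello_omp
Alternatively, the thread count can be set with the OMP_NUM_THREADS environment variable instead of calling omp_set_num_threads() in the code.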
OpenMP Parallel Programming
1. Start with a parallelizable algorithm
• Embarrassing parallelism is good, loop-level
parallelism is necessary
2. Implement serially, mostly ignoring:
• Data Races
• Synchronization
• Threading Syntax
3. Test and Debug
4. Annotate the code with parallelization (and
synchronization) directives
• Hope for linear speedup
5. Test and Debug
Programming Model - Threading
• Serial regions by default, annotate to create parallel regions
– Generic parallel regions
– Parallelized loops
– Sectioned parallel regions
• Thread-like Fork/Join model
– Arbitrary number of logical thread creation/destruction events
[Figure: Fork/Join thread model]
Programming Model - Threading
int main() {
  // serial region
  printf("Hello…");
  // parallel region (Fork)
  #pragma omp parallel
  {
    printf("World");
  }
  // serial again (Join)
  printf("!");
}
Output (4 threads): Hello…WorldWorldWorldWorld!
Programming Model – Nested Threading
• Fork/Join can be nested
– Nesting complication handled “automagically” at compile-time
– Independent of the number of threads actually running
[Figure: nested Fork/Join]
Programming Model – Thread Identification
• Master Thread
– Thread with ID=0
– Only thread that exists in sequential regions
– Depending on implementation, may have special purpose inside parallel regions
– Some special directives affect only the master thread (like master)
[Figure: thread 0 forks a team of threads 0–7, which joins back to thread 0]
Example
#include <stdio.h>
#include <omp.h>

int main() {
  int tid, nthreads;
  omp_set_num_threads(16);
  // Do this part in parallel
  #pragma omp parallel private(nthreads, tid)
  {
    printf( "Hello, World!\n" );
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("I'm the master, Number of threads = %d\n", nthreads);
    }
  }
  return 0;
}
Programming Model – Data/Control Parallelism
• Data parallelism
– Threads perform similar functions, guided by thread identifier
• Control parallelism
– Threads perform differing functions
- One thread for I/O, one for computation, etc…
Programming model: Summary
Memory Model
• Shared memory communication
– Threads cooperate by accessing shared variables
• The sharing is defined syntactically
– Any variable that is seen by two or more threads is shared
– Any variable that is seen by one thread only is private
• Race conditions are possible
– Use synchronization to protect against conflicts
– Change how data is stored to minimize the synchronization
Structure
Programming Model – Concurrent Loops
• OpenMP easily parallelizes loops
– No data dependencies between iterations!
• Preprocessor calculates loop bounds for each thread directly from serial source
#pragma omp parallel for
for( i=0; i < 25; i++ ) {
  printf("Foo");
}
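A slightly more realistic sketch (the function and array names are illustrative): any loop whose iterations touch disjoint data can be annotated the same way.
#include <omp.h>
#define N 1000

void scale(float a[N], float b[N]) {
  int i;
  // Iteration i only reads a[i] and writes b[i]: no dependencies
  // between iterations, so they can be divided among the threads.
  #pragma omp parallel for private(i) shared(a, b)
  for (i = 0; i < N; i++) {
    b[i] = 2.0f * a[i];
  }
}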
The problem
• Executes the same code as many times as there are threads
• How many threads do we have? Set with omp_set_num_threads(n)
• What is the use of repeating the same work n times in parallel? Use omp_get_thread_num() to distribute the work between threads (see the sketch below)
• In the example, D is shared between the threads; i and sum are private
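A minimal sketch of distributing a loop by hand inside a parallel region, following the slide's variable names (the array D, its size, and the summation itself are assumptions):
#include <omp.h>
#define N 1024

double sum_D(const double *D) {
  double total = 0.0;
  #pragma omp parallel shared(D, total)
  {
    int tid = omp_get_thread_num();   // this thread's id
    int nth = omp_get_num_threads();  // size of the team
    double sum = 0.0;                 // private partial sum
    int i;
    // Each thread handles the iterations with i % nth == tid.
    for (i = tid; i < N; i += nth)
      sum += D[i];
    #pragma omp atomic                // atomic update (covered later)
    total += sum;
  }
  return total;
}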
Programming Model – Concurrent Loops
• Load balancing
– If all the iterations execute at the same speed, the processors are used optimally. If some iterations are faster than others, some processors may become idle, reducing the speedup
– We don't always know the distribution of work; we may need to re-distribute it dynamically
• Granularity
– Thread creation and synchronization take time. Assigning work to threads at per-iteration resolution may take more time than the execution itself! The work needs to be coalesced into coarse chunks to overcome the threading overhead
• Trade-off between load balancing and granularity!
Controlling Granularity
• #pragma omp parallel if (expression)
– Can be used to disable parallelization in some cases (when the input is determined to be too small to be beneficially multithreaded)
• #pragma omp parallel num_threads (expression)
– Controls the number of threads used for this parallel region
Programming Model – Loop Scheduling
• The schedule clause determines how loop iterations are divided among the thread team (sketches below)
– static([chunk]) divides iterations statically between threads
- Each thread receives [chunk] iterations, rounding as necessary to account for all iterations
- Default [chunk] is ceil( # iterations / # threads )
– dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes
- Forms a logical work queue, consisting of all loop iterations
- Default [chunk] is 1
– guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation
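Hedged sketches of the three policies on a loop (the loop body work(i) and the bound n are placeholders):
// static: chunks of 4 iterations pre-assigned to the threads round-robin
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < n; i++) work(i);

// dynamic: each thread grabs 8 iterations at a time from a shared queue
#pragma omp parallel for schedule(dynamic, 8)
for (int i = 0; i < n; i++) work(i);

// guided: like dynamic, but the chunk size shrinks as iterations run out
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; i++) work(i);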
Example
• The function TestForPrime (usually) takes little time, but it can take long if the number is indeed a prime
• Solution: use dynamic scheduling, but with chunks (sketch below)
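A minimal sketch, assuming TestForPrime(i) is the slide's (not shown) primality test and that start, end, primes[] and count are the surrounding variables:
#pragma omp parallel for schedule(dynamic, 100)
for (int i = start; i <= end; i++) {
  if (TestForPrime(i)) {     // cheap for most composites, slow for primes
    #pragma omp critical     // the result array is shared, so protect the update
    primes[count++] = i;
  }
}
Handing out chunks of 100 iterations keeps the scheduling overhead low while still letting fast threads take over work from slow ones.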
Work sharing: Sections
• The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team.
• Independent SECTION directives are nested within a SECTIONS directive.
• Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section if it is quick enough and the implementation permits it.
Example
#include <omp.h>
#define N 1000
int main ()
{
  int i;
  float a[N], b[N], c[N], d[N];
  /* Some initializations */
  for (i=0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }
Example (cont.)
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections
{
#pragma omp section
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i < N; i++)
d[i] = a[i] * b[i];
} /* end of sections */
} /* end of parallel section */
}
Data Sharing
• Shared Memory programming model
– Most variables are shared by default
– We can define a variable as private
// Do this part in parallel
#pragma omp parallel private(nthreads, tid)
{
  printf( "Hello, World!\n" );
  if (tid == 0)
  {
    ….
  }
}
Programming Model – Data Sharing
• Parallel programs often employ two types of data
– Shared data, visible to all threads, similarly named
– Private data, visible to a single thread (often stack-allocated)
• PThreads:
– Global-scoped variables are shared
– Stack-allocated variables are private
• OpenMP:
– shared variables are shared
– private variables are private
int bigdata[1024];
void* foo(void* bar) {
  int tid;
  #pragma omp parallel \
      shared ( bigdata ) \
      private ( tid )
  {
    /* Calc. here */
  }
}
Programming Model – Data Sharing
• private:
– A copy of the variable is created for each thread
– No connection between the original variable and the private copies
– Can achieve the same using variables declared inside { }
int i;
#pragma omp parallel for private(i)
for (i=0; i<n; i++) { ... }
Programming Model – Data Sharing
• firstprivate:
– Same as private, but the initial value is copied from the main copy
• lastprivate:
– Same as private, but the last value is copied back to the main copy (sketch below)
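A minimal sketch of both clauses (the variable names are illustrative):
int x = 10, last = -1;
#pragma omp parallel for firstprivate(x) lastprivate(last)
for (int i = 0; i < 100; i++) {
  // each thread starts with its own copy of x initialized to 10 (firstprivate)
  int y = x + i;
  // after the loop, 'last' holds the value written by the sequentially
  // last iteration, i == 99 (lastprivate)
  last = y;
}
// here: x is still 10, last == 109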
Thread private
• Similar to private, but defined per variable
• Declaration immediately after the variable definition; must be visible in all translation units
• Persistent between parallel sections
• Can be initialized from the master's copy with #pragma omp copyin
• More efficient than private, but it is a global variable! (sketch below)
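A minimal sketch of threadprivate with copyin (the counter name is illustrative):
#include <stdio.h>
#include <omp.h>

int counter = 0;                    // one global declaration, but...
#pragma omp threadprivate(counter)  // ...each thread keeps its own persistent copy

int main() {
  counter = 42;                     // set the master's copy
  // copyin initializes every thread's copy from the master's value
  #pragma omp parallel copyin(counter)
  {
    counter += omp_get_thread_num();  // updates this thread's private copy only
    printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
  }
  return 0;
}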
Synchronization
• What should the result be (assuming 2 threads)?
X=0;
#pragma omp parallel
X = X+1;
Synchronization
• 2 is the expected answer, but it can be 1 with unfortunate interleaving
• OpenMP assumes that the programmer knows what he is doing
• Regions of code that are marked to run in parallel are independent; if access collisions are possible, it is the programmer's responsibility to insert protection
Synchronization
• Many of the existing mechanisms for shared-memory programming are available
• OpenMP synchronization constructs
– nowait (turn synchronization off!)
– Single/Master execution
– Critical sections, Atomic updates
– Ordered
– Barriers
– Flush (memory subsystem synchronization)
– Reduction (special case)
Single/Master
• #pragma omp single
– Only one of the threads will execute the following block of code
– The rest will wait for it to complete
– Good for non-thread-safe regions of code (such as I/O)
– Must be used in a parallel region
– Applicable to parallel for sections
Single/Master
• #pragma omp master
– The following block will be executed by the master thread
– No synchronization involved
– Applicable only to parallel sections
#pragma omp parallel
{
  do_preprocessing();
  #pragma omp single
  read_input();
  #pragma omp master
  notify_input_consumed();
  do_processing();
}
Critical Sections
• #pragma omp critical [name]
– Standard critical section functionality
• Critical sections are global in the program
– Can be used to protect a single resource in different functions
• Critical sections are identified by the name
– All the unnamed critical sections are mutually exclusive throughout the program
– All the critical sections having the same name are mutually exclusive with each other (named example below)
Critical Sections
int x=0;
#pragma omp parallel shared(x)
{
#pragma omp critical
x++;
}
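A sketch of named critical sections (the function and counter names are illustrative): two different functions can protect the same logical resource by using the same name.
int hits = 0, misses = 0;

void record_hit(void) {
  #pragma omp critical (stats)   // same name => mutually exclusive with record_miss
  hits++;
}

void record_miss(void) {
  #pragma omp critical (stats)
  misses++;
}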
Ordered
• #pragma omp ordered statement
– Executes the statement in the sequential order of iterations
• Example:
#pragma omp parallel for ordered
for (j=0; j<N; j++) {
  int result = j*j;
  #pragma omp ordered
  printf("computation(%d) = %d\n", j, result);
}
Barrier synchronization
• #pragma omp barrier
– Performs a barrier synchronization between all the threads in a team at the given point
• Example:
#pragma omp parallel
{
  int result = heavy_computation_part1();
  #pragma omp atomic
  sum += result;
  #pragma omp barrier
  heavy_computation_part2(sum);
}
Explicit Locking
• Can be used to pass lock variables around (unlike critical sections!)
• Can be used to implement more involved synchronization constructs
• Functions:
– omp_init_lock(), omp_destroy_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
– The usual semantics (sketch below)
• Use #pragma omp flush to synchronize memory
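A minimal sketch of the lock API guarding a shared counter (the names are illustrative):
#include <stdio.h>
#include <omp.h>

int main() {
  omp_lock_t lock;
  int counter = 0;

  omp_init_lock(&lock);           // create the lock
  #pragma omp parallel shared(counter, lock)
  {
    omp_set_lock(&lock);          // acquire (blocks until the lock is free)
    counter++;                    // protected update of the shared counter
    omp_unset_lock(&lock);        // release
  }
  omp_destroy_lock(&lock);        // free the lock's resources
  printf("counter = %d\n", counter);
  return 0;
}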
Reduction
for (j=0; j<N; j++) {
  sum = sum + a[j]*b[j];
}
• How to parallelize this code?
– sum is not private, but accessing it atomically is too expensive
– Have a private copy of sum in each thread, then add them up
• Use the reduction clause!
– #pragma omp parallel for reduction(+: sum)
– An operator must be used: +, -, *, ... (full sketch below)
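A complete sketch of this dot product with the reduction clause (the initialization values are illustrative):
#include <stdio.h>
#define N 1000

int main() {
  double a[N], b[N], sum = 0.0;
  int j;
  for (j = 0; j < N; j++) { a[j] = j; b[j] = 2.0; }  // illustrative init

  // Each thread accumulates into a private sum initialized to 0;
  // the private copies are combined with '+' at the end of the loop.
  #pragma omp parallel for reduction(+:sum)
  for (j = 0; j < N; j++)
    sum += a[j] * b[j];

  printf("sum = %f\n", sum);
  return 0;
}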
Synchronization Overhead
• Lost time waiting for locks
• Prefer to use structures that are as lock-free as possible!
Summary
• OpenMP is a compiler-based technique to create concurrent code from (mostly) serial code
• OpenMP can enable (easy) parallelization of loop-based code
– Lightweight syntactic language extensions
• OpenMP performs comparably to manually-coded threading
– Scalable
– Portable
• Not a silver bullet for all applications
More Information
• www.openmp.org
– OpenMP official site
• www.llnl.gov/computing/tutorials/openMP/
– A handy OpenMP tutorial
• www.nersc.gov/nusers/help/tutorials/openmp/
– Another OpenMP tutorial and reference
Backup Slides
Syntax, etc
OpenMP Syntax
• General syntax for OpenMP directives
#pragma omp directive [clause…] CR
• Directive specifies type of OpenMP operation
– Parallelization
– Synchronization
– Etc.
• Clauses (optional) modify semantics of the Directive
OpenMP Syntax
• PARALLEL syntax
#pragma omp parallel [clause…] CR
structured_block
Ex:
#pragma omp parallel
{
  printf("Hello!\n");
} // implicit barrier
Output (T=4):
Hello!
Hello!
Hello!
Hello!
OpenMP Syntax
• DO/for syntax (DO – Fortran, for – C)
#pragma omp for [clause…] CR
for_loop
Ex:
#pragma omp parallel
{
  #pragma omp for private(i) shared(x) \
      schedule(static,x/N)
  for(i=0;i<x;i++) printf("Hello!\n");
} // implicit barrier
Note: Must reside inside a parallel section
OpenMP Syntax
More on Clauses
• private() – A variable in the private list is private to each thread
• shared() – Variables in the shared list are visible to all threads
– Implies no synchronization, or even consistency!
• schedule() – Determines how iterations will be divided among threads
– schedule(static, C) – Each thread will be given C iterations
- Usually T*C = Number of total iterations
– schedule(dynamic) – Each thread will be given additional iterations as-needed
- Often less efficient than a carefully considered static allocation
• nowait – Removes the implicit barrier from the end of the block (sketch below)
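A minimal sketch of nowait (the arrays, bounds and the functions f and g are placeholders); it is safe only because the second loop does not depend on the first:
#pragma omp parallel shared(a, b)
{
  // No implicit barrier at the end of this loop...
  #pragma omp for nowait
  for (int i = 0; i < n; i++)
    a[i] = f(a[i]);

  // ...so a thread that finishes its share early starts here immediately.
  #pragma omp for
  for (int j = 0; j < m; j++)
    b[j] = g(b[j]);
}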
OpenMP Syntax
• PARALLEL FOR (combines parallel and for)
#pragma omp parallel for [clause…] CR
for_loop
Ex:
#pragma omp parallel for shared(x) \
    private(i) \
    schedule(dynamic)
for(i=0;i<x;i++) {
  printf("Hello!\n");
}
Example: AddMatrix
Files:
(Makefile)
addmatrix.c // omp-parallelized
matrixmain.c // non-omp
printmatrix.c // non-omp
OpenMP Syntax
• ATOMIC syntax
#pragma omp atomic CR
simple_statement
Ex:
#pragma omp parallel shared(x)
{
  #pragma omp atomic
  x++;
} // implicit barrier
OpenMP Syntax
• CRITICAL syntax
#pragma omp critical CR
structured_block
Ex:
#pragma omp parallel shared(x)
{
#pragma omp critical
{
// only one thread in here
}
} // implicit barrier
OpenMP Syntax
ATOMIC vs. CRITICAL
• Use ATOMIC for “simple statements”
– Can have lower overhead than CRITICAL if HW atomics are leveraged (implementation dep.)
• Use CRITICAL for larger expressions
– May involve an unseen implicit lock
OpenMP Syntax
• MASTER – only Thread 0 executes a block
#pragma omp master CR
structured_block
• SINGLE – only one thread executes a block
#pragma omp single CR
structured_block
• No implied synchronization
OpenMP Syntax
• BARRIER
#pragma omp barrier CR
• Locks
– Locks are provided through omp.h library calls
– omp_init_lock()
– omp_destroy_lock()
– omp_test_lock()
– omp_set_lock()
– omp_unset_lock()
OpenMP Syntax
• FLUSH
#pragma omp flush CR
• Guarantees that threads’ views of memory are consistent
• Why? Recall that with OpenMP directives…
– Code is generated by the directives at compile-time
– Variables are not always declared as volatile
– Keeping variables in registers instead of memory can look like a consistency violation
• Synchronization often has an implicit flush
– ATOMIC, CRITICAL
OpenMP Syntax
• Functions
omp_set_num_threads()
omp_get_num_threads()
omp_get_max_threads()
omp_get_num_procs()
omp_get_thread_num()
omp_set_dynamic()
omp_[init|destroy|test|set|unset]_lock()
Functions for the runtime environment
• omp_set_dynamic(int)
• omp_set_num_threads(int)
• omp_get_num_threads()
• omp_get_num_procs()
• omp_get_thread_num()
• omp_set_nested(int)
• omp_in_parallel()
• omp_get_wtime()
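A minimal sketch using a few of these runtime calls, including omp_get_wtime() for timing (the workload inside the region is a placeholder):
#include <stdio.h>
#include <omp.h>

int main() {
  printf("procs available: %d\n", omp_get_num_procs());
  omp_set_num_threads(4);              // request a team of 4 threads

  double t0 = omp_get_wtime();         // wall-clock time in seconds
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("team size: %d\n", omp_get_num_threads());
    // ... placeholder work ...
  }
  double t1 = omp_get_wtime();

  printf("parallel region took %f s\n", t1 - t0);
  return 0;
}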