CS-402 Parallel and Distributed Systems
Fall 2024
Quick Review Lecture No. 05
Modern CPU core architecture
o Superscalar (pipeline, multiple issues)
o Branch prediction
o Out of order execution
o Many execution units
o Memory hierarchy
Implications for software performance
Locality – temporal locality and spatial locality
COMMUNICATION MODEL OF PARALLEL PLATFORMS
There are two primary forms of data exchange between parallel tasks – accessing a shared
data space and exchanging messages.
1. Platforms that provide a shared data space are called shared address-space machines or
multiprocessors.
2. Platforms that support explicit message exchange are called message-passing platforms or
multicomputers.
SHARED-ADDRESS-SPACE PLATFORMS
Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared-address-space.
• If the time taken by a processor to access any memory word in the system (whether global or local) is
identical, the platform is classified as a uniform memory access (UMA) machine; otherwise it is a
non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms:
The distinction between NUMA and UMA platforms is important from the point of view of algorithm
design: NUMA machines require locality in the underlying algorithms for good performance. Programming
these platforms is easier since reads and writes are implicitly visible to other processors.
However, read-write access to shared data must be coordinated (this will be discussed in greater detail
when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache
coherence problem.
• A weaker model of these machines provides an address map, but not coordinated access. These
models are called non-cache-coherent shared-address-space machines.
SHARED-ADDRESS-SPACE VS. SHARED MEMORY MACHINES
It is important to note the difference between the terms shared address space and shared
memory. We refer to the former as a programming abstraction and to the latter as a physical
machine attribute. It is possible to provide a shared address space using a physically
distributed memory.
Physical Organization of Parallel Platforms
We begin this discussion with an ideal parallel machine called Parallel Random Access Machine,
or PRAM.
Architecture of an Ideal Parallel Computer
A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel
Random Access Machine, or PRAM.
PRAMs consist of p processors and a global memory of unbounded size that is uniformly
accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.
ARCHITECTURE OF AN IDEAL PARALLEL COMPUTER
Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four
subclasses.
• Exclusive-read, exclusive-write (EREW) PRAM.
• Concurrent-read, exclusive-write (CREW) PRAM
• Exclusive-read, concurrent-write (ERCW) PRAM.
• Concurrent-read, concurrent-write (CRCW) PRAM.
In a CRCW PRAM, concurrent writes to the same location must be resolved by one of the following protocols:
– Common: write only if all values are identical.
– Arbitrary: write the data from an arbitrarily selected processor.
– Priority: follow a predetermined priority order.
– Sum: write the sum of all data items.
A GENERIC PARALLEL ARCHITECTURE
[Figure: a set of processors (Proc) and memory modules (Memory) connected by an interconnection network.]
• Where is the memory physically located?
• Is it connected directly to processors?
• What is the connectivity of the network?
A BRIEF HISTORY OF PARALLEL LANGUAGES
• WHEN VECTOR MACHINES WERE KING
– PARALLEL “LANGUAGES” WERE LOOP ANNOTATIONS (IVDEP)
– MAPPING TO MPPS/CLUSTERS: PERFORMANCE WAS FRAGILE, AND GOOD USER SUPPORT PROVED HARD
• WHEN SIMD MACHINES WERE KING
– DATA PARALLEL LANGUAGES WERE POPULAR AND SUCCESSFUL (CMF, *LISP, C*, …)
– IRREGULAR DATA (E.G., SPARSE MAT-VEC MULTIPLY) WAS OK, BUT IRREGULAR COMPUTATION (DIVIDE AND
CONQUER, ADAPTIVE MESHES, ETC.) WAS LESS CLEAR
• WHEN SHARED MEMORY MULTIPROCESSORS (SMPS) WERE KING
– SHARED MEMORY MODELS, E.G., POSIX THREADS AND OPENMP, BECAME POPULAR
• WHEN CLUSTERS TOOK OVER
– MESSAGE PASSING (MPI) BECAME DOMINANT
• WITH THE ADDITION OF ACCELERATORS
– OPENACC AND CUDA WERE ADDED
• IN CLOUD COMPUTING
– HADOOP, SPARK, …
(You’ll see the most popular model in each category.)
OUTLINE
• SHARED MEMORY PARALLELISM WITH THREADS
• WHAT AND WHY OPENMP?
• PARALLEL PROGRAMMING WITH OPENMP
• INTRODUCTION TO OPENMP
1. CREATING PARALLELISM
2. PARALLEL LOOPS
3. SYNCHRONIZING
4. DATA SHARING
• BENEATH THE HOOD
– SHARED MEMORY HARDWARE
• SUMMARY
RECALL PROGRAMMING MODEL 1: SHARED MEMORY
• PROGRAM IS A COLLECTION OF THREADS OF CONTROL.
– CAN BE CREATED DYNAMICALLY, MID-EXECUTION, IN SOME LANGUAGES
• EACH THREAD HAS A SET OF PRIVATE VARIABLES, E.G., LOCAL STACK VARIABLES
• ALSO A SET OF SHARED VARIABLES, E.G., STATIC VARIABLES, SHARED COMMON BLOCKS,
OR GLOBAL HEAP.
– THREADS COMMUNICATE IMPLICITLY BY WRITING AND READING SHARED VARIABLES.
– THREADS COORDINATE BY SYNCHRONIZING ON SHARED VARIABLES
[Figure: a shared memory holding variable s, accessed by threads P0 … Pn (e.g., s = …; y = … s …), each thread also having a private variable i in its own private memory.]
PARALLEL PROGRAMMING WITH THREADS
OVERVIEW OF POSIX THREADS
POSIX: PORTABLE OPERATING SYSTEM INTERFACE
– INTERFACE TO OPERATING SYSTEM UTILITIES
• PTHREADS: THE POSIX THREADING INTERFACE
– SYSTEM CALLS TO CREATE AND SYNCHRONIZE THREADS
– SHOULD BE RELATIVELY UNIFORM ACROSS UNIX-LIKE OS PLATFORMS
• PTHREADS CONTAIN SUPPORT FOR
– CREATING PARALLELISM
– SYNCHRONIZING
– NO EXPLICIT SUPPORT FOR COMMUNICATION, BECAUSE SHARED MEMORY
IS IMPLICIT; A POINTER TO SHARED DATA IS PASSED TO A THREAD
FORKING POSIX THREADS
Signature:
int pthread_create(pthread_t *, const pthread_attr_t *, void *(*)(void *), void *);
Example call:
errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
• THREAD_ID IS THE THREAD ID OR HANDLE (USED TO HALT, ETC.)
• THREAD_ATTRIBUTE HOLDS VARIOUS ATTRIBUTES
– STANDARD DEFAULT VALUES ARE OBTAINED BY PASSING A NULL POINTER
– SAMPLE ATTRIBUTES: MINIMUM STACK SIZE, PRIORITY
• THREAD_FUN IS THE FUNCTION TO BE RUN (TAKES AND RETURNS VOID*)
• FUN_ARG IS AN ARGUMENT PASSED TO THREAD_FUN WHEN IT STARTS
• ERRCODE WILL BE SET NONZERO IF THE CREATE OPERATION FAILS
“SIMPLE” THREADING EXAMPLE
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *foo) {
  printf("Hello, world!\n");
  return NULL;
}

int main() {
  pthread_t threads[16];
  int tn;
  for (tn = 0; tn < 16; tn++) {
    pthread_create(&threads[tn], NULL, SayHello, NULL);
  }
  for (tn = 0; tn < 16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}

Compile using gcc -lpthread
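The example above passes NULL as fun_arg. A minimal sketch (not from the slides) of the usual way to hand each thread its own argument: keep the per-thread values in an array that outlives the threads and pass a pointer to each element, rather than passing the address of the loop variable (which the main thread keeps changing).

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 16

static void *say_hello(void *arg) {
    int id = *(int *)arg;                 /* read this thread's own slot */
    printf("Hello, world from thread %d!\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];                    /* one slot per thread; stays valid until join */
    for (int tn = 0; tn < NTHREADS; tn++) {
        ids[tn] = tn;
        pthread_create(&threads[tn], NULL, say_hello, &ids[tn]);
    }
    for (int tn = 0; tn < NTHREADS; tn++)
        pthread_join(threads[tn], NULL);
    return 0;
}

Because main() joins all the threads before ids goes out of scope, each thread reads a valid, distinct argument.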
LOOP LEVEL PARALLELISM
• MANY SCIENTIFIC APPLICATION HAVE PARALLELISM IN LOOPS
– WITH THREADS:
… my_stuff[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    … pthread_create(…, …, update_cell, &my_stuff[i][j]);
• BUT THE OVERHEAD OF THREAD CREATION IS NONTRIVIAL
– UPDATE_CELL SHOULD HAVE A SIGNIFICANT AMOUNT OF WORK
– IDEALLY 1/P-TH OF THE TOTAL WORK PER THREAD, IF POSSIBLE (SEE THE SKETCH BELOW)
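As a concrete illustration of that advice, here is a minimal sketch (the grid size, the thread count, and the body of update_cell are illustrative assumptions) that creates only p threads, each of which updates a contiguous block of rows, instead of one thread per cell:

#include <pthread.h>
#include <stdio.h>

#define N 1024
#define P 4                              /* number of worker threads */

static double my_stuff[N][N];            /* shared grid */

/* Hypothetical per-cell work; stands in for update_cell from the slide. */
static void update_cell(int i, int j) {
    my_stuff[i][j] += 1.0;
}

/* Each thread updates rows [lo, hi). */
typedef struct { int lo, hi; } block_t;

static void *update_block(void *arg) {
    block_t *b = (block_t *)arg;
    for (int i = b->lo; i < b->hi; i++)
        for (int j = 0; j < N; j++)
            update_cell(i, j);
    return NULL;
}

int main(void) {
    pthread_t threads[P];
    block_t blocks[P];
    int rows = (N + P - 1) / P;          /* rows per thread, rounded up */
    for (int t = 0; t < P; t++) {
        blocks[t].lo = t * rows;
        blocks[t].hi = (t + 1) * rows < N ? (t + 1) * rows : N;
        pthread_create(&threads[t], NULL, update_block, &blocks[t]);
    }
    for (int t = 0; t < P; t++)
        pthread_join(threads[t], NULL);
    printf("done\n");
    return 0;
}

Each thread now does roughly 1/p of the total work, so the cost of pthread_create is amortized over many cell updates.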
RECALL DATA RACE EXAMPLE
static int s = 0;
Thread 1:                    Thread 2:
  for i = 0, n/2-1             for i = n/2, n-1
    s = s + f(A[i])              s = s + f(A[i])
• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
- two processors (or two threads) access the same variable, and at least one does
a write.
- The accesses are concurrent (not synchronized) so they could happen
simultaneously
BASIC TYPES OF SYNCHRONIZATION: MUTEXES
MUTEXES -- MUTUAL EXCLUSION AKA LOCKS
– THREADS ARE WORKING MOSTLY INDEPENDENTLY
– NEED TO ACCESS COMMON DATA STRUCTURE
lock *l = alloc_and_init();   /* shared */
acquire(l);
  … access data …
release(l);
– LOCKS ONLY AFFECT PROCESSORS USING THEM:
– IF A THREAD ACCESSES THE DATA WITHOUT DOING THE ACQUIRE/RELEASE, LOCKS BY
OTHERS WILL NOT HELP
– JAVA, C++, AND OTHER LANGUAGES HAVE LEXICALLY SCOPED
SYNCHRONIZATION, I.E., SYNCHRONIZED METHODS/BLOCKS,
SO YOU CAN’T FORGET TO SAY “RELEASE”
– SEMAPHORES (A SIGNALING MECHANISM) GENERALIZE LOCKS TO ALLOW K
THREADS SIMULTANEOUS ACCESS; GOOD FOR LIMITED RESOURCES.
– UNLIKE IN A MUTEX, A SEMAPHORE CAN BE DECREMENTED BY ANOTHER PROCESS (A MUTEX
CAN ONLY BE UNLOCKED BY ITS OWNER)
MUTEXES IN POSIX THREADS
• TO CREATE A MUTEX:
#include <pthread.h>
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;
// or: pthread_mutex_init(&amutex, NULL);
• TO USE IT:
pthread_mutex_lock(&amutex);
pthread_mutex_unlock(&amutex);
// both return an int error code
• TO DEALLOCATE A MUTEX:
int pthread_mutex_destroy(pthread_mutex_t *mutex);
• MULTIPLE MUTEXES MAY BE HELD, BUT CAN LEAD TO PROBLEMS:
Thread 1: lock(a); lock(b);
Thread 2: lock(b); lock(a);    → possible deadlock
• Deadlock results if both threads acquire one of their locks, so that neither can
acquire the second.
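To connect this back to the earlier data race on s, here is a minimal sketch (the names f, A, and n are assumptions carried over from that example) of protecting the shared accumulator with a pthread mutex; each thread sums half of the array.

#include <pthread.h>
#include <stdio.h>

#define N 1000

static double A[N];
static double s = 0.0;                          /* shared accumulator */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return 2.0 * x; }   /* assumed work function */

typedef struct { int lo, hi; } range_t;

static void *partial_sum(void *arg) {
    range_t *r = (range_t *)arg;
    double local = 0.0;                         /* accumulate locally first */
    for (int i = r->lo; i < r->hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);                /* one short critical section */
    s += local;
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_t t1, t2;
    range_t r1 = {0, N / 2}, r2 = {N / 2, N};
    pthread_create(&t1, NULL, partial_sum, &r1);
    pthread_create(&t2, NULL, partial_sum, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f\n", s);
    return 0;
}

Accumulating into a thread-local variable and locking only once per thread keeps the critical section short; locking inside the inner loop would serialize most of the work.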
SUMMARY OF PROGRAMMING WITH THREADS
• POSIX THREADS ARE BASED ON OS FEATURES
– CAN BE USED FROM MULTIPLE LANGUAGES (NEED APPROPRIATE HEADER)
– FAMILIAR LANGUAGE FOR MOST OF PROGRAM
– ABILITY TO SHARE DATA IS CONVENIENT
• PITFALLS
– OVERHEAD OF THREAD CREATION IS HIGH (ONE THREAD PER LOOP ITERATION IS PROBABLY TOO FINE-GRAINED)
– DATA RACE BUGS ARE VERY NASTY TO FIND BECAUSE THEY CAN BE INTERMITTENT
– DEADLOCKS ARE USUALLY EASIER, BUT CAN ALSO BE INTERMITTENT
• RESEARCHERS ARE LOOKING AT TRANSACTIONAL MEMORY AS AN ALTERNATIVE
• OPENMP IS COMMONLY USED TODAY AS AN ALTERNATIVE
– HELPS WITH SOME OF THESE, BUT DOESN’T MAKE THEM DISAPPEAR
WHAT IS OPENMP?
• OPENMP = OPEN SPECIFICATION FOR MULTI-PROCESSING
– OPENMP.ORG – TALKS, EXAMPLES, FORUMS, ETC.
– SPEC CONTROLLED BY THE ARB
• MOTIVATION: CAPTURE COMMON USAGE AND SIMPLIFY PROGRAMMING
• OPENMP ARCHITECTURE REVIEW BOARD (ARB)
– A NONPROFIT ORGANIZATION THAT CONTROLS THE OPENMP SPEC
– LATEST SPEC: OPENMP 5.0 (NOV. 2018)
• HIGH-LEVEL API FOR PROGRAMMING IN C/C++ AND FORTRAN
– PREPROCESSOR (COMPILER) DIRECTIVES ( ~ 80% )
#PRAGMA OMP CONSTRUCT [CLAUSE [CLAUSE …]]
– LIBRARY CALLS ( ~ 19% )
#INCLUDE <OMP.H>
– ENVIRONMENT VARIABLES ( ~ 1% )
NAMES ARE IN ALL CAPS, SET IN THE SHELL OR ADDED TO SRUN, ETC.
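A minimal sketch showing all three mechanisms together (the thread count of 4 is only an example): the OMP_NUM_THREADS environment variable sets the default team size, the omp.h library call reports it, and the pragma creates the parallel region.

// Build:  gcc -fopenmp omp_mechanisms.c
// Run:    OMP_NUM_THREADS=4 ./a.out
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("default team size: %d\n", omp_get_max_threads());  /* library call */
    #pragma omp parallel                                        /* directive */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Here OMP_NUM_THREADS is read by the runtime at startup; omp_set_num_threads() and the num_threads clause can override it from inside the program.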
A PROGRAMMER’S VIEW OF OPENMP
• OPENMP IS A PORTABLE, THREADED, SHARED-MEMORY PROGRAMMING
SPECIFICATION WITH “LIGHT” SYNTAX
– REQUIRES COMPILER SUPPORT (C, C++ OR FORTRAN)
• OPENMP WILL:
– ALLOW A PROGRAMMER TO SEPARATE A PROGRAM INTO SERIAL REGIONS AND PARALLEL
REGIONS, RATHER THAN P CONCURRENTLY-EXECUTING THREADS.
– HIDE STACK MANAGEMENT
– PROVIDE SYNCHRONIZATION CONSTRUCTS
• OPENMP WILL NOT:
– PARALLELIZE AUTOMATICALLY
– GUARANTEE SPEEDUP
– PROVIDE FREEDOM FROM DATA RACES
THE GROWTH OF COMPLEXITY IN OPENMP
• OPENMP STARTED OUT IN 1997 AS A SIMPLE INTERFACE FOR THE APPLICATION PROGRAMMERS MORE VERSED IN
THEIR AREA OF SCIENCE THAN COMPUTER SCIENCE.
• THE COMPLEXITY HAS GROWN CONSIDERABLY OVER THE YEARS!
[Figure: page counts (not counting front matter, appendices, or index) for versions of the OpenMP spec, 1996–2016. The early separate Fortran and C/C++ specs (1.0, 1.1, 2.0) are each around 100 pages or fewer; the merged C/C++ and Fortran specs grow steadily through 2.5, 3.0, and 3.1 to roughly 250 pages for 4.0 and over 300 pages for 4.5.]
OpenMP 5.0 (November 2018) is actually 666 pages.
The OpenMP Common Core: Most OpenMP programs only use these 19 items
OpenMP pragma, function, or clause – Concepts
• #pragma omp parallel – Parallel region, teams of threads, structured block, interleaved execution across threads
• int omp_get_thread_num(), int omp_get_num_threads() – Create threads with a parallel region and split up the work using the number of threads and the thread ID
• double omp_get_wtime() – Speedup and Amdahl's law; false sharing and other performance issues
• setenv OMP_NUM_THREADS N – Internal control variables; setting the default number of threads with an environment variable
• #pragma omp barrier, #pragma omp critical – Synchronization and race conditions; revisit interleaved execution
• #pragma omp for, #pragma omp parallel for – Worksharing, parallel loops, loop-carried dependencies
• reduction(op:list) – Reductions of values across a team of threads
• schedule(dynamic [,chunk]), schedule(static [,chunk]) – Loop schedules, loop overheads, and load balance
• private(list), firstprivate(list), shared(list) – Data environment
• nowait – Disabling implied barriers on workshare constructs, the high cost of barriers, and the flush concept (but not the flush directive)
• #pragma omp single – Worksharing with a single thread
• #pragma omp task, #pragma omp taskwait – Tasks, including the data environment for tasks
OPENMP BASIC DEFINITIONS: BASIC SOLUTION STACK
[Figure: the OpenMP solution stack, layered from top to bottom: End User; Application; directives and compiler, the OpenMP library, and environment variables; the OpenMP Runtime library; OS/system support for shared memory and threading; and the hardware, Proc1 … ProcN over a Shared Address Space.]
OPENMP BASIC SYNTAX
• MOST OF THE CONSTRUCTS IN OPENMP ARE COMPILER DIRECTIVES.
Compiler directives:
  C and C++:  #pragma omp construct [clause [clause]…]
  Fortran:    !$OMP construct [clause [clause] …]
Example:
  C and C++:  #pragma omp parallel private(x)
              { … }
  Fortran:    !$OMP PARALLEL
              …
              !$OMP END PARALLEL
Function prototypes and types:
  C and C++:  #include <omp.h>
  Fortran:    use OMP_LIB
• Most OpenMP* constructs apply to a “structured block”.
– Structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
– It’s OK to have an exit() within the structured block.
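As a small illustration (a sketch, not from the slides): the first block below is a legal structured block, while branching out of a parallel region with goto or break would not be; calling exit() is the one allowed way to leave early.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    #pragma omp parallel
    {                               /* legal structured block: one entry, one exit */
        int id = omp_get_thread_num();
        if (id < 0) exit(1);        /* exit() is permitted inside the block */
        printf("thread %d\n", id);
    }                               /* all threads leave through this closing brace */

    /* NOT allowed (non-conforming):
     * #pragma omp parallel
     * {
     *     goto done;               // branching out of the structured block
     * }
     * done: ;
     */
    return 0;
}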
HELLO WORLD IN OPENMP
• WRITE A PROGRAM THAT PRINTS “HELLO WORLD”.
#include<stdio.h>
int main()
{
printf(" hello ");
printf(" world \n");
}
HELLO WORLD IN OPENMP
• WRITE A MULTITHREADED PROGRAM THAT PRINTS “HELLO WORLD”.
#include <omp.h>
#include <stdio.h>
int main()
{
  #pragma omp parallel
  {
    printf(" hello ");
    printf(" world \n");
  }
}

Switches for compiling and linking:
  gcc -fopenmp      GNU (Linux, OSX)
  pgcc -mp          PGI (Linux)
  icl /Qopenmp      Intel (Windows)
  icc -fopenmp      Intel (Linux, OSX)
HELLO WORLD IN OPENMP
• WRITE A MULTITHREADED PROGRAM WHERE EACH THREAD PRINTS “HELLO WORLD”.
#include <omp.h>       // OpenMP include file
#include <stdio.h>
int main()
{
  #pragma omp parallel  // parallel region with the default number of threads
  {
    printf(" hello ");
    printf(" world \n");
  }                     // end of the parallel region
}

Sample output:
  hello hello world
  world
  hello hello world
  world

The statements are interleaved, depending on how the operating system schedules the threads.
OPENMP PROGRAMMING MODEL:
Fork-Join Parallelism:
◆ Master thread spawns a team of threads as needed.
◆ Parallelism added incrementally until performance goals are met, i.e., the sequential program
evolves into a parallel program.
[Figure: a master thread executes the sequential parts of the program; at each parallel region it forks a team of threads (shown in red), one of which contains a nested parallel region, and the team joins back into the master thread at the end of each region.]
THREAD CREATION: PARALLEL REGIONS
• YOU CREATE THREADS IN OPENMP* WITH THE PARALLEL CONSTRUCT.
• FOR EXAMPLE, TO CREATE A 4 THREAD PARALLEL REGION:
double A[1000];
omp_set_num_threads(4);            // runtime function to request a certain number of threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();   // runtime function returning a thread ID
  pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
● Each thread calls pooh(ID,A) for ID = 0 to 3
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
THREAD CREATION: PARALLEL REGIONS EXAMPLE
• EACH THREAD EXECUTES THE SAME CODE REDUNDANTLY.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

A single copy of A is shared between all threads; pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A) execute concurrently.
Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e., a barrier); only then is printf("all done\n") executed.
THREAD CREATION: HOW MANY THREADS DID YOU ACTUALLY GET?
• YOU CREATE A TEAM THREADS IN OPENMP* WITH THE PARALLEL CONSTRUCT.
• YOU CAN REQUEST A NUMBER OF THREADS WITH OMP_SET_NUM_THREADS()
• BUT IS THE NUMBER OF THREADS REQUESTED THE NUMBER YOU ACTUALLY GET?
– NO! AN IMPLEMENTATION CAN SILENTLY DECIDE TO GIVE YOU A TEAM WITH FEWER THREADS.
– ONCE A TEAM OF THREADS IS ESTABLISHED … THE SYSTEM WILL NOT REDUCE THE SIZE OF THE TEAM.
double A[1000];
omp_set_num_threads(4);               // runtime function to request a certain number of threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  int nthrds = omp_get_num_threads(); // runtime function returning the actual number of threads in the team
  pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
● Each thread calls pooh(ID,A) for ID = 0 to nthrds-1
AN INTERESTING PROBLEM TO PLAY WITH: NUMERICAL INTEGRATION
Mathematically, we know that:

    ∫₀¹ 4.0 / (1 + x²) dx = π

We can approximate the integral as a sum of N rectangles:

    Σ_{i=0..N−1} F(xᵢ) · Δx ≈ π

where each rectangle has width Δx and height F(xᵢ) taken at the middle of interval i.

[Figure: the curve F(x) = 4.0 / (1 + x²) plotted for x from 0.0 to 1.0 (y-axis from 0.0 to 4.0), with the area under the curve covered by rectangles.]
SERIAL PI PROGRAM
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
SERIAL PI PROGRAM WITH TIMING
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0, tdata;
  step = 1.0/(double) num_steps;
  tdata = omp_get_wtime();
  for (i=0; i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  tdata = omp_get_wtime() - tdata;
  printf(" pi = %f in %f secs\n", pi, tdata);
}

The library routine omp_get_wtime() is used to find the elapsed "wall time" for blocks of code.
SHARED MEMORY HARDWARE
AND
MEMORY CONSISTENCY
BASIC SHARED MEMORY ARCHITECTURE
• PROCESSORS ALL CONNECTED TO A LARGE SHARED MEMORY
– WHERE ARE CACHES?
[Figure: processors P1, P2, …, Pn connected through an interconnect to a single shared memory.]
• Now take a closer look at structure, costs, limits, programming
WHAT ABOUT CACHING???
[Figure: processors P1 … Pn, each with its own cache ($), connected by a bus to memory and I/O devices.]
• WANT HIGH PERFORMANCE FOR SHARED MEMORY: USE CACHES!
– EACH PROCESSOR HAS ITS OWN CACHE (OR MULTIPLE CACHES)
– PLACE DATA FROM MEMORY INTO CACHE
– WRITEBACK CACHE: DON’T SEND ALL WRITES OVER BUS TO MEMORY
• CACHES REDUCE AVERAGE LATENCY
– AUTOMATIC REPLICATION CLOSER TO PROCESSOR
– MORE IMPORTANT TO MULTIPROCESSOR THAN UNIPROCESSOR: LATENCIES LONGER
• NORMAL UNIPROCESSOR MECHANISMS TO ACCESS DATA
– LOADS AND STORES FORM VERY LOW-OVERHEAD COMMUNICATION PRIMITIVE
• PROBLEM: CACHE COHERENCE!
EXAMPLE CACHE COHERENCE PROBLEM
[Figure: processors P1, P2, and P3, each with a cache, connected by a bus to memory and I/O devices; memory initially holds u:5. Events: (1) P1 reads u and caches u:5; (2) P3 reads u and caches u:5; (3) P3 writes u = 7 into its own cache; (4) P1 reads u again; (5) P2 reads u.]
• THINGS TO NOTE:
– PROCESSORS COULD SEE DIFFERENT VALUES FOR U AFTER EVENT 3
– WITH WRITE BACK CACHES, VALUE WRITTEN BACK TO MEMORY DEPENDS ON HAPPENSTANCE OF WHICH
CACHE FLUSHES OR WRITES BACK VALUE WHEN
• HOW TO FIX WITH A BUS: COHERENCE PROTOCOL
– USE BUS TO BROADCAST WRITES OR INVALIDATIONS
– SIMPLE PROTOCOLS RELY ON PRESENCE OF BROADCAST MEDIUM
• BUS NOT SCALABLE BEYOND ABOUT 100 PROCESSORS (MAX)
– CAPACITY, BANDWIDTH LIMITATIONS
SNOOPY CACHE-COHERENCE PROTOCOLS
[Figure: processors P0 … Pn, each with a cache holding state, address, and data fields and a bus-snooping controller, attached together with the memory modules (Mem) to a shared memory bus; each cache snoops the memory operations that other processors issue on the bus.]
• MEMORY BUS IS A BROADCAST MEDIUM
• CACHES CONTAIN INFORMATION ON WHICH ADDRESSES THEY STORE
• CACHE CONTROLLER “SNOOPS” ALL TRANSACTIONS ON THE BUS
– A TRANSACTION IS A RELEVANT TRANSACTION IF IT INVOLVES A CACHE BLOCK CURRENTLY
CONTAINED IN THIS CACHE
– TAKE ACTION TO ENSURE COHERENCE
– INVALIDATE, UPDATE, OR SUPPLY VALUE
– MANY POSSIBLE DESIGNS (SEE CS252 OR CS258)
INTUITIVE MEMORY MODEL
• READING AN ADDRESS SHOULD RETURN THE LAST VALUE WRITTEN TO THAT ADDRESS
• EASY IN UNIPROCESSORS
– EXCEPT FOR I/O
• CACHE COHERENCE PROBLEM IN MPS IS MORE PERVASIVE AND MORE PERFORMANCE
CRITICAL
• MORE FORMALLY, THIS IS CALLED SEQUENTIAL CONSISTENCY:
“A MULTIPROCESSOR IS SEQUENTIALLY CONSISTENT IF THE RESULT OF ANY EXECUTION
IS THE SAME AS IF THE OPERATIONS OF ALL THE PROCESSORS WERE EXECUTED IN
SOME SEQUENTIAL ORDER, AND THE OPERATIONS OF EACH INDIVIDUAL PROCESSOR
APPEAR IN THIS SEQUENCE IN THE ORDER SPECIFIED BY ITS PROGRAM.” [LAMPORT,
1979]
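To make the definition concrete, here is a small sketch (not from the slides) in C11: a naive program would need sequential consistency to guarantee that the consumer sees data = 42 once it sees flag = 1. C11 atomics with the default seq_cst ordering (used by atomic_store/atomic_load below) provide exactly that ordering for the flag, so the pattern is safe even on hardware that reorders ordinary loads and stores.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;
static atomic_int flag = 0;        /* seq_cst by default */

static void *producer(void *arg) {
    data = 42;                     /* (1) write the payload       */
    atomic_store(&flag, 1);        /* (2) then publish it via flag */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)   /* (3) wait until published  */
        ;
    printf("data = %d\n", data);      /* (4) guaranteed to print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The slide's point is that sequential consistency is what programmers intuitively assume; compilers and hardware provide weaker orderings for plain variables, which is why synchronization constructs (locks, atomics, OpenMP barrier/critical) are needed.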
PARALLEL PI PROGRAM: A FIRST SPMD (SINGLE PROGRAM MULTIPLE DATA) VERSION
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 4
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{ int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i] * step;
}
PARALLEL PI PROGRAM (2013)
• Original Serial pi program with 100 million steps ran in 1.83 seconds*.
threads    1st SPMD (secs)
1          1.86
2          1.03
3          1.08
4          0.97
* Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte of DDR3 memory at 1.333 GHz.
PARALLEL PI PROGRAM (2020)
• Original Serial pi program with 1 billion steps ran in 0.985 seconds*.
threads    1st SPMD (secs)
1          1.102
2          0.512
4          0.280
8          0.276
*GCC with -O3 optimization on Apple macOS 10.14.6 with a quad core (8 HW threads) 2.8 GHz Intel Core i7 and 16 GByte of LPDDR3 memory at 2.133 GHz.
WHY SUCH POOR SCALING? FALSE SHARING
• IF INDEPENDENT DATA ELEMENTS HAPPEN TO SIT ON THE SAME CACHE LINE, EACH UPDATE WILL CAUSE THE
CACHE LINES TO “SLOSH BACK AND FORTH” BETWEEN THREADS … THIS IS CALLED “FALSE SHARING”.
[Figure: two cores, each running two HW threads and each with its own L1 cache; the cache line holding Sum[0], Sum[1], Sum[2], and Sum[3] is replicated in both L1 caches above the shared last-level cache and the connection to I/O and DRAM, so an update by any thread invalidates the other core's copy of the whole line.]
• If you promote scalars to an array to support creation of an SPMD program, the array elements are
contiguous in memory and hence share cache lines … Results in poor scalability.
• Solution: Pad arrays so elements you use are on distinct cache lines.
EXAMPLE: ELIMINATE FALSE SHARING BY PADDING THE SUM ARRAY
#include <omp.h>
static long num_steps = 100000; double step;
#define PAD 8          // assume 64 byte L1 cache line size
#define NUM_THREADS 4
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS][PAD];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  { int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id][0]=0.0; i< num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id][0] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i][0] * step;
}

Pad the array so each thread's sum value lives in a different cache line.
RESULTS*: PI PROGRAM PADDED ACCUMULATOR (2013)
• Original Serial pi program with 100000000 steps ran in 1.83 seconds.
threads    1st SPMD (secs)    1st SPMD padded (secs)
1          1.86               1.86
2          1.03               1.01
3          1.08               0.69
4          0.97               0.53
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte of DDR3 memory at 1.333 GHz.
RESULTS*: PI PROGRAM PADDED ACCUMULATOR (2020)
• Original Serial pi program with 1 billion steps ran in 0.985 seconds*.
threads    1st SPMD (secs)    1st SPMD padded (secs)
1          1.102              0.987
2          0.512              0.496
4          0.280              0.271
8          0.276              0.268
*GCC with -O3 optimization on Apple macOS 10.14.6 with a quad core (8 HW threads) 2.8 GHz Intel Core i7 and 16 GByte of LPDDR3 memory at 2.133 GHz.
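Padding works, but it depends on knowing the cache-line size. A minimal sketch (assuming the same num_steps as above) of the alternative hinted at in the Common Core table: let OpenMP do the accumulation with a reduction clause, so there is no shared sum[] array to false-share at all.

#include <omp.h>
#include <stdio.h>

static long num_steps = 100000000;

int main(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    /* Each thread gets a private copy of sum; OpenMP combines them at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    double pi = step * sum;
    printf("pi = %f in %f secs\n", pi, omp_get_wtime() - t0);
    return 0;
}

A reduction-based version would be expected to perform comparably to the padded SPMD code while being shorter and independent of the cache-line size.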