CS-402 Parallel and Distributed Systems
Fall 2024
Quick Review Lecture No. 05
Modern CPU core architecture
o Superscalar (pipeline, multiple issues)
o Branch prediction
o Out of order execution
o Many execution units
o Memory hierarchy
Implications for software performance
Locality – temporal locality and spatial locality
COMMUNICATION MODEL OF PARALLEL PLATFORMS
There are two primary forms of data exchange between parallel tasks – accessing a shared
data space and exchanging messages.
1. Platforms that provide a shared data space are called shared address-space machines or
multiprocessors.
2. Platforms that support explicit message exchange are called message-passing platforms or
multicomputers.
SHARED-ADDRESS-SPACE PLATFORMS
Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared-address-space.
• If the time taken by a processor to access any memory word in the system (whether global or local) is
identical, the platform is classified as a uniform memory access (UMA) machine; otherwise it is a
non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms:
The distinction between NUMA and UMA platforms is important from the point of view of algorithm
design: NUMA machines require locality in the underlying algorithms for good performance. Programming
these platforms is easier since reads and writes are implicitly visible to other processors.
However, read-write access to shared data must be coordinated (this will be discussed in greater detail
when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache
coherence problem.
• A weaker model of these machines provides an address map, but not coordinated access. These
models are called non-cache-coherent shared-address-space machines.
SHARED-ADDRESS-SPACE VS. SHARED MEMORY MACHINES
It is important to note the difference between the terms shared address space and shared
memory. We refer to the former as a programming abstraction and to the latter as a physical
machine attribute. It is possible to provide a shared address space using a physically
distributed memory.
Physical Organization of Parallel Platforms
We begin this discussion with an ideal parallel machine called Parallel Random Access Machine,
or PRAM.
Architecture of an Ideal Parallel Computer
A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel
Random Access Machine, or PRAM.
PRAMs consist of p processors and a global memory of unbounded size that is uniformly
accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.
ARCHITECTURE OF AN IDEAL PARALLEL COMPUTER
Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four
subclasses.
• Exclusive-read, exclusive-write (EREW) PRAM.
• Concurrent-read, exclusive-write (CREW) PRAM
• Exclusive-read, concurrent-write (ERCW) PRAM.
• Concurrent-read, concurrent-write (CRCW) PRAM.
In a CRCW PRAM, concurrent writes to the same location must be resolved by one of the following protocols:
– Common: write only if all values are identical.
– Arbitrary: write the data from an arbitrarily selected processor.
– Priority: follow a predetermined priority order.
– Sum: write the sum of all data items.
A GENERIC PARALLEL ARCHITECTURE
[Figure: a set of processors (Proc) and memory modules (Memory) connected by an interconnection network.]
• Where is the memory physically located?
• Is it connected directly to processors?
• What is the connectivity of the network?
A BRIEF HISTORY OF PARALLEL LANGUAGES
• WHEN VECTOR MACHINES WERE KING
– PARALLEL “LANGUAGES” WERE LOOP ANNOTATIONS (IVDEP)
– MAPPING TO MPPS/CLUSTERS: PERFORMANCE WAS FRAGILE, AND GOOD USER SUPPORT PROVED HARD
• WHEN SIMD MACHINES WERE KING
– DATA PARALLEL LANGUAGES WERE POPULAR AND SUCCESSFUL (CMF, *LISP, C*, …)
– IRREGULAR DATA (E.G., SPARSE MAT-VEC MULTIPLY) WAS OK, BUT IRREGULAR COMPUTATION (DIVIDE AND
CONQUER, ADAPTIVE MESHES, ETC.) WAS LESS CLEAR
• WHEN SHARED MEMORY MULTIPROCESSORS (SMPS) WERE KING
– SHARED MEMORY MODELS, E.G., POSIX THREADS AND OPENMP, BECAME POPULAR
• WHEN CLUSTERS TOOK OVER
– MESSAGE PASSING (MPI) BECAME DOMINANT
• WITH THE ADDITION OF ACCELERATORS
– OPENACC AND CUDA WERE ADDED
• IN CLOUD COMPUTING
– HADOOP, SPARK, …
(You’ll see the most popular model in each category.)
OUTLINE
• SHARED MEMORY PARALLELISM WITH THREADS
• WHAT AND WHY OPENMP?
• PARALLEL PROGRAMMING WITH OPENMP
• INTRODUCTION TO OPENMP
1. CREATING PARALLELISM
2. PARALLEL LOOPS
3. SYNCHRONIZING
4. DATA SHARING
• BENEATH THE HOOD
– SHARED MEMORY HARDWARE
• SUMMARY
RECALL PROGRAMMING MODEL 1: SHARED MEMORY
• PROGRAM IS A COLLECTION OF THREADS OF CONTROL.
– CAN BE CREATED DYNAMICALLY, MID-EXECUTION, IN SOME LANGUAGES
• EACH THREAD HAS A SET OF PRIVATE VARIABLES, E.G., LOCAL STACK VARIABLES
• ALSO A SET OF SHARED VARIABLES, E.G., STATIC VARIABLES, SHARED COMMON BLOCKS,
OR GLOBAL HEAP.
– THREADS COMMUNICATE IMPLICITLY BY WRITING AND READING SHARED VARIABLES.
– THREADS COORDINATE BY SYNCHRONIZING ON SHARED VARIABLES
[Figure: a shared memory holding variable s, accessed by threads P0 … Pn (e.g., s = …; y = … s …), each thread also having a private variable i in its own private memory.]
PARALLEL PROGRAMMING WITH THREADS
OVERVIEW OF POSIX THREADS
POSIX: PORTABLE OPERATING SYSTEM INTERFACE
– INTERFACE TO OPERATING SYSTEM UTILITIES
• PTHREADS: THE POSIX THREADING INTERFACE
– SYSTEM CALLS TO CREATE AND SYNCHRONIZE THREADS
– SHOULD BE RELATIVELY UNIFORM ACROSS UNIX-LIKE OS PLATFORMS
• PTHREADS CONTAIN SUPPORT FOR
– CREATING PARALLELISM
– SYNCHRONIZING
– NO EXPLICIT SUPPORT FOR COMMUNICATION, BECAUSE SHARED MEMORY
IS IMPLICIT; A POINTER TO SHARED DATA IS PASSED TO A THREAD
FORKING POSIX THREADS
Signature:
int pthread_create(pthread_t *, const pthread_attr_t *, void *(*)(void *), void *);
Example call:
errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
• THREAD_ID IS THE THREAD ID OR HANDLE (USED TO HALT, ETC.)
• THREAD_ATTRIBUTE HOLDS VARIOUS ATTRIBUTES
– STANDARD DEFAULT VALUES ARE OBTAINED BY PASSING A NULL POINTER
– SAMPLE ATTRIBUTES: MINIMUM STACK SIZE, PRIORITY
• THREAD_FUN IS THE FUNCTION TO BE RUN (TAKES AND RETURNS VOID*)
• FUN_ARG IS AN ARGUMENT PASSED TO THREAD_FUN WHEN IT STARTS
• ERRCODE WILL BE SET NONZERO IF THE CREATE OPERATION FAILS
“SIMPLE” THREADING EXAMPLE
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *foo) {
  printf("Hello, world!\n");
  return NULL;
}

int main() {
  pthread_t threads[16];
  int tn;
  for (tn = 0; tn < 16; tn++) {
    pthread_create(&threads[tn], NULL, SayHello, NULL);
  }
  for (tn = 0; tn < 16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}

Compile using gcc -lpthread
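The example above passes NULL as fun_arg. A minimal sketch (not from the slides) of the usual way to hand each thread its own argument: keep the per-thread values in an array that outlives the threads and pass a pointer to each element, rather than passing the address of the loop variable (which the main thread keeps changing).

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 16

static void *say_hello(void *arg) {
    int id = *(int *)arg;                 /* read this thread's own slot */
    printf("Hello, world from thread %d!\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];                    /* one slot per thread; stays valid until join */
    for (int tn = 0; tn < NTHREADS; tn++) {
        ids[tn] = tn;
        pthread_create(&threads[tn], NULL, say_hello, &ids[tn]);
    }
    for (int tn = 0; tn < NTHREADS; tn++)
        pthread_join(threads[tn], NULL);
    return 0;
}

Because main() joins all the threads before ids goes out of scope, each thread reads a valid, distinct argument.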
LOOP LEVEL PARALLELISM
• MANY SCIENTIFIC APPLICATION HAVE PARALLELISM IN LOOPS
– WITH THREADS:
… my_stuff[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    … pthread_create(…, …, update_cell, &my_stuff[i][j]);
• BUT THE OVERHEAD OF THREAD CREATION IS NONTRIVIAL
– UPDATE_CELL SHOULD HAVE A SIGNIFICANT AMOUNT OF WORK
– IDEALLY 1/P-TH OF THE TOTAL WORK PER THREAD, IF POSSIBLE (SEE THE SKETCH BELOW)
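As a concrete illustration of that advice, here is a minimal sketch (the grid size, the thread count, and the body of update_cell are illustrative assumptions) that creates only p threads, each of which updates a contiguous block of rows, instead of one thread per cell:

#include <pthread.h>
#include <stdio.h>

#define N 1024
#define P 4                              /* number of worker threads */

static double my_stuff[N][N];            /* shared grid */

/* Hypothetical per-cell work; stands in for update_cell from the slide. */
static void update_cell(int i, int j) {
    my_stuff[i][j] += 1.0;
}

/* Each thread updates rows [lo, hi). */
typedef struct { int lo, hi; } block_t;

static void *update_block(void *arg) {
    block_t *b = (block_t *)arg;
    for (int i = b->lo; i < b->hi; i++)
        for (int j = 0; j < N; j++)
            update_cell(i, j);
    return NULL;
}

int main(void) {
    pthread_t threads[P];
    block_t blocks[P];
    int rows = (N + P - 1) / P;          /* rows per thread, rounded up */
    for (int t = 0; t < P; t++) {
        blocks[t].lo = t * rows;
        blocks[t].hi = (t + 1) * rows < N ? (t + 1) * rows : N;
        pthread_create(&threads[t], NULL, update_block, &blocks[t]);
    }
    for (int t = 0; t < P; t++)
        pthread_join(threads[t], NULL);
    printf("done\n");
    return 0;
}

Each thread now does roughly 1/p of the total work, so the cost of pthread_create is amortized over many cell updates.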
RECALL DATA RACE EXAMPLE
static int s = 0;
Thread 1:                    Thread 2:
  for i = 0, n/2-1             for i = n/2, n-1
    s = s + f(A[i])              s = s + f(A[i])
• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
- two processors (or two threads) access the same variable, and at least one does
a write.
- The accesses are concurrent (not synchronized) so they could happen
simultaneously
BASIC TYPES OF SYNCHRONIZATION: MUTEXES
MUTEXES -- MUTUAL EXCLUSION AKA LOCKS
– THREADS ARE WORKING MOSTLY INDEPENDENTLY
– NEED TO ACCESS COMMON DATA STRUCTURE
lock *l = alloc_and_init();   /* shared */
acquire(l);
  … access data …
release(l);
– LOCKS ONLY AFFECT PROCESSORS USING THEM:
– IF A THREAD ACCESSES THE DATA WITHOUT DOING THE ACQUIRE/RELEASE, LOCKS BY
OTHERS WILL NOT HELP
– JAVA, C++, AND OTHER LANGUAGES HAVE LEXICALLY SCOPED
SYNCHRONIZATION, I.E., SYNCHRONIZED METHODS/BLOCKS,
SO YOU CAN’T FORGET TO SAY “RELEASE”
– SEMAPHORES (A SIGNALING MECHANISM) GENERALIZE LOCKS TO ALLOW K
THREADS SIMULTANEOUS ACCESS; GOOD FOR LIMITED RESOURCES.
– UNLIKE IN A MUTEX, A SEMAPHORE CAN BE DECREMENTED BY ANOTHER PROCESS (A MUTEX
CAN ONLY BE UNLOCKED BY ITS OWNER)
MUTEXES IN POSIX THREADS
• TO CREATE A MUTEX:
#include <pthread.h>
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;
// or: pthread_mutex_init(&amutex, NULL);
• TO USE IT:
pthread_mutex_lock(&amutex);
pthread_mutex_unlock(&amutex);
// both return an int error code
• TO DEALLOCATE A MUTEX:
int pthread_mutex_destroy(pthread_mutex_t *mutex);
• MULTIPLE MUTEXES MAY BE HELD, BUT CAN LEAD TO PROBLEMS:
Thread 1: lock(a); lock(b);
Thread 2: lock(b); lock(a);    → possible deadlock
• Deadlock results if both threads acquire one of their locks, so that neither can
acquire the second.
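To connect this back to the earlier data race on s, here is a minimal sketch (the names f, A, and n are assumptions carried over from that example) of protecting the shared accumulator with a pthread mutex; each thread sums half of the array.

#include <pthread.h>
#include <stdio.h>

#define N 1000

static double A[N];
static double s = 0.0;                          /* shared accumulator */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return 2.0 * x; }   /* assumed work function */

typedef struct { int lo, hi; } range_t;

static void *partial_sum(void *arg) {
    range_t *r = (range_t *)arg;
    double local = 0.0;                         /* accumulate locally first */
    for (int i = r->lo; i < r->hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);                /* one short critical section */
    s += local;
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_t t1, t2;
    range_t r1 = {0, N / 2}, r2 = {N / 2, N};
    pthread_create(&t1, NULL, partial_sum, &r1);
    pthread_create(&t2, NULL, partial_sum, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f\n", s);
    return 0;
}

Accumulating into a thread-local variable and locking only once per thread keeps the critical section short; locking inside the inner loop would serialize most of the work.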
SUMMARY OF PROGRAMMING WITH THREADS
• POSIX THREADS ARE BASED ON OS FEATURES
– CAN BE USED FROM MULTIPLE LANGUAGES (NEED APPROPRIATE HEADER)
– FAMILIAR LANGUAGE FOR MOST OF PROGRAM
– ABILITY TO SHARE DATA IS CONVENIENT
• PITFALLS
– OVERHEAD OF THREAD CREATION IS HIGH (ONE THREAD PER LOOP ITERATION IS PROBABLY TOO FINE-GRAINED)
– DATA RACE BUGS ARE VERY NASTY TO FIND BECAUSE THEY CAN BE INTERMITTENT
– DEADLOCKS ARE USUALLY EASIER, BUT CAN ALSO BE INTERMITTENT
• RESEARCHERS ARE LOOKING AT TRANSACTIONAL MEMORY AS AN ALTERNATIVE
• OPENMP IS COMMONLY USED TODAY AS AN ALTERNATIVE
– HELPS WITH SOME OF THESE, BUT DOESN’T MAKE THEM DISAPPEAR
WHAT IS OPENMP?
• OPENMP = OPEN SPECIFICATION FOR MULTI-PROCESSING
– OPENMP.ORG – TALKS, EXAMPLES, FORUMS, ETC.
– SPEC CONTROLLED BY THE ARB
• MOTIVATION: CAPTURE COMMON USAGE AND SIMPLIFY PROGRAMMING
• OPENMP ARCHITECTURE REVIEW BOARD (ARB)
– A NONPROFIT ORGANIZATION THAT CONTROLS THE OPENMP SPEC
– LATEST SPEC: OPENMP 5.0 (NOV. 2018)
• HIGH-LEVEL API FOR PROGRAMMING IN C/C++ AND FORTRAN
– PREPROCESSOR (COMPILER) DIRECTIVES ( ~ 80% )
#PRAGMA OMP CONSTRUCT [CLAUSE [CLAUSE …]]
– LIBRARY CALLS ( ~ 19% )
#INCLUDE <OMP.H>
– ENVIRONMENT VARIABLES ( ~ 1% )
NAMES ARE IN ALL CAPS, SET IN THE SHELL OR ADDED TO SRUN, ETC.
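A minimal sketch showing all three mechanisms together (the thread count of 4 is only an example): the OMP_NUM_THREADS environment variable sets the default team size, the omp.h library call reports it, and the pragma creates the parallel region.

// Build:  gcc -fopenmp omp_mechanisms.c
// Run:    OMP_NUM_THREADS=4 ./a.out
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("default team size: %d\n", omp_get_max_threads());  /* library call */
    #pragma omp parallel                                        /* directive */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Here OMP_NUM_THREADS is read by the runtime at startup; omp_set_num_threads() and the num_threads clause can override it from inside the program.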
A PROGRAMMER’S VIEW OF OPENMP
• OPENMP IS A PORTABLE, THREADED, SHARED-MEMORY PROGRAMMING
SPECIFICATION WITH “LIGHT” SYNTAX
– REQUIRES COMPILER SUPPORT (C, C++ OR FORTRAN)
• OPENMP WILL:
– ALLOW A PROGRAMMER TO SEPARATE A PROGRAM INTO SERIAL REGIONS AND PARALLEL
REGIONS, RATHER THAN P CONCURRENTLY-EXECUTING THREADS.
– HIDE STACK MANAGEMENT
– PROVIDE SYNCHRONIZATION CONSTRUCTS
• OPENMP WILL NOT:
– PARALLELIZE AUTOMATICALLY
– GUARANTEE SPEEDUP
– PROVIDE FREEDOM FROM DATA RACES
THE GROWTH OF COMPLEXITY IN OPENMP
• OPENMP STARTED OUT IN 1997 AS A SIMPLE INTERFACE FOR THE APPLICATION PROGRAMMERS MORE VERSED IN
THEIR AREA OF SCIENCE THAN COMPUTER SCIENCE.
• THE COMPLEXITY HAS GROWN CONSIDERABLY OVER THE YEARS!
[Figure: page counts (not counting front matter, appendices, or index) for versions of the OpenMP spec, 1996–2016. The early separate Fortran and C/C++ specs (1.0, 1.1, 2.0) are each around 100 pages or fewer; the merged C/C++ and Fortran specs grow steadily through 2.5, 3.0, and 3.1 to roughly 250 pages for 4.0 and over 300 pages for 4.5.]
OpenMP 5.0 (November 2018) is actually 666 pages.
The OpenMP Common Core: Most OpenMP programs only use these 19 items
OpenMP pragma, function, or clause – Concepts
• #pragma omp parallel – Parallel region, teams of threads, structured block, interleaved execution across threads
• int omp_get_thread_num(), int omp_get_num_threads() – Create threads with a parallel region and split up the work using the number of threads and the thread ID
• double omp_get_wtime() – Speedup and Amdahl's law; false sharing and other performance issues
• setenv OMP_NUM_THREADS N – Internal control variables; setting the default number of threads with an environment variable
• #pragma omp barrier, #pragma omp critical – Synchronization and race conditions; revisit interleaved execution
• #pragma omp for, #pragma omp parallel for – Worksharing, parallel loops, loop-carried dependencies
• reduction(op:list) – Reductions of values across a team of threads
• schedule(dynamic [,chunk]), schedule(static [,chunk]) – Loop schedules, loop overheads, and load balance
• private(list), firstprivate(list), shared(list) – Data environment
• nowait – Disabling implied barriers on workshare constructs, the high cost of barriers, and the flush concept (but not the flush directive)
• #pragma omp single – Worksharing with a single thread
• #pragma omp task, #pragma omp taskwait – Tasks, including the data environment for tasks
OPENMP BASIC DEFINITIONS: BASIC SOLUTION STACK
[Figure: the OpenMP solution stack, layered from top to bottom: End User; Application; directives and compiler, the OpenMP library, and environment variables; the OpenMP Runtime library; OS/system support for shared memory and threading; and the hardware, Proc1 … ProcN over a Shared Address Space.]
OPENMP BASIC SYNTAX
• MOST OF THE CONSTRUCTS IN OPENMP ARE COMPILER DIRECTIVES.
Compiler directives:
  C and C++:  #pragma omp construct [clause [clause]…]
  Fortran:    !$OMP construct [clause [clause] …]
Example:
  C and C++:  #pragma omp parallel private(x)
              { … }
  Fortran:    !$OMP PARALLEL
              …
              !$OMP END PARALLEL
Function prototypes and types:
  C and C++:  #include <omp.h>
  Fortran:    use OMP_LIB
• Most OpenMP* constructs apply to a “structured block”.
– Structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
– It’s OK to have an exit() within the structured block.
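As a small illustration (a sketch, not from the slides): the first block below is a legal structured block, while branching out of a parallel region with goto or break would not be; calling exit() is the one allowed way to leave early.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    #pragma omp parallel
    {                               /* legal structured block: one entry, one exit */
        int id = omp_get_thread_num();
        if (id < 0) exit(1);        /* exit() is permitted inside the block */
        printf("thread %d\n", id);
    }                               /* all threads leave through this closing brace */

    /* NOT allowed (non-conforming):
     * #pragma omp parallel
     * {
     *     goto done;               // branching out of the structured block
     * }
     * done: ;
     */
    return 0;
}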
HELLO WORLD IN OPENMP
• WRITE A PROGRAM THAT PRINTS “HELLO WORLD”.
#include<stdio.h>
int main()
{
printf(" hello ");
printf(" world \n");
}
HELLO WORLD IN OPENMP
• WRITE A MULTITHREADED PROGRAM THAT PRINTS “HELLO WORLD”.
#include <omp.h>
#include <stdio.h>
int main()
{
  #pragma omp parallel
  {
    printf(" hello ");
    printf(" world \n");
  }
}

Switches for compiling and linking:
  gcc -fopenmp      GNU (Linux, OSX)
  pgcc -mp          PGI (Linux)
  icl /Qopenmp      Intel (Windows)
  icc -fopenmp      Intel (Linux, OSX)
HELLO WORLD IN OPENMP
• WRITE A MULTITHREADED PROGRAM WHERE EACH THREAD PRINTS “HELLO WORLD”.
#include <omp.h>       // OpenMP include file
#include <stdio.h>
int main()
{
  #pragma omp parallel  // parallel region with the default number of threads
  {
    printf(" hello ");
    printf(" world \n");
  }                     // end of the parallel region
}

Sample output:
  hello hello world
  world
  hello hello world
  world

The statements are interleaved, depending on how the operating system schedules the threads.
OPENMP PROGRAMMING MODEL:
Fork-Join Parallelism:
◆ Master thread spawns a team of threads as needed.
◆ Parallelism added incrementally until performance goals are met, i.e., the sequential program
evolves into a parallel program.
[Figure: a master thread executes the sequential parts of the program; at each parallel region it forks a team of threads (shown in red), one of which contains a nested parallel region, and the team joins back into the master thread at the end of each region.]
THREAD CREATION: PARALLEL REGIONS
• YOU CREATE THREADS IN OPENMP* WITH THE PARALLEL CONSTRUCT.
• FOR EXAMPLE, TO CREATE A 4 THREAD PARALLEL REGION:
double A[1000];
omp_set_num_threads(4);            // runtime function to request a certain number of threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();   // runtime function returning a thread ID
  pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
● Each thread calls pooh(ID,A) for ID = 0 to 3
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
THREAD CREATION: PARALLEL REGIONS EXAMPLE
• EACH THREAD EXECUTES THE SAME CODE REDUNDANTLY.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

A single copy of A is shared between all threads; pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A) execute concurrently.
Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e., a barrier); only then is printf("all done\n") executed.
THREAD CREATION: HOW MANY THREADS DID YOU ACTUALLY GET?
• YOU CREATE A TEAM THREADS IN OPENMP* WITH THE PARALLEL CONSTRUCT.
• YOU CAN REQUEST A NUMBER OF THREADS WITH OMP_SET_NUM_THREADS()
• BUT IS THE NUMBER OF THREADS REQUESTED THE NUMBER YOU ACTUALLY GET?
– NO! AN IMPLEMENTATION CAN SILENTLY DECIDE TO GIVE YOU A TEAM WITH FEWER THREADS.
– ONCE A TEAM OF THREADS IS ESTABLISHED … THE SYSTEM WILL NOT REDUCE THE SIZE OF THE TEAM.
double A[1000];
omp_set_num_threads(4);               // runtime function to request a certain number of threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  int nthrds = omp_get_num_threads(); // runtime function returning the actual number of threads in the team
  pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
● Each thread calls pooh(ID,A) for ID = 0 to nthrds-1
AN INTERESTING PROBLEM TO PLAY WITH: NUMERICAL INTEGRATION
Mathematically, we know that:

    ∫₀¹ 4.0 / (1 + x²) dx = π

We can approximate the integral as a sum of N rectangles:

    Σ_{i=0..N−1} F(xᵢ) · Δx ≈ π

where each rectangle has width Δx and height F(xᵢ) taken at the middle of interval i.

[Figure: the curve F(x) = 4.0 / (1 + x²) plotted for x from 0.0 to 1.0 (y-axis from 0.0 to 4.0), with the area under the curve covered by rectangles.]
SERIAL PI PROGRAM
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
SERIAL PI PROGRAM WITH TIMING
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0, tdata;
  step = 1.0/(double) num_steps;
  tdata = omp_get_wtime();
  for (i=0; i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  tdata = omp_get_wtime() - tdata;
  printf(" pi = %f in %f secs\n", pi, tdata);
}

The library routine omp_get_wtime() is used to find the elapsed "wall time" for blocks of code.
SHARED MEMORY HARDWARE
AND
MEMORY CONSISTENCY
BASIC SHARED MEMORY ARCHITECTURE
• PROCESSORS ALL CONNECTED TO A LARGE SHARED MEMORY
– WHERE ARE CACHES?
[Figure: processors P1, P2, …, Pn connected through an interconnect to a single shared memory.]
• Now take a closer look at structure, costs, limits, programming
WHAT ABOUT CACHING???
[Figure: processors P1 … Pn, each with its own cache ($), connected by a bus to memory and I/O devices.]
• WANT HIGH PERFORMANCE FOR SHARED MEMORY: USE CACHES!
– EACH PROCESSOR HAS ITS OWN CACHE (OR MULTIPLE CACHES)
– PLACE DATA FROM MEMORY INTO CACHE
– WRITEBACK CACHE: DON’T SEND ALL WRITES OVER BUS TO MEMORY
• CACHES REDUCE AVERAGE LATENCY
– AUTOMATIC REPLICATION CLOSER TO PROCESSOR
– MORE IMPORTANT TO MULTIPROCESSOR THAN UNIPROCESSOR: LATENCIES LONGER
• NORMAL UNIPROCESSOR MECHANISMS TO ACCESS DATA
– LOADS AND STORES FORM VERY LOW-OVERHEAD COMMUNICATION PRIMITIVE
• PROBLEM: CACHE COHERENCE!
EXAMPLE CACHE COHERENCE PROBLEM
[Figure: processors P1, P2, and P3, each with a cache, connected by a bus to memory and I/O devices; memory initially holds u:5. Events: (1) P1 reads u and caches u:5; (2) P3 reads u and caches u:5; (3) P3 writes u = 7 into its own cache; (4) P1 reads u again; (5) P2 reads u.]
• THINGS TO NOTE:
– PROCESSORS COULD SEE DIFFERENT VALUES FOR U AFTER EVENT 3
– WITH WRITE BACK CACHES, VALUE WRITTEN BACK TO MEMORY DEPENDS ON HAPPENSTANCE OF WHICH
CACHE FLUSHES OR WRITES BACK VALUE WHEN
• HOW TO FIX WITH A BUS: COHERENCE PROTOCOL
– USE BUS TO BROADCAST WRITES OR INVALIDATIONS
– SIMPLE PROTOCOLS RELY ON PRESENCE OF BROADCAST MEDIUM
• BUS NOT SCALABLE BEYOND ABOUT 100 PROCESSORS (MAX)
– CAPACITY, BANDWIDTH LIMITATIONS
SNOOPY CACHE-COHERENCE PROTOCOLS
[Figure: processors P0 … Pn, each with a cache holding state, address, and data fields and a bus-snooping controller, attached together with the memory modules (Mem) to a shared memory bus; each cache snoops the memory operations that other processors issue on the bus.]
• MEMORY BUS IS A BROADCAST MEDIUM
• CACHES CONTAIN INFORMATION ON WHICH ADDRESSES THEY STORE
• CACHE CONTROLLER “SNOOPS” ALL TRANSACTIONS ON THE BUS
– A TRANSACTION IS A RELEVANT TRANSACTION IF IT INVOLVES A CACHE BLOCK CURRENTLY
CONTAINED IN THIS CACHE
– TAKE ACTION TO ENSURE COHERENCE
– INVALIDATE, UPDATE, OR SUPPLY VALUE
– MANY POSSIBLE DESIGNS (SEE CS252 OR CS258)
INTUITIVE MEMORY MODEL
• READING AN ADDRESS SHOULD RETURN THE LAST VALUE WRITTEN TO THAT ADDRESS
• EASY IN UNIPROCESSORS
– EXCEPT FOR I/O
• CACHE COHERENCE PROBLEM IN MPS IS MORE PERVASIVE AND MORE PERFORMANCE
CRITICAL
• MORE FORMALLY, THIS IS CALLED SEQUENTIAL CONSISTENCY:
“A MULTIPROCESSOR IS SEQUENTIALLY CONSISTENT IF THE RESULT OF ANY EXECUTION
IS THE SAME AS IF THE OPERATIONS OF ALL THE PROCESSORS WERE EXECUTED IN
SOME SEQUENTIAL ORDER, AND THE OPERATIONS OF EACH INDIVIDUAL PROCESSOR
APPEAR IN THIS SEQUENCE IN THE ORDER SPECIFIED BY ITS PROGRAM.” [LAMPORT,
1979]
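To make the definition concrete, here is a small sketch (not from the slides) in C11: a naive program would need sequential consistency to guarantee that the consumer sees data = 42 once it sees flag = 1. C11 atomics with the default seq_cst ordering (used by atomic_store/atomic_load below) provide exactly that ordering for the flag, so the pattern is safe even on hardware that reorders ordinary loads and stores.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;
static atomic_int flag = 0;        /* seq_cst by default */

static void *producer(void *arg) {
    data = 42;                     /* (1) write the payload       */
    atomic_store(&flag, 1);        /* (2) then publish it via flag */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)   /* (3) wait until published  */
        ;
    printf("data = %d\n", data);      /* (4) guaranteed to print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The slide's point is that sequential consistency is what programmers intuitively assume; compilers and hardware provide weaker orderings for plain variables, which is why synchronization constructs (locks, atomics, OpenMP barrier/critical) are needed.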
PARALLEL PI PROGRAM: A FIRST SPMD (SINGLE PROGRAM MULTIPLE DATA) VERSION
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 4
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{ int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i] * step;
}
PARALLEL PI PROGRAM (2013)
• Original Serial pi program with 100 million steps ran in 1.83 seconds*.
threads    1st SPMD (secs)
1          1.86
2          1.03
3          1.08
4          0.97
* Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte of DDR3 memory at 1.333 GHz.
PARALLEL PI PROGRAM (2020)
• Original Serial pi program with 1 billion steps ran in 0.985 seconds*.
threads    1st SPMD (secs)
1          1.102
2          0.512
4          0.280
8          0.276
*GCC with -O3 optimization on Apple macOS 10.14.6 with a quad core (8 HW threads) 2.8 GHz Intel Core i7 and 16 GByte of LPDDR3 memory at 2.133 GHz.
WHY SUCH POOR SCALING? FALSE SHARING
• IF INDEPENDENT DATA ELEMENTS HAPPEN TO SIT ON THE SAME CACHE LINE, EACH UPDATE WILL CAUSE THE
CACHE LINES TO “SLOSH BACK AND FORTH” BETWEEN THREADS … THIS IS CALLED “FALSE SHARING”.
[Figure: two cores, each running two HW threads and each with its own L1 cache; the cache line holding Sum[0], Sum[1], Sum[2], and Sum[3] is replicated in both L1 caches above the shared last-level cache and the connection to I/O and DRAM, so an update by any thread invalidates the other core's copy of the whole line.]
• If you promote scalars to an array to support creation of an SPMD program, the array elements are
contiguous in memory and hence share cache lines … Results in poor scalability.
• Solution: Pad arrays so elements you use are on distinct cache lines.
EXAMPLE: ELIMINATE FALSE SHARING BY PADDING THE SUM ARRAY
#include <omp.h>
static long num_steps = 100000; double step;
#define PAD 8          // assume 64 byte L1 cache line size
#define NUM_THREADS 4
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS][PAD];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  { int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id][0]=0.0; i< num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id][0] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i][0] * step;
}

Pad the array so each thread's sum value lives in a different cache line.
RESULTS*: PI PROGRAM PADDED ACCUMULATOR (2013)
• Original Serial pi program with 100000000 steps ran in 1.83 seconds.
threads    1st SPMD (secs)    1st SPMD padded (secs)
1          1.86               1.86
2          1.03               1.01
3          1.08               0.69
4          0.97               0.53
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte of DDR3 memory at 1.333 GHz.
RESULTS*: PI PROGRAM PADDED ACCUMULATOR (2020)
• Original Serial pi program with 1 billion steps ran in 0.985 seconds*.
threads    1st SPMD (secs)    1st SPMD padded (secs)
1          1.102              0.987
2          0.512              0.496
4          0.280              0.271
8          0.276              0.268
*GCC with -O3 optimization on Apple macOS 10.14.6 with a quad core (8 HW threads) 2.8 GHz Intel Core i7 and 16 GByte of LPDDR3 memory at 2.133 GHz.
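Padding works, but it depends on knowing the cache-line size. A minimal sketch (assuming the same num_steps as above) of the alternative hinted at in the Common Core table: let OpenMP do the accumulation with a reduction clause, so there is no shared sum[] array to false-share at all.

#include <omp.h>
#include <stdio.h>

static long num_steps = 100000000;

int main(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    /* Each thread gets a private copy of sum; OpenMP combines them at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    double pi = step * sum;
    printf("pi = %f in %f secs\n", pi, omp_get_wtime() - t0);
    return 0;
}

A reduction-based version would be expected to perform comparably to the padded SPMD code while being shorter and independent of the cache-line size.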