Parallel and Distributed
Programming
Dr. Muhammad Naveed Akhtar
Lecture – 03
Parallel Software
Roadmap
• Parallel software
• Input and output
• Performance
• Parallel program design
• Writing and running parallel programs
• Assumptions
Parallel Software
The burden is on software
• Hardware and compilers can keep up with the pace needed.
• From now on…
  • In shared memory programs:
    • Start a single process and fork threads.
    • Threads carry out tasks.
  • In distributed memory programs:
    • Start multiple processes.
    • Processes carry out tasks.
• An SPMD (single program, multiple data) program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches (see the sketch below):

if (I'm thread/process i)
    do this;
else
    do that;
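A minimal shared-memory sketch of the SPMD idea, assuming POSIX threads (my own illustration, not the lecture's code): one executable forks several threads, and each thread branches on its rank, so the single program behaves like different programs.

/* SPMD sketch: one process forks THREAD_COUNT threads; each thread
   branches on its rank.  Compile with e.g.  gcc spmd.c -lpthread      */
#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT 4

void* Thread_work(void* rank) {
    long my_rank = (long) rank;
    if (my_rank == 0)
        printf("Thread %ld > doing \"this\"\n", my_rank);   /* do this; */
    else
        printf("Thread %ld > doing \"that\"\n", my_rank);   /* do that; */
    return NULL;
}

int main(void) {
    pthread_t threads[THREAD_COUNT];
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_create(&threads[t], NULL, Thread_work, (void*) t);
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_join(threads[t], NULL);
    return 0;
}

The output order may differ from run to run, which already hints at the nondeterminism discussed later.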
Writing Parallel Programs
• Divide the work among the processes/threads so that:
• each process/thread gets roughly the same amount of work
• communication is minimized.
• Arrange for the processes/threads to synchronize.
• Arrange for communication among processes/threads.
double x[n], y[n];
…
for (i = 0; i < n; i++)
x[i] += y[i];
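A hedged sketch of one common way to divide the loop's iterations so that each thread gets roughly the same amount of work (block partitioning; block_range and the driver in main are my own illustration, not the lecture's code):

/* Block partitioning: split n iterations over thread_count threads so
   that block sizes differ by at most one element.                     */
#include <stdio.h>

void block_range(int my_rank, int thread_count, int n,
                 int* my_first, int* my_last) {
    int q = n / thread_count;              /* base block size     */
    int r = n % thread_count;              /* leftover iterations */
    if (my_rank < r) {
        *my_first = my_rank * (q + 1);
        *my_last  = *my_first + q + 1;     /* exclusive bound */
    } else {
        *my_first = r * (q + 1) + (my_rank - r) * q;
        *my_last  = *my_first + q;
    }
}

int main(void) {
    int n = 10, thread_count = 4;
    for (int rank = 0; rank < thread_count; rank++) {
        int first, last;
        block_range(rank, thread_count, n, &first, &last);
        /* thread `rank` would run: for (i = first; i < last; i++) x[i] += y[i]; */
        printf("thread %d handles [%d, %d)\n", rank, first, last);
    }
    return 0;
}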
Shared Memory
• Dynamic threads
  • A master thread waits for work and forks new threads as needed; the threads terminate when their work is done.
  • Efficient use of resources, but thread creation and termination is time consuming.
• Static threads
  • A pool of threads is created and allocated work; the threads do not terminate until cleanup.
  • Better performance, but potential waste of system resources.
Nondeterminism
• Nondeterminism arises when some information, such as the order in which threads or processes run, is not known before the program starts executing.
• Nondeterministic amounts of work arise from loops and conditional branching.
• Such programs can be scheduled dynamically; however, dynamic scheduling consumes time and resources, which adds overhead during program execution.
my_val = Compute_val(my_rank);
printf("Thread %d > my_val = %d\n", my_rank, my_val);

Possible output of run 1:          Possible output of run 2:
Thread 0 > my_val = 7              Thread 1 > my_val = 19
Thread 1 > my_val = 19             Thread 0 > my_val = 7

The output order depends on how the threads are scheduled. If each thread instead executes x += my_val; on a shared variable x, the timing of the accesses also affects the result (next slide).
Nondeterminism
Race Condition
• When two or more threads access the same resource at the same time and the result depends on which access happens first.

Critical Section
• The portion of each thread/process in which shared variables are accessed.
• Must be executed mutually exclusively (one at a time).
• Enforced with a mutual exclusion lock (mutex, or simply lock).

Example: two threads each withdraw $50 from a shared balance of $125.

Time   Thread 1        Thread 2        Balance
1      Withdraw $50    Withdraw $50    $125
2      Read Balance                    $125
3                      Read Balance    $125
4      Set Balance                     $75
5                      Set Balance     $75

Both withdrawals run, yet the final balance is $75 instead of $25: a race condition.

Protecting the update with a lock makes the critical section mutually exclusive:

my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
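A minimal pthreads version of the lock idea above, assuming POSIX threads (my own sketch; the generic Lock/Unlock become pthread_mutex_lock/pthread_mutex_unlock, and Compute_val is a placeholder):

/* Mutex sketch: several threads add to a shared x inside a critical
   section.  Compile with e.g.  gcc mutex.c -lpthread                  */
#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT 4

int x = 0;                                        /* shared variable */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

int Compute_val(long my_rank) { return (int)(my_rank + 1) * 7; }

void* Thread_work(void* rank) {
    long my_rank = (long) rank;
    int my_val = Compute_val(my_rank);            /* outside critical section */

    pthread_mutex_lock(&add_my_val_lock);         /* enter critical section   */
    x += my_val;
    pthread_mutex_unlock(&add_my_val_lock);       /* leave critical section   */
    return NULL;
}

int main(void) {
    pthread_t threads[THREAD_COUNT];
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_create(&threads[t], NULL, Thread_work, (void*) t);
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_join(threads[t], NULL);
    printf("x = %d\n", x);                        /* deterministic: 7+14+21+28 = 70 */
    return 0;
}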
Busy – Waiting
• A thread repeatedly tests a condition, but, effectively, does no useful work until the condition has
the appropriate value
my_val = Compute_val(my_rank);
if (my_rank == 1)
    while (!ok_for_1);        /* Busy-wait loop */
x += my_val;                  /* Critical section */
if (my_rank == 0)
    ok_for_1 = true;          /* Let thread 1 update x */
Message – Passing
char message[100];
. . .
my_rank = Get_rank();
if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
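One way the generic Send/Receive pseudocode above might look with real MPI calls (a sketch, not the lecture's code; compile with mpicc and run with at least two processes, e.g. mpiexec -n 2 ./greetings):

/* Message passing with MPI: process 1 sends a greeting to process 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    char message[100];
    int  my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}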
Partitioned Global Address Space (PGAS)
• Explicitly-parallel programming model with SPMD parallelism
• The number of threads is fixed at program start-up, typically one thread per processor
• Global address space model of memory
• Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
• Local vs. remote memory (two-level hierarchy)
• Programmer control over performance critical decisions
• Data layout and communication
• Performance transparency and tunability are goals
• Initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)
Global Address Space Eases Programming
• The languages share the global address space abstraction
• Shared memory is logically partitioned by processors
• Remote memory may stay remote: no automatic caching implied
• One-sided communication: reads/writes of shared variables
• Both individual and bulk memory copies
• Languages differ on details
• Some models have a separate private memory area
• Distributed array generality and how they are constructed
[Figure: the partitioned global address space. Each of threads 0 … N owns one partition; the shared area holds the distributed data (X[0], X[1], …, X[P]), and each thread also has a private area (e.g. local pointers ptr).]
Partitioned Global Address Space Example
shared int n = ... ;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = ... ;
my_last_element = ... ;

/* Initialize x and y */
...

for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];
Current Implementations of PGAS Languages
• A successful language/library must run everywhere
• UPC (Unified Parallel C)
• Commercial compilers available on Cray, SGI, HP machines
• Open source compiler from LBNL/UCB (source-to-source), gcc-based compiler
• CAF (Co-Array Fortran)
• Commercial compiler available on Cray machines
• Open source compiler available from Rice
• Titanium
• Open source compiler from UCB runs on most machines
• Common tools
• Open64 open source research compiler infrastructure
• ARMCI, GASNet for distributed memory implementations
• Pthreads, System V shared memory
Input and Output
• In distributed memory programs, only process 0 will access stdin. In shared memory programs, only
the master thread or thread 0 will access stdin.
• In both distributed memory and shared memory programs all the processes/threads can access
stdout and stderr.
• However, because of the indeterminacy of the order of output to stdout, in most cases only a single
process/thread will be used for all output to stdout other than debugging output.
• Debug output should always include the rank or id of the process/thread that’s generating the
output.
• Only a single process/thread will attempt to access any single file other than stdin, stdout, or
stderr. So, for example, each process/thread can open its own, private file for reading or writing,
but no two processes/threads will open the same file.
Performance
Speedup
• Serial run-time = Tserial
• Number of cores = p
• Parallel run-time = Tparallel
• Ideal parallel run-time: $T_{parallel} = \dfrac{T_{serial}}{p}$
• Speedup of a parallel program: $S = \dfrac{T_{serial}}{T_{parallel}}$

[Figure: a serial program taking 100 time units, run in parallel (e.g. on 4 cores) under different conditions]
• Perfect parallelization: $S = 100/25 = 4.0$
• Perfect load balancing: $S = 100/35 \approx 2.85$
• Load imbalance: $S = 100/40 = 2.5$
• Load imbalance & synchronization (close to real-life problems): $S = 100/50 = 2.0$
Efficiency of Parallel Program
$E = \dfrac{S}{p} = \dfrac{T_{serial}/T_{parallel}}{p} = \dfrac{T_{serial}}{p \cdot T_{parallel}}$
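A quick worked example with made-up numbers: if $T_{serial} = 100$ and $T_{parallel} = 30$ on $p = 4$ cores, then

$S = \dfrac{100}{30} \approx 3.33, \qquad E = \dfrac{S}{p} = \dfrac{3.33}{4} \approx 0.83$

i.e. on average each core spends about 83% of its time doing useful work.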
Effect of Problem Size
[Figure: the same program run with N/2 items, N items, and N×2 items.]
Effect of Overhead
• Ideally: $T_{parallel} = \dfrac{T_{serial}}{p}$, i.e. $p \cdot T_{parallel} = T_{serial}$
• Practically: $T_{parallel} = \dfrac{T_{serial}}{p} + T_{overhead}$, so $p \cdot T_{parallel} > T_{serial}$
Amdahl’s Law
• Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited
— regardless of the number of cores available.
[Figure: on one processor the run-time consists of a serial section of length $r \cdot T_{serial}$ and a parallelizable section of length $(1-r) \cdot T_{serial}$; on $p$ processors the serial section is unchanged while the parallelizable section shrinks to $(1-r) \cdot T_{serial}/p$.]
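Putting the two pieces of the figure together gives the usual closed form of Amdahl's law, with $r$ the serial fraction:

$T_{parallel} = r \cdot T_{serial} + \dfrac{(1-r) \cdot T_{serial}}{p}, \qquad S = \dfrac{T_{serial}}{T_{parallel}} = \dfrac{1}{r + (1-r)/p} \le \dfrac{1}{r}$

So even with an unlimited number of cores the speedup can never exceed $1/r$.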
Consequence of Amdahl’s Law
• For a given problem instance, adding additional processors gives diminishing returns.
  • Only relatively few processors can be used efficiently.
• Way around: increase the problem size.
  • The sequential part tends to grow more slowly than the parallel part.
• A system is scalable if efficiency can be maintained by increasing the problem size.

Example
• Assume 90% of a serial program is perfectly parallelizable and $T_{serial} = 20$ seconds, so $r = 0.1$.
• Run-time of the parallelizable part: $(1-r) \cdot T_{serial}/p = 0.9 \times 20/p = 18/p$
• Run-time of the "un-parallelizable" part: $r \cdot T_{serial} = 0.1 \times 20 = 2$
• Overall parallel run-time: $T_{parallel} = 18/p + 2$
• Speedup: $S = \dfrac{T_{serial}}{T_{parallel}} = \dfrac{20}{18/p + 2}$
• As $p \to \infty$, $S \to 10$ (see the table below).
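Plugging a few values of $p$ into this speedup makes the diminishing returns concrete (values rounded to one decimal):

p :   1     2     4     8     16    → ∞
S :  1.0   1.8   3.1   4.7   6.4   → 10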
Sources of Parallel Overheads
• Overhead of creating threads/processes
• Synchronization
• Load imbalance
• Communication
• Extra computation
• Memory access (for both sequential and parallel!)
Scalability
• In general, a problem is scalable if it can handle ever-increasing problem sizes.
• If we can increase the number of processes/threads and keep the efficiency fixed without increasing the problem size, the program is strongly scalable.
• If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the program is weakly scalable.
Taking Timings (How to measure time)
• What is "time"?
  • Start to finish?
  • A program segment of interest?
  • CPU time?
  • Wall clock time?
• MPI and OpenMP provide wall-clock timers for this: MPI_Wtime() and omp_get_wtime() (see the sketch below).
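A minimal sketch of timing a program segment with omp_get_wtime(), assuming an OpenMP-capable compiler (the loop is just a stand-in workload, not the lecture's tmeasure.c; compile with e.g. gcc -fopenmp timing.c):

/* Measure the wall-clock time of a code segment with omp_get_wtime(). */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int    n   = 100000000;
    double sum = 0.0;

    double start = omp_get_wtime();     /* wall time before the segment */
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);
    double finish = omp_get_wtime();    /* wall time after the segment  */

    printf("sum = %f\n", sum);
    printf("elapsed wall time: %e seconds\n", finish - start);
    return 0;
}

MPI_Wtime() is used the same way inside an MPI program; in both cases only the difference between two calls is meaningful.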
Taking Timings (Distributed Memory Systems)
$ gcc -o tmeasure tmeasure.c -lm
$ ./tmeasure
clock resolution: 1000000
res: 1.000000e+00
start/stop: 0.000000e+00,8.730000e+00
Time: 8.730000e+00
Three Different Types of Time
• Wall Time
• Time span a “clock on the wall” would measure, Time elapsed between start and completion of the program.
• This is usually the time to be minimized.
• User Time
  • The CPU time actually consumed by the program itself.
  • On 1 CPU the user time is at most the wall time; with multiple CPUs it can exceed the wall time.
  • If the user time is much smaller than the wall time, the program has to wait a lot (for CPU time allocation, for data from the RAM or from the hard disk).
  • Such waits are indications for necessary optimizations.
• System Time
• Time used not by the program itself, but by the operating system, e.g. for allocating memory or hard disk access.
• System time should stay low.
Measuring Program Runtime
• The Linux command time, e.g.: time ls -l
• For performance analysis, we want to know the runtime of individual parts of a program.
• MPI and OpenMP have their own, platform-independent functions for time measurement.
  • MPI_Wtime() and omp_get_wtime() return the wall time in seconds; the difference between the results of two such calls is the runtime elapsed between them.
• Advanced method of performance analysis: profiling (various tools: gprof, Jumpshot, PMPI, Vampir)
Using the gprof (GNU Profiler) tool
• Compile my_program.c
gcc -pg my_program.c
./a.out
gprof a.out
Assignment
my_program.c (will be available on Canvas)
What is this program doing?
Explain the gprof output in words.
Profiling Tools for MPI
Jumpshot PMPI Vampir
Parallel Program Design
Foster’s methodology
Partitioning
• Divide the computation to be performed and the data operated on by the computation into small tasks.
• Identifying tasks that can be executed in parallel
Communication
• Determine what communication needs to be carried out among the tasks identified in the previous step
Agglomeration or aggregation
• Combine tasks and communications identified in the first step into larger tasks.
Mapping
• Assign the composite tasks identified in the previous step to processes/threads
• Each process/thread gets roughly the same amount of work
Example - Histogram
Sample data (20 measurements):
1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9

• Input
  • The number of measurements: data_count
  • An array of data_count floats: data
  • The minimum value for the bin containing the smallest values: min_meas
  • The maximum value for the bin containing the largest values: max_meas
  • The number of bins: bin_count
• Output
  • bin_maxes : an array of bin_count floats
  • bin_counts : an array of bin_count ints
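A serial sketch of the histogram computation in C (my own illustration: the variable names follow the slide, but the concrete values min_meas = 0, max_meas = 5, bin_count = 5 and the helper Find_bin are assumptions):

/* Serial histogram: count how many measurements fall into each bin. */
#include <stdio.h>

int Find_bin(float value, const float bin_maxes[], int bin_count) {
    for (int b = 0; b < bin_count; b++)
        if (value < bin_maxes[b]) return b;
    return bin_count - 1;                 /* largest value goes in last bin */
}

int main(void) {
    float data[] = {1.3f, 2.9f, 0.4f, 0.3f, 1.3f, 4.4f, 1.7f, 0.4f, 3.2f, 0.3f,
                    4.9f, 2.4f, 3.1f, 4.4f, 3.9f, 0.4f, 4.2f, 4.5f, 4.9f, 0.9f};
    int   data_count = 20, bin_count = 5;
    float min_meas = 0.0f, max_meas = 5.0f;
    float bin_width = (max_meas - min_meas) / bin_count;
    float bin_maxes[5];
    int   bin_counts[5] = {0};

    for (int b = 0; b < bin_count; b++)       /* upper bound of each bin */
        bin_maxes[b] = min_meas + bin_width * (b + 1);

    for (int i = 0; i < data_count; i++)      /* count measurements per bin */
        bin_counts[Find_bin(data[i], bin_maxes, bin_count)]++;

    for (int b = 0; b < bin_count; b++)
        printf("bin %d (max %.1f): %d\n", b, bin_maxes[b], bin_counts[b]);
    return 0;
}

Parallelizing this with Foster's methodology means treating each measurement (or block of measurements) as a task, giving each thread/process a local bin_counts array, and then combining the local arrays, as the following slides outline.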
First two stages of Foster’s Methodology
Alternative definition of tasks and communication
Adding the local arrays
Concluding Remarks
• Serial systems
  • The standard model of computer hardware has been the von Neumann architecture.
• Parallel hardware
  • Flynn's taxonomy.
• Parallel software
  • We focus on software for homogeneous MIMD systems, consisting of a single program that obtains parallelism by branching.
  • SPMD programs.
• Input and Output
  • One process/thread can access stdin, and all processes/threads can access stdout and stderr.
  • However, except for debug output we usually have a single process/thread accessing stdout.
• Performance
  • Speedup, Efficiency
  • Amdahl's law, Scalability
• Parallel Program Design
  • Foster's methodology
Questions and comments?