Shared Memory Programming
with Pthreads
# Chapter Subtitle
Outline
• Shared memory programming: Overview
• POSIX pthreads
• Critical section & thread synchronization.
Mutexes.
Producer-consumer synchronization and
semaphores.
Barriers and condition variables.
Read-write locks.
• Thread safety.
Shared Memory Architecture
Processes and Threads
• A process is an instance of a running (or
suspended) program.
• Threads are analogous to a “light-weight” process.
• In a shared memory program a single process may
have multiple threads of control.
Logical View of Threads
• Threads are created within a process
A process Process hierarchy
T2
T4
T1
shared code, data P1
and kernel context
sh sh sh
T5 T3
foo
Concurrent Thread Execution
• Two threads run concurrently if their logical flows
overlap in time
• Examples:
Concurrent: Thread A Thread B Thread C
A & B, A&C
B&C
Time
Execution Flow on one-core or multi-core
systems
Concurrent execution on a single core system
Parallel execution on a multi-core system
Benefits of multi-threading
• Responsiveness
• Resource Sharing
Shared memory
• Economy
• Scalability
Explore multi-core CPUs
Thread Programming with Shared Memory
• Program is a collection of threads of control.
Can be created dynamically
• Each thread has a set of private variables, e.g., local stack
variables
• Also a set of shared variables, e.g., static variables, shared
common blocks, or global heap.
Threads communicate implicitly by writing and reading
shared variables.
Threads coordinate by synchronizing on shared
variables
Shared memory
s
s = ...
i: 2 i: 5 Private i: 8
memory 9
P0 P1 Pn
Shared Memory
Programming
Several Thread Libraries/systems
• Pthreads is the POSIX Standard
Relatively low level
Portable but possibly slow; relatively heavyweight
• OpenMP standard for application level programming
Support for scientific programming on shared memory
http://www.openMP.org
• Java Threads
• TBB: Thread Building Blocks
Intel
• CILK: Language of the C “ilk”
Lightweight threads embedded into C
10
Creation of Unix processes vs. Pthreads
C function for starting a thread
pthread.h
One object
for each
pthread_t thread.
int pthread_create (
pthread_t* thread_p /* out */ ,
const pthread_attr_t* attr_p /* in */ ,
void* (*start_routine ) ( void ) /* in */ ,
void* arg_p /* in */ ) ;
A closer look (1)
int pthread_create (
pthread_t* thread_p /* out */ ,
const pthread_attr_t* attr_p /* in */ ,
void* (*start_routine ) ( void ) /* in */ ,
void* arg_p /* in */ ) ;
We won’t be using, so we just pass NULL.
Allocate before calling.
A closer look (2)
int pthread_create (
pthread_t* thread_p /* out */ ,
const pthread_attr_t* attr_p /* in */ ,
void* (*start_routine ) ( void ) /* in */ ,
void* arg_p /* in */ ) ;
Pointer to the argument that should
be passed to the function start_routine.
The function that the thread is to run.
Function started by pthread_create
• Prototype:
void* thread_function ( void* args_p ) ;
• Void* can be cast to any pointer type in C.
• So args_p can point to a list containing one or
more values needed by thread_function.
• Similarly, the return value of thread_function can
point to a list of one or more values.
Wait for Completion of Threads
pthread_join(pthread_t *thread, void
**result);
Wait for specified thread to finish. Place exit value
into *result.
• We call the function pthread_join once for each
thread.
• A single call to pthread_join will wait for the thread
associated with the pthread_t object to complete.
Example of Pthreads
#include <pthread.h>
#include <stdio.h>
void *PrintHello(void * id){
printf(“Thread%d: Hello World!\n", id);
}
void main (){
pthread_t thread0, thread1;
pthread_create(&thread0, NULL, PrintHello, (void *) 0);
pthread_create(&thread1, NULL, PrintHello, (void *) 1);
}
Example of Pthreads with join
#include <pthread.h>
#include <stdio.h>
void *PrintHello(void * id){
printf(“Hello from thread %d\n", id);
}
void main (){
pthread_t thread0, thread1;
pthread_create(&thread0, NULL, PrintHello, (void *) 0);
pthread_create(&thread1, NULL, PrintHello, (void *) 1);
pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
}
Some More Pthread Functions
• pthread_yield();
Informs the scheduler that the thread is willing to yield
• pthread_exit(void *value);
Exit thread and pass value to joining thread (if exists)
Others:
• pthread_t me; me = pthread_self();
Allows a pthread to obtain its own identifier pthread_t
thread;
• Synchronizing access to shared variables
pthread_mutex_init, pthread_mutex_[un]lock
pthread_cond_init, pthread_cond_[timed]wait
Compiling a Pthread program
gcc −g −Wall −o pth_hello pth_hello . c −lpthread
link in the Pthreads library
Running a Pthreads program
. / pth_hello
Hello from thread 1
Hello from thread 0
. / pth_hello
Hello from thread 0
Hello from thread 1
Difference between Single and Multithreaded
Processes
Shared memory access for code/data
Separate control flow -> separate stack/registers
CRITICAL SECTIONS
Data Race Example
static int s = 0;
Thread 0 Thread 1
for i = 0, n/2-1 for i = n/2, n-1
s = s + f(A[i]) s = s + f(A[i])
• Also called critical section problem.
• A race condition or data race occurs when:
- two processors (or two threads) access the same variable,
and at least one does a write.
- The accesses are concurrent (not synchronized) so they
could happen simultaneously
Synchronization Solutions
1. Busy waiting
2. Mutex (lock)
3. Semaphore
4. Conditional Variables
Example of Busy Waiting
static int s = 0;
static int flag=0
Thread 0 Thread 1
int temp, my_rank int temp, my_rank
for i = 0, n/2-1 for i = n/2, n-1
temp0=f(A[i]) temp=f(A[i])
while flag!=my_rank; while flag!=my_rank;
s = s + temp0 s = s + temp
flag= (flag+1) %2 flag= (flag+1) %2
• A thread repeatedly tests a condition, but, effectively, does no
useful work until the condition has the appropriate value.
• Weakness: Waste CPU resource. Sometime not safe with
compiler optimization.
Mutexes (Locks)
Acquire mutex lock
Critical section
• Code structure
Unlock/Release mutex
• Mutex (mutual exclusion) is a special type of variable
used to restrict access to a critical section to a single
thread at a time.
• guarantee that one thread “excludes” all other threads
while it executes the critical section.
• When A thread waits on a mutex/lock,
CPU resource can be used by others.
• Only thread that has acquired the lock
can release this lock
Execution example with 2 threads
Thread 1 Thread 2
Acquire mutex lock Acquire mutex lock
Critical section
Unlock/Release mutex
Critical section
Unlock/Release mutex
Mutexes in Pthreads
• A special type for mutexes: pthread_mutex_t.
• To gain access to a critical section, call
• To release
• When finishing use of a mutex, call
Global sum function that uses a mutex (1)
Global sum function that uses a mutex (2)
Semaphore: Generalization from mutex
locks
• Semaphore S – integer variable
• Can only be accessed /modified via two
(atomic) operations with the following
semantics:
wait (S) { //also called P()
while S <= 0 wait in a queue;
S--;
}
post(S) { //also called V()
S++;
Wake up a thread that waits in the queue.
}
Why Semaphores?
Synchronization Functionality/weakness
Busy waiting Spinning for a condition. Waste
resource. Not safe
Mutex lock Support code with simple mutual
exclusion
Semaphore Handle more complex signal-based
synchronization
• Examples of complex synchronization
Allow a resource to be shared among multiple
threads.
– Mutex: no more than 1 thread for one protected region.
Allow a thread waiting for a condition after a signal
– E.g. Control the access order of threads entering the
critical section.
– For mutexes, the order is left to chance and the system.
Syntax of Pthread semaphore functions
Semaphores are not part of Pthreads;
you need to add this.
Producer-consumer
Synchronization and
Semaphores
Producer-Consumer Example
T0 T1 T2
• Thread x produces a message for Thread x+1.
Last thread produces a message for thread 0.
• Each thread prints a message sent from its source.
• Will there be null messages printed?
A consumer thread prints its source message before
this message is produced.
How to avoid that?
Flag-based Synchronization with 3 threads
Thread 0
Thread 1 Thread 2
Write a msg to #1 Write a msg to #2
Write a msg to #0
Set msg[1]
Set msg[2] Set msg[0]
If msg[0] is ready If msg[1] is ready If msg[2] is ready
Print msg[0] Print msg[1]
Print msg[2]
To make sure a message is received/printed, use busy waiting.
First attempt at sending messages using pthreads
Produce a message for a destination
thread
Consume a message
Semaphore Synchronization with 3 threads
Thread 0
Thread 1 Thread 2
Write a msg to #1 Write a msg to #2
Write a msg to #0
Set msg[1]
Set msg[2] Set msg[0]
Post(semp[1])
Post(semp[2]) Post(semp[0])
Wait(semp[0])
Print msg[0] Wait(semp[1])
Wait(semp[2])
Print msg[1]
Print msg[2]
Message sending with semaphores
sprintf(my_msg, "Hello to %ld from %ld", dest, my_rank);
messages[dest] = my_msg;
sem_post(&semaphores[dest]);
/* signal the dest thread*/
sem_wait(&semaphores[my_rank]);
/* Wait until the source message is created */
printf("Thread %ld > %s\n", my_rank,
messages[my_rank]);
READERS-WRITERS PROBLEM
Synchronization Example for Readers-Writers Problem
• A data set is shared among a number of concurrent
threads.
Readers – only read the data set; they do not perform any
updates
Writers – can both read and write
• Requirement:
allow multiple readers to read at the same time.
Only one writer can access the shared data at the same
time.
• Reader/writer access permission table:
Reader Writer
Reader OK No
Writer NO No
Readers-Writers (First try with 1 mutex lock)
• writer
do {
mutex_lock(w);
// writing is performed
mutex_unlock(w);
Reader Writer
} while (TRUE);
• Reader Reader ? ?
do { Writer ? ?
mutex_lock(w);
// reading is performed
mutex_unlock(w);
} while (TRUE);
Readers-Writers (First try with 1 mutex lock)
• writer
do {
mutex_lock(w);
// writing is performed
mutex_unlock(w);
Reader Writer
} while (TRUE);
• Reader Reader no no
do { Writer no no
mutex_lock(w);
// reading is performed
mutex_unlock(w);
} while (TRUE);
2nd try using a lock + readcount
• writer
do {
mutex_lock(w);// Use writer mutex lock
// writing is performed
mutex_unlock(w);
} while (TRUE);
• Reader
do {
readcount++; // add a reader counter.
if(readcount==1) mutex_lock(w);
// reading is performed
readcount--;
if(readcount==0) mutex_unlock(w);
} while (TRUE);
Readers-Writers Problem with semaphone
• Shared Data
Data set
Lock mutex (to protect readcount)
Semaphore wrt initialized to 1 (to
synchronize between
readers/writers)
Integer readcount initialized to 0
Readers-Writers Problem
• A writer
do {
sem_wait(wrt) ; //semaphore wrt
// writing is performed
sem_post(wrt) ; //
} while (TRUE);
Readers-Writers Problem (Cont.)
• Reader
do {
mutex_lock(mutex);
readcount ++ ;
if (readcount == 1)
sem_wait(wrt); //check if anybody is
writing
mutex_unlock(mutex)
// reading is performed
mutex_lock(mutex);
readcount - - ;
if (readcount == 0)
sem_post(wrt) ; //writing is allowed now
nlock(mutex) ;
} while (TRUE);
Barriers
• Synchronizing the threads to make sure that they all
are at the same point in a program is called a barrier.
• No thread can cross the barrier until all the threads
have reached it.
• Availability:
No barrier provided by
Pthreads library and needs
a custom implementation
Barrier is implicit in
OpenMP
and available in MPI.
Condition Variables
• Why?
• More programming primitives to simplify code for
synchronization of threads
Synchronization Functionality
Busy waiting Spinning for a condition. Waste resource.
Not safe
Mutex lock Support code with simple mutual
exclusion
Semaphore Signal-based synchronization. Allow
sharing (not wait unless semaphore=0)
Barrier Rendezvous-based synchronization
Condition More complex synchronization: Let
variables threads wait until a user-defined
condition becomes true
Synchronization Primitive: Condition Variables
• Used together with a lock
• One can specify more general waiting
condition compared to semaphores.
• A thread is blocked when condition is no
true:
placed in a waiting queue, yielding
CPU resource to somebody else.
Wake up until receiving a signal
Pthread synchronization: Condition variables
int status; pthread_condition_t cond;
const pthread_condattr_t attr;
pthread_mutex mutex;
status = pthread_cond_init(&cond,&attr);
status = pthread_cond_destroy(&cond);
status = pthread_cond_wait(&cond,&mutex);
-wait in a queue until somebody wakes up. Then the mutex is
reacquired.
status = pthread_cond_signal(&cond);
- wake up one waiting thread.
status = pthread_cond_broadcast(&cond);
- wake up all waiting threads in that condition
How to Use Condition Variables: Typical
Flow
Thread 1: //try to get into critical section and
wait for the condition
Mutex_lock(mutex);
While (condition is not satisfied)
Cond_Wait(mutex, cond);
Critical Section;
Mutex_unlock(mutex)
Thread 2: // Try to create the condition.
Mutex_lock(mutex);
When condition can satisfy, Signal(cond);
Mutex_unlock(mutex);
Condition variables for in producer-
consumer problem with unbounded
buffer
Producer deposits data in a buffer for others to consume
First version for consumer-producer problem
with unbounded buffer
• int avail=0; // # of data items available for consumption
• Consumer thread:
while (avail <=0); //wait
Consume next item; avail = avail-1;
Producer thread:
Produce next item; avail = avail+1;
//notify an item is available
Condition Variables for consumer-producer
problem with unbounded buffer
• int avail=0; // # of data items available for consumption
• Pthread mutex m and condition cond;
• Consumer thread:
multex_lock(&m)
while (avail <=0) Cond_Wait(&cond, &m);
Consume next item; avail = avail-1;
mutex_unlock(&mutex)
Producer thread:
mutex_lock(&m);
Produce next item; availl = avail+1;
Cond_signal(&cond); //notify an item is
available
mutex_unlock(&m);
When to use condition broadcast?
• When waking up one thread to run
is not sufficient.
• Example: concurrent malloc()/free()
for allocation and deallocation of
objects with non-uniform sizes.
Running trace of malloc()/free()
• Initially 10 bytes are free.
• m() stands for malloc(). f() for free()
Thread 1: Thread 2: Thread 3:
m(10) – succ m(5) – wait m(5) – wait
f(10) –broadcast
Resume m(5)-succ
Resume m(5)-succ
m(7) – wait
m(3) –wait
f(5) –broadcast
Resume m(7)-wait Resume m(3)-succ
Time
Issues with Threads: False Sharing,
Deadlocks, Thread-safety
Problem: False Sharing
• Occurs when two or more processors/cores access
different data in same cache line, and at least one
of them writes.
Leads to ping-pong effect.
• Let’s assume we parallelize code with p=2:
for( i=0; i<n; i++ )
a[i] = b[i];
Each array element takes 8 bytes
Cache line has 64 bytes (8 numbers)
False Sharing: Example (2 of 3)
Execute this program in two processors
for( i=0; i<n; i++ )
a[i] = b[i];
cache line
a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]
Written by CPU 0
Written by CPU 1
False Sharing: Example Two CPUs execute:
for( i=0; i<n; i++ )
a[i] = b[i];
a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]
cache line
Written by CPU 0
Written by CPU 1
a[0] a[2] a[4] CPU0
inv data ...
a[1] a[3] a[5] CPU1
Matrix-Vector Multiplication with
Pthreads
Parallel programming book by Pacheco book P.159-162
Sequential code
Block Mapping for Matrix-Vector Multiplication
• Task partitioning
For (i=0; i<m; i=i+1)
Task Si for Row i
y[i]=0;
For (j=0; j<n; j=j+1)
y[i]=y[i] +a[i][j]*x[j]
Task graph
S0 S1 Sm
...
Mapping to
threads S0 S1 S2 S3
...
Thread 0 Thread 1
Using 3 Pthreads for 6 Rows: 2 row per
thread
S0, S1
S2, S3
S4,S5
Code for S0
Code for Si
Pthread code for thread with ID rank
i-th thread calls Pth_mat_vect( &i)
m is # of rows in this matrix A.
n is # of columns in this matrix A.
local_m is # of rows handled by
this thread.
Task Si
Impact of false sharing on performance of
matrix-vector multiplication
(times are in seconds)
Why is performance of
8x8,000,000 matrix bad?
How to fix that?
Deadlock and Starvation
• Deadlock – two or more threads are waiting
indefinitely for an event that can be only caused by
one of these waiting threads
• Starvation – indefinite blocking (in a waiting queue
forever).
Let S and Q be two mutex locks:
P0 P1
Lock(S); Lock(Q);
Lock(Q); Lock(S);
. .
. .
. .
Unlock(Q); Unlock(S);
Unlock(S); Unlock(Q);
Deadlock Avoidance
• Order the locks and always acquire the locks in
that order.
• Eliminate circular waiting
:
P0 P1
Lock(S); Lock(S);
Lock(Q); Lock(Q);
. .
. .
. .
Unlock(Q);
Unlock(Q);
Unlock(S);
Unlock(S);
Thread-Safety
• A block of code is thread-safe if it can be
simultaneously executed by multiple threads without
causing problems.
• When you program your own functions, you know if
they are safe to be called by multiple threads or not.
• You may forget to check if system library functions
used are thread-safe.
Unsafe function: strtok()from C string.h library
Other example.
– The random number generator random in stdlib.h.
– The time conversion function localtime in time.h.
Concluding Remarks
• A thread in shared-memory programming is analogous
to a process in distributed memory programming.
However, a thread is often lighter-weight than a full-
fledged process.
• When multiple threads access a shared resource
without controling, it may result in an error: we have a
race condition.
A critical section is a block of code that updates a
shared resource that can only be updated by one
thread at a time
Mutex, semaphore, condition variables
• Issues: false sharing, deadlock, thread safety