From C/C++11 to POWER and ARM:
What is Shared-Memory Concurrency, Anyway?
Susmit Sarkar
University of St Andrews
MMnet, Heriot Watt
May, 2013
Shared Memory Concurrency: Since 1962
Burroughs D825
(first multiprocessing computer)
Outstanding features include truly modular hardware with
parallel processing throughout.
FUTURE PLANS
The complement of compiling languages is to be expanded.
And Since 2011: In C/C++
ISO C/C++11: introduces a new concurrency model
Example: Message Passing (racy)
Initially:
d = 0; f = 0;
Thread 0
d = 1;
f = 1;
Thread 1
while (f == 0)
{};
r = d;
Finally: r = 0 ??
Programmer would hope this is Forbidden
In C/C++11, this has undefined semantics
Data race on d and f variables
C11: A Data Race Free Model
Idea: Programmer mistake to write Data Races
Basis of C11 Concurrency
Example (contd.): mark atomics
Mark atomic variables (accesses have memory order parameter)
Initially:
atomic d = 0; f = 0;
Thread 0
d.store(1,sc);
f.store(1,sc);
Thread 1
while (f.load(sc) == 0)
{};
r = d.load(sc);
Finally: r = 0 ??
Races on Atomic Accesses ignored (now have defined semantics)
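As a rough illustration of the marked-atomics version, the slide's pseudocode could be written in standard C++11 along the following lines. This is only a sketch: the function name message_passing_sc and the use of std::thread are assumptions for the example, and sc is spelled out as std::memory_order_seq_cst.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with sequentially consistent atomics.
// d is the data and f the flag, as on the slide; the function name is hypothetical.
int message_passing_sc() {
    std::atomic<int> d{0};
    std::atomic<int> f{0};
    int r = -1;

    std::thread t0([&] {
        d.store(1, std::memory_order_seq_cst);
        f.store(1, std::memory_order_seq_cst);
    });
    std::thread t1([&] {
        while (f.load(std::memory_order_seq_cst) == 0) {
            // spin until the flag is set
        }
        r = d.load(std::memory_order_seq_cst);
    });

    t0.join();
    t1.join();
    return r;  // reading r after join() is race free
}
```

With seq_cst accesses the program has no data race, and the load of d cannot return 0 once the flag has been seen.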
Shared Memory Concurrency
Multiple threads with a single shared memory
Question: How do we reason about it?
Answer [1979]: Sequential Consistency
. . . the result of any execution is the same
as if the operations of all the processors
were executed in some sequential order,
respecting the order specified by the program.
[Lamport, 1979]
Sequential Consistency
[Diagram: Thread 0, Thread 1, Thread 2, and Thread 3 all access a single (Shared) Memory]
Traditional assumption (concurrent algorithms, semantics,
verification): Sequential Consistency (SC)
Implies: can use interleaving semantics
False on modern (since 1972) multiprocessors, or with optimizing
compilers
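To make "interleaving semantics" concrete, here is a small sketch that enumerates every SC interleaving of a loop-free variant of the message-passing test (Thread 0: d = 1; f = 1; Thread 1: r1 = f; r2 = d). The loop-free variant, the State struct, and all function names are assumptions made for this example, not part of the talk.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Shared memory and Thread 1's registers for one partial execution.
struct State {
    int d = 0, f = 0;    // shared locations, initially 0
    int r1 = 0, r2 = 0;  // Thread 1 registers
};

// Thread 0: d = 1; f = 1;
static void step_thread0(State& s, int pc) {
    if (pc == 0) s.d = 1; else s.f = 1;
}

// Thread 1: r1 = f; r2 = d;
static void step_thread1(State& s, int pc) {
    if (pc == 0) s.r1 = s.f; else s.r2 = s.d;
}

// Explore every interleaving that respects each thread's program order.
static void explore(State s, int pc0, int pc1,
                    std::set<std::pair<int, int>>& out) {
    if (pc0 == 2 && pc1 == 2) {
        out.insert(std::make_pair(s.r1, s.r2));
        return;
    }
    if (pc0 < 2) {
        State n = s;
        step_thread0(n, pc0);
        explore(n, pc0 + 1, pc1, out);
    }
    if (pc1 < 2) {
        State n = s;
        step_thread1(n, pc1);
        explore(n, pc0, pc1 + 1, out);
    }
}

// All (r1, r2) outcomes reachable under sequential consistency.
std::set<std::pair<int, int>> sc_outcomes() {
    std::set<std::pair<int, int>> out;
    explore(State{}, 0, 0, out);
    return out;
}
```

Under this enumeration the outcome r1 = 1, r2 = 0 (flag seen, data stale) never appears, which is the SC counterpart of "r = 0 is forbidden".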
Our world is not SC
Not since IBM System 370/158MP (1972)
... nor in x86, ARM, POWER, SPARC, Itanium, ...
... nor in C, C++, Java, ...
Example (contd.): mark atomics relaxed
Mark atomic variables as relaxed (a memory-order parameter)
Initially:
atomic d = 0; f = 0;
Thread 0
d.store(1,rlx);
f.store(1,rlx);
Thread 1
while (f.load(rlx) == 0)
{};
r = d.load(rlx);
Finally: r = 0 ??
(Forbidden on SC)
Defined, and possible, in C/C++11
Allows for hardware (and compiler) optimisations
C11 Concurrency: An Axiomatic Model
Complete executions are considered
(threadwise operational, reading arbitrary values)
Relations defined over memory events (e.g. happens-before)
Predicate says whether execution is consistent
Further, no consistent execution should have races
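The axiomatic style can be sketched in a few lines. The example below is a heavily simplified illustration, not the C/C++11 model itself: it fixes one candidate execution of message passing with a release store and an acquire load on the flag, builds happens-before as the transitive closure of sequenced-before and synchronizes-with edges, and checks a single coherence-style condition on the read of d. The event names, the Relation type, and both functions are assumptions made for the example.

```cpp
#include <array>
#include <cassert>

constexpr int N = 6;  // number of events in the candidate execution
enum Event { InitD, InitF, WriteD, WriteF, ReadF, ReadD };

using Relation = std::array<std::array<bool, N>, N>;

// Happens-before for the candidate execution in which ReadF reads from WriteF.
Relation happens_before() {
    Relation hb{};  // all entries start out false
    for (int e = WriteD; e < N; ++e) {
        hb[InitD][e] = true;  // initialisation precedes every thread event
        hb[InitF][e] = true;
    }
    hb[WriteD][WriteF] = true;  // sequenced-before in Thread 0
    hb[ReadF][ReadD] = true;    // sequenced-before in Thread 1
    hb[WriteF][ReadF] = true;   // synchronizes-with (assumed release/acquire pair)
    // Transitive closure (Floyd-Warshall style).
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                if (hb[i][k] && hb[k][j]) hb[i][j] = true;
    return hb;
}

// Simplified consistency predicate for the read of d: the write it reads from
// must not be hidden by another write to d that happens before the read.
bool read_d_consistent(Event source) {
    const Relation hb = happens_before();
    const Event writes_to_d[] = {InitD, WriteD};
    for (Event w : writes_to_d) {
        if (w != source && hb[source][w] && hb[w][ReadD]) return false;
    }
    return true;
}
```

In this sketch the execution where ReadD reads the initial value is rejected, while the one where it reads from WriteD is accepted.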
Example (contd.): release-acquire synchronization
Mark release stores and acquire loads
Initially:
atomic d = 0; f = 0;
Thread 0
d.store(1,rlx);
f.store(1,rel);
Thread 1
while (f.load(acq) == 0)
{};
r = d.load(rlx);
Finally: r = 0 ??
(Forbidden on SC)
Forbidden in C/C++11 due to release-acquire synchronization
Implementation must ensure result not observed
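In standard C++11 the release-acquire version might look as follows. This is a sketch under assumptions: the function name message_passing_rel_acq and the std::thread scaffolding are not from the talk, and rlx, rel, acq are spelled out as the corresponding std::memory_order constants.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: relaxed data accesses, release store and acquire load on the flag.
int message_passing_rel_acq() {
    std::atomic<int> d{0};
    std::atomic<int> f{0};
    int r = -1;

    std::thread t0([&] {
        d.store(1, std::memory_order_relaxed);
        f.store(1, std::memory_order_release);  // publishes the earlier store to d
    });
    std::thread t1([&] {
        while (f.load(std::memory_order_acquire) == 0) {
            // spin until the release store is observed
        }
        r = d.load(std::memory_order_relaxed);  // ordered after the acquire load
    });

    t0.join();
    t1.join();
    return r;
}
```

Once the acquire load reads the release store, the store to d happens before the load of d, so the load cannot return 0.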
Implementation of acquire/release on POWER
Initially:
d = 0; f = 0;
Thread 0
st d 1;
lwsync;
st f 1;
Thread 1
loop:
ld f rtmp;
cmp rtmp 0;
beq loop;
isync;
ld d r;
Finally: r = 0 ??
Forbidden (and not observed) on POWER7, and ARM
lwsync prevents write reordering
control dependency with isync prevents read speculation
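A source-level analogue of this barrier placement can be sketched with C++11 fences. This is an illustration under assumptions, not the compiled code: the release fence sits where the lwsync does, and the acquire fence sits where the branch-plus-isync does (the mapping table in this talk implements both fences as lwsync). The function name and the choice of a plain int for d are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with explicit fences around relaxed flag accesses.
int message_passing_fences() {
    int d = 0;               // plain data, protected by the fences below
    std::atomic<int> f{0};   // flag
    int r = -1;

    std::thread t0([&] {
        d = 1;
        std::atomic_thread_fence(std::memory_order_release);  // orders d before f
        f.store(1, std::memory_order_relaxed);
    });
    std::thread t1([&] {
        while (f.load(std::memory_order_relaxed) == 0) {
            // spin until the flag is set
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // orders f before d
        r = d;
    });

    t0.join();
    t1.join();
    return r;
}
```

Because the two fences synchronize through the flag, the accesses to d do not race even though d is not atomic.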
Correct implementations of C/C++ on hardware
Can it be done?
  . . . on highly relaxed hardware? e.g. POWER/ARM
What is involved?
  Mapping new constructs to assembly
  Optimizations: which ones legal?
Implementing C/C++11 on POWER: Pointwise Mapping
C/C++11 Operation      POWER Implementation
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          lwsync; st
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
...                    ...
(From Paul McKenney and Raul Silvera)
Is that mapping correct?
Answer: No!
Store seq-cst needs hwsync rather than lwsync: lwsync; st becomes hwsync; st
Implementing C/C++11 on POWER: Pointwise Mapping
C/C++11 Operation      POWER Implementation
Store (non-atomic)     st
Load (non-atomic)      ld
Store relaxed          st
Store release          lwsync; st
Store seq-cst          hwsync; st
Load relaxed           ld
Load consume           ld (and preserve dependency)
Load acquire           ld; cmp; bc; isync
Load seq-cst           hwsync; ld; cmp; bc; isync
Fence acquire          lwsync
Fence release          lwsync
Fence seq-cst          hwsync
CAS relaxed            loop: lwarx; cmp; bc exit; stwcx.; bc loop; exit:
CAS seq-cst            hwsync; loop: lwarx; cmp; bc exit; stwcx.; bc loop; isync; exit:
...                    ...
(From Paul McKenney and Raul Silvera)
Is that mapping correct?
Answer: Yes!
Is that the only correct mapping?
Answer: No!
Alternative:
Store seq-cst          hwsync; st; hwsync
Load seq-cst           ld; hwsync
All compilers must agree for separate compilation
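A pointwise mapping is essentially a lookup from (operation, memory order) to an instruction sequence. The sketch below encodes the store, load, and fence rows of the corrected table (Store seq-cst as hwsync; st) as a function. The enum names and power_mapping are hypothetical, the CAS rows are left out for brevity, and combinations the table does not list return an empty string.

```cpp
#include <cassert>
#include <string>

enum class Op { Store, Load, Fence };
enum class Order { Relaxed, Consume, Acquire, Release, SeqCst };

// Returns the POWER instruction sequence for an atomic operation,
// following the corrected mapping table; "" means "not listed in the table".
std::string power_mapping(Op op, Order order) {
    switch (op) {
    case Op::Store:
        switch (order) {
        case Order::Relaxed: return "st";
        case Order::Release: return "lwsync; st";
        case Order::SeqCst:  return "hwsync; st";
        default:             return "";
        }
    case Op::Load:
        switch (order) {
        case Order::Relaxed: return "ld";
        case Order::Consume: return "ld (and preserve dependency)";
        case Order::Acquire: return "ld; cmp; bc; isync";
        case Order::SeqCst:  return "hwsync; ld; cmp; bc; isync";
        default:             return "";
        }
    case Op::Fence:
        switch (order) {
        case Order::Acquire: return "lwsync";
        case Order::Release: return "lwsync";
        case Order::SeqCst:  return "hwsync";
        default:             return "";
        }
    }
    return "";
}
```

Because each operation is translated independently of its context, two compilers that pick different tables can produce object files that are each fine in isolation but do not compose, which is why a single agreed mapping matters for separate compilation.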
Implementing C/C++11 on POWER correctly
Theorem: For any sane, non-optimising compiler following the mapping:
[Diagram: a C/C++ prog gives, under C/C++11 semantics, a C/C++11 execution with its observations; compilation yields a POWER prog, which gives, under POWER semantics, a POWER execution with its observations]
Showed previous mapping incorrect
Easily adapt proof for an alternative mapping
Benefits of a formal proof
Reasoning about industrial-strength concurrency
Enables:
Confidence in C/C++ and Power concurrency models
Confidence in compiler implementations [gcc]
Reasoning about C/C++ and Power
(Path to) Reasoning about ARM ??
POWER: Hardware Modeling
Hard to see an axiomatic characterisation
Model the microarchitecture (operational model)
But, have to be abstract
POWER operational model
[Diagram: each Thread sends Write, Read, and Barrier requests to the Storage Subsystem, which returns Read responses and Barrier acks]
Operational model of POWER [PLDI11]
Abstract view of microarchitecture
  Abstract (topology-independent) Storage Subsystem
  Speculation in threads visible
Labelled transition systems, synchronising on messages
2500 lines of formal mathematics, described in 3 pages of prose
Topology-Independent Storage Subsystem
[Diagram: five threads (Thread 1 to Thread 5), each paired with its own copy of memory (Memory 1 to Memory 5), with writes (W) propagating between them]
Do not expose topology
Equivalently: Copy of memory per thread
Have to take into account barriers/ordering instructions
Cumulativity: Programming on many threads
Initially:
d = 0; f = 0;
Thread 0
st d 1
Thread 1
ld rd d
lwsync
st f 1
Thread 2
loop: ld r1 f;
cmp r1 0;
beq loop;
isync;
ld r d;
Finally: rd = 1, r1 = 1, r = 0 ??
The lwsync is cumulative: it keeps the stores in order for all threads
Flipping the dependency and barrier does not recover SC
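At the C/C++11 level, the same three-thread shape can be sketched with a release store and an acquire load on the flag. This is an illustration under assumptions: the function name is hypothetical, and Thread 1 spins until it sees d = 1 (the slide uses a single load and asks about the case rd = 1). In C/C++11 the stale read is ruled out by happens-before together with coherence; on POWER, where the release store is compiled with a leading lwsync, the cumulativity of that lwsync is what carries Thread 0's write along.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Write-to-read causality across three threads.
int wrc_release_acquire() {
    std::atomic<int> d{0};
    std::atomic<int> f{0};
    int r = -1;

    std::thread t0([&] {
        d.store(1, std::memory_order_relaxed);
    });
    std::thread t1([&] {
        while (d.load(std::memory_order_relaxed) == 0) {
            // wait until Thread 0's write is visible here (rd = 1)
        }
        f.store(1, std::memory_order_release);
    });
    std::thread t2([&] {
        while (f.load(std::memory_order_acquire) == 0) {
            // wait for the flag (r1 = 1)
        }
        r = d.load(std::memory_order_relaxed);
    });

    t0.join();
    t1.join();
    t2.join();
    return r;
}
```

Thread 1's read of d = 1 happens before Thread 2's read of d, so by read-read coherence Thread 2 cannot then see the older value 0.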
A (slightly) More Complex Example
Initially:
data = 0; flag = 0;
Thread 0
data = 1;
lwsync;
flag = 1;
Thread 1
while (flag == 0)
{};
tmp = 1;
r1 = tmp;
r = data + (r1 xor r1);
Finally: r = 0 ??
Is that behaviour Allowed? Observable?
Observed on Power7; Allowed by the model
Overall Model Size
Explanation in 3 pages of prose
Microarchitectural intuitions
No extraneous concrete details
2500 lines of machine-processed math
In LEM [ITP11], a simple new semantic metalanguage
Can extract executable code, and theorem-prover code
With OCaml harness: interactive and exhaustive checker
Compilable to browser!
Validating the model
Extract executable code from definition, exhaustively enumerate
possible behaviours of tests
Run many iterations of tests on real hardware (Power G5, 6, 7)
Excerpt of results:
Test            Model    POWER 6                 POWER 7
WRC+sync+addr   Forbid   ok      0 / 16G         ok      0 / 110G
WRC+data+sync   Allow    ok      150k / 12G      ok      56k / 94G
PPOCA           Allow    unseen  0 / 39G         ok      62k / 141G
PPOAA           Forbid   ok      0 / 39G         ok      0 / 157G
LB              Allow    unseen  0 / 31G         unseen  0 / 176G
Agreed with key IBM Power designers/architects
C/C++11 Implementation Proof
And Its Consequences
Proof outline
Theorem: For any sane, non-optimising compiler following the mapping:
[Diagram: a DRF C/C++ prog gives, under C/C++11 semantics, a C/C++11 execution with its observations; compilation yields a POWER prog, which gives, under POWER semantics, a POWER execution with its observations]
Compilation:
  Preserves memory accesses;
  Uses the mapping table;
  Respects the thread-local semantics of C/C++, preserving dependencies
Proof:
  From POWER trace, build key relations (happens-before, SC order)
  Required properties from abs. machine properties
  If trace looks like it produces data race, build the C/C++ data race for contradiction
Building up happens-before (outline)
C11                                  Power correspondence
Base case: release-acquire           lwsync and isync
Transitive (multiple rel/acq)        Cumulativity of lwsync
Release-consume with dependencies    lwsync and dependencies
Special rules for CAS                coherence-point reasoning
...                                  ...
Using Proofs for Hardware Design
Previously, similar C11 proof for x86-TSO
  There, much simpler
What properties of Hardware were necessary?
Turns out: x86 Compare-and-Swap has strong properties
Weakening guarantees: Better implementation, just as good programming [PLDI13]
Using Proofs for Hardware Design (2)
Initially:
data = 0; flag = 0;
Thread 0
data = 1;
sync;
flag = 1;
Thread 1
while (flag == 0)
{};
atomically (flag = 2);
r1 = flag;
r = data + (r1 xor r1);
Finally: r = 0 ??
Is that Allowed? Observable?
C11/C++11 mapping would break (and no good way of fixing)
Fortunately, current hardware does not do this
. . . and now we know why future hardware should not
Conclusion
Reasoning about industrial-strength concurrency
Correct compilation of C/C++ concurrency primitives on Power
Confidence in both models
Compiler implementation relevance
Isolate relevant properties of h/w (Path to Hardware Design)
Reasoning about machine code at C/C++ level
Thank You!
More details at:
http://www.cl.cam.ac.uk/~pes20/cppppc
Understanding POWER Multiprocessors [PLDI11]
Clarifying and Compiling C/C++ Concurrency: From C++11 to POWER [POPL12]
Synchronising C/C++ and POWER [PLDI12]
Fast RMWs for TSO: Semantics and Implementation [PLDI13]
The ppcmem tool at:
http://www.cl.cam.ac.uk/~pes20/ppcmem
Model Excerpt
Propagate write to another thread
The storage subsystem can propagate a write w (by thread tid) that it has seen
to another thread tid', if:
the write has not yet been propagated to tid';
w is coherence-after any write to the same address that has already been
propagated to tid'; and
all barriers that were propagated to tid before w (in
s.events_propagated_to (tid)) have already been propagated to tid'.
Action: append w to s.events_propagated_to (tid').
Explanation: This rule advances the thread tid' view of the coherence
order to w, which is needed before tid' can read from w, and is also
needed before any barrier that is in tid's view after w (has w in its Group
A) can be propagated to tid'.
(* Precondition: here tid is the target thread (tid' in the prose above) *)
let write_announce_cand m s w tid =
  (w IN s.writes_seen) &&
  (tid IN s.threads) &&
  (not (List.mem (SWrite w) (s.events_propagated_to tid))) &&
  (forall (w' IN s.writes_seen).
     if List.mem (SWrite w') (s.events_propagated_to tid) && w'.w_addr = w.w_addr
     then (w',w) IN s.coherence
     else true) &&
  (forall (b IN barriers_seen s).
     if (ordered_before_in (s.events_propagated_to w.w_thread)
           (SBarrier b) (SWrite w))
     then List.mem (SBarrier b) (s.events_propagated_to tid) else true)

(* Action: append w to the events propagated to tid *)
let write_announce_action s w tid =
  let events_propagated_to = funupd s.events_propagated_to tid
      (add_event (s.events_propagated_to tid) (SWrite w)) in
  <| s with events_propagated_to = events_propagated_to |>
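For readers who do not read Lem, the rule can be paraphrased in C++. The sketch below is a loose, simplified rendering rather than the model itself: Event, Storage, and every function name are hypothetical, and the membership checks on s.writes_seen and s.threads are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Event {
    bool is_barrier;  // false for a write, true for a barrier
    int id;           // unique event identifier
    int thread;       // originating thread
    int addr;         // address written (ignored for barriers)
};

struct Storage {
    std::map<int, std::vector<Event>> propagated_to;  // per thread, in order
    std::set<std::pair<int, int>> coherence;          // (earlier id, later id)
};

static const std::vector<Event>& list_for(const Storage& s, int tid) {
    static const std::vector<Event> empty;
    auto it = s.propagated_to.find(tid);
    return it == s.propagated_to.end() ? empty : it->second;
}

static bool contains(const std::vector<Event>& v, const Event& e) {
    return std::any_of(v.begin(), v.end(),
                       [&](const Event& x) { return x.id == e.id; });
}

// Precondition: may write w be propagated to thread target?
bool can_propagate_write(const Storage& s, const Event& w, int target) {
    const std::vector<Event>& seen = list_for(s, target);
    if (contains(seen, w)) return false;  // already propagated to target
    // w must be coherence-after same-address writes already at target.
    for (const Event& e : seen) {
        if (!e.is_barrier && e.addr == w.addr &&
            s.coherence.count(std::make_pair(e.id, w.id)) == 0) {
            return false;
        }
    }
    // Barriers ordered before w at w's own thread must already be at target.
    const std::vector<Event>& own = list_for(s, w.thread);
    auto pos = std::find_if(own.begin(), own.end(),
                            [&](const Event& x) { return x.id == w.id; });
    if (pos != own.end()) {
        for (auto b = own.begin(); b != pos; ++b) {
            if (b->is_barrier && !contains(seen, *b)) return false;
        }
    }
    return true;
}

// Action: append w to the list of events propagated to target.
void propagate_write(Storage& s, const Event& w, int target) {
    s.propagated_to[target].push_back(w);
}
```

For example, if thread 0 performs a write, then a barrier, then a second write, the second write cannot reach thread 1 until the barrier has been propagated there, whereas the first write can be propagated immediately.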