Indian Institute of Science (IISc), Bangalore, India
Memory consistency models and synchronization
Part 1
www.csa.iisc.ac.in

Acknowledgements
Several of the slides in this deck are from Luis Ceze (Washington), Nima Honarmand (Stony Brook), Mark Hill, David Wood, Karu Sankaralingam (Wisconsin), and Abhishek Bhattacharjee (Rutgers).
Development of this course is partially supported by Western Digital Corporation.
8/9/2018 2

Coherence vs. Consistency
Initially: A = 0, FLAG = 0
Core 0 (C0)      Core 1 (C1)
ST A 1;          L1: LD r1 FLAG;
ST FLAG 1;       if (r1 == 0) JMP L1;  // spin until FLAG is set
                 LD r2 A;
What value should r2 contain? (We expect r2 = 1.)
But what does coherence say? Nothing.
Coherence is about the total ordering of stores and loads to one given memory location (in practice, a single cache block).
It says nothing about ordering across different memory locations/addresses.
But correctly specifying and writing a multi-threaded program requires guarantees about the ordering of loads/stores to different locations.

Memory consistency models
A memory consistency model specifies the global ordering of writes to all memory locations relative to each other
A memory consistency model defines which reorderings of loads/stores to different memory locations are legal
Different memory consistency models are possible
Tradeoff between ease of programming and performance
“Relaxed” models enforce less ordering (better performance) but are harder to program
The memory consistency model is part of the ISA
x86 and ARM have different memory consistency models
Cache coherence protocols are not part of the ISA: software only needs to know whether or not the hardware supports cache coherence

Memory consistency models
Specifies which reorderings of loads and stores (to different addresses) are allowed
Initially: A = 0, FLAG = 0
Core 0 (C0)      Core 1 (C1)
ST A 1;          L1: LD r1 FLAG;
ST FLAG 1;       if (r1 == 0) JMP L1;  // spin until FLAG is set
                 LD r2 A;
For example, this program works as expected if loads and stores are not allowed to bypass one another within a given thread, even when they are to different addresses
Parallel programmers rely on the memory model to reason about the correctness of their programs

Who defines memory consistency models?
Any programmer of parallel shared-memory programs should care about memory consistency models
Defined by the hardware and described in the ISA
Programming languages also need to define their own memory consistency models
‣ Determines what optimizations are allowed

Many possible memory consistency models
Sequential consistency
‣ Most intuitive, programmer friendly
‣ But most restrictive in terms of performance optimizations
‣ e.g., implemented in MIPS R10K
Total store order (processor consistency model)
‣ Less restrictive than sequential consistency
‣ Intel and AMD x86-64 processors
‣ Sun UltraSPARC
Weak memory models (release consistency models)
‣ Most relaxed; burdens the programmer but allows more hardware optimizations
‣ ARM, IBM POWER processors

Sequential consistency
[Figure: processors P1, P2, P3, ..., PN issue memory operations (e.g., st A, st C, ld C, ld A) to a single shared memory]
Per-processor program order: memory operations from individual processors maintain program order
Single sequential order (total order): memory operations from all processors maintain a single sequential order [Lamport’79]
Example program orders on two cores:
C1        C2
st A      st A
ld C      st C
st C      ld D
Two possible legal global orders:
Order 1: st A, st A, ld C, st C, st C, ld D
Order 2: st A, ld C, st A, st C, st C, ld D
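The two rules above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides, that enumerates every global order of the C1/C2 example that preserves each core's program order. The (core, operation) tagging and the function name legal_global_orders are assumptions introduced here so that the two "st A" operations can be told apart.

```python
from itertools import combinations

# Program orders from the example. Tagging each operation with its core is an
# assumption of this sketch, used only to distinguish identical operations.
C1 = [("C1", "st A"), ("C1", "ld C"), ("C1", "st C")]
C2 = [("C2", "st A"), ("C2", "st C"), ("C2", "ld D")]

def legal_global_orders(p, q):
    """Yield every interleaving of p and q that keeps both program orders."""
    n = len(p) + len(q)
    # Choose which positions of the global order are taken by operations of p.
    for slots in combinations(range(n), len(p)):
        order, pi, qi = [], iter(p), iter(q)
        for i in range(n):
            order.append(next(pi) if i in slots else next(qi))
        yield order

orders = list(legal_global_orders(C1, C2))
print(len(orders))  # 20 legal global orders for 3 + 3 operations
for order in orders[:2]:
    print(", ".join(f"{core}:{op}" for core, op in order))
```

Every order produced this way is a legal SC execution; any order that swaps two operations of the same core is not.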

Sequential consistency (SC)
Initially: A = 0, FLAG = 0
Core 0 (C0)      Core 1 (C1)
ST A 1;          L1: LD r1 FLAG;
ST FLAG 1;       if (r1 == 0) JMP L1;  // spin until FLAG is set
                 LD r2 A;
What value will be loaded into r2 under SC?
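One way to explore this question is to walk every SC interleaving of the two cores. The sketch below is an illustration written for these notes, assuming one instruction executes atomically per step; the state encoding and the function name sc_final_r2_values are hypothetical.

```python
def sc_final_r2_values():
    """Explore every SC interleaving of the flag example; return all final r2 values."""
    # State: (pc0, pc1, A, FLAG, r1, r2). One step runs one instruction of one core.
    start = (0, 0, 0, 0, None, None)
    seen, stack, finals = {start}, [start], set()
    while stack:
        pc0, pc1, A, FLAG, r1, r2 = stack.pop()
        succ = []
        # Core 0: ST A 1; ST FLAG 1
        if pc0 == 0:
            succ.append((1, pc1, 1, FLAG, r1, r2))
        elif pc0 == 1:
            succ.append((2, pc1, A, 1, r1, r2))
        # Core 1: L1: LD r1 FLAG; if (r1 == 0) JMP L1; LD r2 A
        if pc1 == 0:
            succ.append((pc0, 1, A, FLAG, FLAG, r2))
        elif pc1 == 1:
            succ.append((pc0, 0 if r1 == 0 else 2, A, FLAG, r1, r2))
        elif pc1 == 2:
            succ.append((pc0, 3, A, FLAG, r1, A))
        if pc0 == 2 and pc1 == 3:
            finals.add(r2)  # both cores have finished
        for s in succ:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals

print(sc_final_r2_values())  # {1}
```

Under this model Core 1 can leave the loop only after it has observed FLAG = 1, and program order on Core 0 puts ST A before ST FLAG, so r2 = 1 is the only outcome.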

Reordering of load/stores
Can a later operation (in program order) be reordered before an earlier one?
                   Earlier operation
Later operation    LD     ST
LD                 NO     NO
ST                 NO     NO
Allowed reordering of loads/stores to different addresses under sequential consistency.
No reordering of loads/stores to the same address: this is ensured by coherence.

Food for thought (assume SC)
• Answer the following questions:
• Initially: all variables are zero (that is, x is 0, y is 0, flag is 0, A is 0)
• What value pairs can be read by the two loads? (x, y) pairs:
Core 0      Core 1
LD x        ST y 1
LD y        ST x 1
How about (1,0)?
• What value pairs can be read by the two loads? (x, y) pairs:
Core 0      Core 1
ST y 1      ST x 1
LD x        LD y
How about (0,0)?
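Both questions can be answered by brute force: enumerate every interleaving that respects program order and record the loaded values. The Python sketch below is illustrative only; the instruction tuples and the function name sc_outcomes are assumptions of this sketch, not notation from the slides.

```python
from itertools import combinations

def sc_outcomes(core0, core1):
    """Return every (x, y) pair the loads can observe under SC.

    Each instruction is ("ST", var, value) or ("LD", var).
    """
    results = set()
    n = len(core0) + len(core1)
    for slots in combinations(range(n), len(core0)):
        mem = {"x": 0, "y": 0}  # initially all variables are zero
        loaded = {}
        it0, it1 = iter(core0), iter(core1)
        for i in range(n):
            ins = next(it0) if i in slots else next(it1)
            if ins[0] == "ST":
                mem[ins[1]] = ins[2]
            else:
                loaded[ins[1]] = mem[ins[1]]
        results.add((loaded["x"], loaded["y"]))
    return results

# First question: Core 0 loads x then y, Core 1 stores y then x.
test1 = sc_outcomes([("LD", "x"), ("LD", "y")], [("ST", "y", 1), ("ST", "x", 1)])
# Second question: each core stores one variable, then loads the other.
test2 = sc_outcomes([("ST", "y", 1), ("LD", "x")], [("ST", "x", 1), ("LD", "y")])
print(sorted(test1))  # [(0, 0), (0, 1), (1, 1)]
print(sorted(test2))  # [(0, 1), (1, 0), (1, 1)]
```

In this model (1,0) never appears for the first program and (0,0) never appears for the second, which is what SC requires.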

Problems with SC memory model
Difficult to implement efficiently in hardware
‣ Straightforward implementations of SC dictate:
• Strict ordering of memory accesses at each processor
• Essentially precludes most out-of-order CPU benefits
→ Conflicts with common latency-hiding techniques
Constrains compiler optimizations
‣ Restricts code motion and common subexpression elimination across memory accesses
Implementations of SC that try to extract concurrency among accesses are complex
‣ e.g., MIPS R10K
No commercial processors implement SC today

Constraints of SC: Write buffer
Why have a write buffer (store buffer)?
Can the existence of a write buffer break SC?
[Figure: a core connected to its L1 cache through a write buffer (WB)]

Write buffer breaks SC
[Figure: Core 0 and Core 1, each with a write buffer (WB) and an L1 cache, above a coherent LLC]
Core 0      Core 1
ST y 1      ST x 1
LD x        LD y
Sequence shown in the animation:
1. Core 0 issues ST y 1; the store waits in Core 0's write buffer (WB: y 1).
2. Core 1 issues ST x 1; the store waits in Core 1's write buffer (WB: x 1).
3. Core 0 issues LD x, which reads x = 0 from its L1 cache.
4. Core 1 issues LD y, which reads y = 0 from its L1 cache.
Result: x = 0, y = 0, which is NOT allowed by SC!
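The same sequence can be replayed in a few lines of Python. This is a toy model written for these notes: the Core class, its method names, and the single shared dictionary standing in for the coherent cache hierarchy are all assumptions of the sketch.

```python
from collections import deque

class Core:
    """A core with a private FIFO write buffer in front of a shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.wb = deque()

    def store(self, var, value):
        self.wb.append((var, value))  # buffered; not yet visible to other cores

    def load(self, var):
        # Assumption: as in the animation, each load targets a different address
        # than the buffered store, so it goes straight to memory.
        return self.memory[var]

    def drain(self):
        while self.wb:
            var, value = self.wb.popleft()
            self.memory[var] = value

memory = {"x": 0, "y": 0}  # stands in for the coherent L1/LLC hierarchy
core0, core1 = Core(memory), Core(memory)

core0.store("y", 1)  # ST y 1 waits in Core 0's write buffer
core1.store("x", 1)  # ST x 1 waits in Core 1's write buffer
x = core0.load("x")  # LD x bypasses the buffered store
y = core1.load("y")  # LD y bypasses the buffered store
core0.drain()
core1.drain()        # the stores become visible only now

print(x, y)    # 0 0
print(memory)  # {'x': 1, 'y': 1}
```

Both stores eventually reach memory, yet both loads returned 0, an outcome that no SC interleaving of the two programs can produce.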

Alternative memory model
Total store order (TSO) memory model
‣ Implemented by Intel, AMD and Sun/Oracle's SPARC processors
‣ It is sometimes called the processor consistency model
Can a later operation (in program order) be reordered before an earlier one?
                   Earlier operation
Later operation    LD     ST
LD                 NO     YES
ST                 NO     NO
Allowed reordering of loads/stores under TSO
Remember that reordering is allowed only if the ST and LD are to different addresses.

TSO vs. SC
What performance optimization would TSO allow?
‣ (That is not allowed by SC)
‣ A FIFO write buffer
• Still need to maintain store-to-store order
What is the disadvantage of TSO?
‣ Some programs that are correct under SC will break

What breaks under TSO?
Initially: A = 0, FLAG = 0
Core 0 (C0)      Core 1 (C1)
ST A 1;          L1: LD r1 FLAG;
ST FLAG 1;       if (r1 == 0) JMP L1;  // spin until FLAG is set
                 LD r2 A;
This will work as expected
TSO only allows a later load to bypass earlier stores to a different address

What breaks under TSO?
Core 0      Core 1
ST y 1      ST x 1
LD x        LD y
Is x = 0, y = 0 possible?
Yes, because a later load to a different address can bypass an earlier store

What if the programmer wants SC-like ordering?
Special fence instructions to explicitly introduce ordering
‣ Example: the mfence instruction in x86/x86-64
‣ Programmer needs to insert them manually
Can a later operation (in program order) be reordered before an earlier one?
                   Earlier operation
Later operation    LD     ST     mfence
LD                 NO     YES    NO
ST                 NO     NO     NO
mfence             NO     NO     NO

Fences to order load/stores
Core 0      Core 1
ST y 1      ST x 1
mfence      mfence
LD x        LD y
x = 0, y = 0 is no longer possible: the execution is SC compliant
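The effect of the fence can be checked on a small operational model of TSO. The Python sketch below is an illustration under stated assumptions, not a description of real x86 hardware: each core has a FIFO write buffer, a load reads the newest matching entry of its own buffer or else memory, and MFENCE can execute only when the issuing core's buffer is empty. The function name tso_outcomes and the instruction encoding are hypothetical.

```python
def tso_outcomes(programs, variables=("x", "y")):
    """Explore every execution of a toy TSO model; return the set of final register states.

    Instructions: ("ST", var, value), ("LD", var, reg), ("MFENCE",).
    """
    n = len(programs)
    start = ((0,) * n, ((),) * n, tuple(sorted((v, 0) for v in variables)), ())
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, bufs, mem_t, regs_t = stack.pop()
        mem, regs = dict(mem_t), dict(regs_t)
        if all(pcs[c] == len(programs[c]) for c in range(n)) and not any(bufs):
            finals.add(regs_t)
        succ = []
        for c in range(n):
            # Option 1: the oldest buffered store of core c drains to memory.
            if bufs[c]:
                var, val = bufs[c][0]
                m2 = dict(mem)
                m2[var] = val
                b2 = bufs[:c] + (bufs[c][1:],) + bufs[c + 1:]
                succ.append((pcs, b2, tuple(sorted(m2.items())), regs_t))
            # Option 2: core c executes its next instruction.
            if pcs[c] < len(programs[c]):
                ins = programs[c][pcs[c]]
                p2 = pcs[:c] + (pcs[c] + 1,) + pcs[c + 1:]
                if ins[0] == "ST":
                    b2 = bufs[:c] + (bufs[c] + ((ins[1], ins[2]),),) + bufs[c + 1:]
                    succ.append((p2, b2, mem_t, regs_t))
                elif ins[0] == "LD":
                    own = [val for var, val in bufs[c] if var == ins[1]]
                    r2 = dict(regs)
                    r2[ins[2]] = own[-1] if own else mem[ins[1]]
                    succ.append((p2, bufs, mem_t, tuple(sorted(r2.items()))))
                elif ins[0] == "MFENCE" and not bufs[c]:
                    succ.append((p2, bufs, mem_t, regs_t))  # waits until the buffer is empty
        for s in succ:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals

plain = [[("ST", "y", 1), ("LD", "x", "x")],
         [("ST", "x", 1), ("LD", "y", "y")]]
fenced = [[("ST", "y", 1), ("MFENCE",), ("LD", "x", "x")],
          [("ST", "x", 1), ("MFENCE",), ("LD", "y", "y")]]
zero = (("x", 0), ("y", 0))
print(zero in tso_outcomes(plain))   # True
print(zero in tso_outcomes(fenced))  # False
```

In this model the fence forces each store to reach memory before the following load executes, which removes the x = 0, y = 0 outcome.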

Fences to order load/stores
Core 0      Core 1
ST y 1      ST x 1
LD y        LD x
LD x        LD y
Is x = 0, y = 0 still possible?

Store atomicity
A memory consistency model supports store atomicity iff all cores see the stores in the same order
TSO as implemented in x86-64 does not guarantee store atomicity
‣ A core can “see” its own store early

No store atomicity in x86-64
[Figure: Core 0 and Core 1, each with a write buffer (WB) and an L1 cache, above a coherent LLC]
Core 0      Core 1
ST y 1      ST x 1
LD y        LD x
LD x        LD y
Sequence shown in the animation:
1. ST y 1 waits in Core 0's write buffer and ST x 1 waits in Core 1's write buffer.
2. Core 0's LD y reads y = 1 from its own write buffer and Core 1's LD x reads x = 1 from its own write buffer (labelled "load-to-store bypass" in the figure).
3. Core 0's LD x reads x = 0 and Core 1's LD y reads y = 0, because the other core's store has not yet left its write buffer.
Result: x = 0, y = 0, which is NOT allowed by SC!

Store atomicity
TSO as implemented in the IBM 370 guaranteed store atomicity
‣ A load can see the bypassed value during execution, but it stalls until the store before it makes it to the cache
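The difference between the two TSO flavours can be illustrated on one fixed interleaving. The Python sketch below is a toy model written for these notes: the Core class, the forward flag, and the choice to drain the whole write buffer when a load stalls are simplifying assumptions, not a description of either machine's actual pipeline.

```python
from collections import deque

class Core:
    """A core with a FIFO write buffer; `forward` selects the load behaviour."""

    def __init__(self, memory, forward):
        self.memory, self.wb, self.forward = memory, deque(), forward

    def store(self, var, value):
        self.wb.append((var, value))

    def load(self, var):
        pending = [v for name, v in self.wb if name == var]
        if pending and self.forward:
            return pending[-1]  # x86-64 style: the core sees its own store early
        if pending:
            self.drain()        # IBM 370 style: stall until the store reaches the cache
        return self.memory[var]

    def drain(self):
        while self.wb:
            name, v = self.wb.popleft()
            self.memory[name] = v

def run(forward):
    memory = {"x": 0, "y": 0}
    c0, c1 = Core(memory, forward), Core(memory, forward)
    c0.store("y", 1)
    c1.store("x", 1)
    c0.load("y")                       # Core 0 reads its own store
    c1.load("x")                       # Core 1 reads its own store
    return c0.load("x"), c1.load("y")  # (x seen by Core 0, y seen by Core 1)

print(run(forward=True))   # (0, 0)
print(run(forward=False))  # (1, 1)
```

This replays a single interleaving rather than all of them: with early forwarding both cores miss the other's store, while with the stall each store is in memory before the second load on either core executes.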