Checking Concurrent Programs
CS3211 Parallel and Concurrent Programming
Increase the confidence in your concurrent programs
• My code works and I don’t know why
• No confidence that the code will always work
• Will making a change break my code?
• How to increase the confidence in your code
• (Classic testing methods)
• Sanitizers (in CS3211 lecture 6)
• Model checking – mathematically proving that the code is correct
• Many approaches out there
• Unfortunately, these approaches are not yet widely used in industry
CS3211 L12 - Model Checking 2
Building a model and checking it
Model checking of a formal specification
• Build the model using a special Domain Specific Language (DSL)
• Write a formal specification, usually done in a new language
• Then check the model
• Are all the constraints met?
• Does anything unexpected happen?
• Does it deadlock?
• Why spend time with this?
• Check things make sense before starting the (costly) implementation
• Prove certain properties for existing code
• Allow for aggressive optimizations without compromising correctness
aka “formal methods” approach
Formal Methods
Advantages
• Rigorous
• Verify all traces exhaustively
• Produce a system run that violates the requirement (a counterexample)
Disadvantages
• The specification used may be faulty
• Tedious in coming up with a complete specification
• Time consuming
Approaches to Model Checking
• Write a formal specification of the system and check it
• Use a model checker or proof assistant
• Various degrees of automation
• The checker usually checks the states exhaustively
• Use the specification to write the code
• Manually write code
• Automatically generate code from specification
• Add formal specification (invariants that should hold) in the code, as
comments
• Use a model checker to check the invariants
• Difficult to make the model checker understand the code
• Use some symbolic execution
• Limited functionality for concurrent code
Model checkers for concurrent programs
• TLA+ (TLC) - Temporal Logic of Actions+
• Focusses on temporal properties
• Good for modeling concurrent systems (and distributed systems)
• Coq Proof Assistant
• Can extract verified programs to OCaml, Haskell and Scheme
• Good for interactive proof methods
• Alloy (Alloy Analyzer)
• Focusses on relational logic
• Good for modeling structures
TLA+
• Proposed by Leslie Lamport in 1999
• Defines TLA+ as a "quixotic attempt to overcome engineers' antipathy
towards mathematics"
• High-level language for modeling programs and systems – especially
concurrent and distributed ones
• Based on the idea that the best way to describe things precisely is
with simple mathematics
• Approach
• A specification in TLA+ is written
• The specification is checked (verified) by a model checker that
exhaustively explores the states
• Manually write the code based on the TLA+ spec
How does it work?
• The model checker finds all possible system behaviours (states) up to
some number of execution steps
• Examines the states for violations of desired properties such as
safety and liveness
• TLA+ specifications use basic set theory to define safety (bad things
won't happen) and temporal logic to define liveness (good things
eventually happen)
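The exhaustive search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how TLC is implemented; the names `explore`, `next_states`, `invariant` and the toy counter are made up for this sketch.

```python
from collections import deque

def explore(initial_states, next_states, invariant):
    """Breadth-first search over every reachable state.

    Returns (reachable_states, violating_state); violating_state is None
    when the invariant holds in every reachable state.
    """
    seen = set(initial_states)
    queue = deque(initial_states)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return seen, state  # safety violation: a "bad thing" happened
        for successor in next_states(state):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen, None

# Hypothetical toy system: a counter that wraps around after 3.
def counter_next(x):
    return [(x + 1) % 4]

states, bad = explore([0], counter_next, lambda x: x < 4)
print(sorted(states), bad)  # [0, 1, 2, 3] None
```

A safety property is checked state by state, as here. A liveness property needs whole behaviours, not single states, so it cannot be checked with this loop alone.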
TLA+
• Temporal (time)
• Logic (Boolean logic) of
• Actions (state machines)
• Plus (some stuff)
Boolean Logic
• A predicate is an expression that returns a Boolean
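A minimal sketch of predicates, written in Python rather than TLA+. The names `is_positive` and `in_range` are invented for the illustration.

```python
def is_positive(x: int) -> bool:
    """A predicate: it only answers True or False, it changes nothing."""
    return x > 0

def in_range(x: int) -> bool:
    # Predicates combine with the usual Boolean operators (and, or, not).
    return is_positive(x) and x <= 3

print(is_positive(2), in_range(5))  # True False
```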
Actions – state machines
• State machines
• States
• Transitions
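States and transitions can be written down as a plain table. The sketch below assumes a heavily simplified turn-taking game; the state and event names are hypothetical and are not taken from the slides.

```python
# Transition table: state -> {event: next state}.
TRANSITIONS = {
    "white_to_play": {"move": "black_to_play", "checkmate": "game_over"},
    "black_to_play": {"move": "white_to_play", "checkmate": "game_over"},
    "game_over": {},  # terminal state: no outgoing transitions
}

def step(state: str, event: str) -> str:
    """Follow one transition; reject events the current state does not allow."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}") from None

state = "white_to_play"
for event in ["move", "move", "checkmate"]:
    state = step(state, event)
print(state)  # game_over
```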
State machine for playing chess
Formalizing the actions
The action is the transition
• This is a test, not an assignment!
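In TLA+ an action such as x' = x + 1 relates the current value of x to its next value x'. A rough Python analogue is a function that takes both values and returns a Boolean. The name `increment` is made up for this sketch.

```python
def increment(x: int, x_next: int) -> bool:
    """Action as a test on a (current, next) pair, like x' = x + 1 in TLA+."""
    return x_next == x + 1

print(increment(1, 2))  # True:  1 -> 2 is a step this action allows
print(increment(1, 3))  # False: 1 -> 3 is not
```

Nothing is assigned here: the function only says whether a proposed step is allowed.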
Actions are tests
Temporal – state transitions over time
• Infinite amount of time
• TLA+ can ask questions like:
• Is something always true?
• Is something ever true?
• If X happens, must Y happen afterwards?
Count to three
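A Python sketch of what a "count to three" specification could look like. The formulation is an assumption: an initial condition x = 1 and a next-state action allowing 1 -> 2 and 2 -> 3. The names `init`, `next_action`, `DOMAIN` and `behaviours` are hypothetical.

```python
def init(x):
    return x == 1

def next_action(x, x_next):
    # Either step 1 -> 2 or step 2 -> 3; both are tests on (x, x_next).
    return (x == 1 and x_next == 2) or (x == 2 and x_next == 3)

DOMAIN = range(0, 5)  # candidate values to try, a little wider than 1..3

def behaviours():
    """Enumerate every sequence of states allowed by init and next_action."""
    traces = [[x] for x in DOMAIN if init(x)]
    finished = []
    while traces:
        trace = traces.pop()
        successors = [y for y in DOMAIN if next_action(trace[-1], y)]
        if not successors:
            finished.append(trace)  # no further step is possible
        for y in successors:
            traces.append(trace + [y])
    return finished

print(behaviours())  # [[1, 2, 3]]
```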
Count to three, refactored
TLA+ toolbox: IDE and checker
Deadlock in TLA+
• Infinite time in TLA+
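Because time is infinite, a reachable state with no possible next step counts as a deadlock. A minimal sketch, assuming the count-to-three transitions 1 -> 2 -> 3; the helper names are invented.

```python
def successors(x):
    # Count to three: 1 -> 2 -> 3, and nothing is defined after 3.
    return {1: [2], 2: [3]}.get(x, [])

def find_deadlocks(initial):
    """Return reachable states that have no outgoing transition."""
    seen, stack, dead = {initial}, [initial], []
    while stack:
        state = stack.pop()
        nxt = successors(state)
        if not nxt:
            dead.append(state)
        for s in nxt:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return dead

print(find_deadlocks(1))  # [3]
```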
Count to three, updated
Doing nothing is always an option!
Count to three, with stuttering
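A stuttering step leaves the state unchanged (x' = x). A small sketch of the effect, again assuming the transitions 1 -> 2 -> 3; `successors_with_stuttering` is a hypothetical helper.

```python
def successors_with_stuttering(x):
    steps = {1: [2], 2: [3]}.get(x, [])
    return steps + [x]  # doing nothing (x' = x) is always allowed

for x in (1, 2, 3):
    print(x, successors_with_stuttering(x))
# 1 [2, 1]
# 2 [3, 2]
# 3 [3]
```

Every state now has at least one successor, so state 3 is no longer a dead end. The price is that staying in 1 or 2 forever has also become a legal behaviour.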
The power of temporal properties
• A property applies to the whole system over time
• Not just to individual states
• Checking these properties is important
• Humans are bad at this
• Programming languages are bad at this too
• TLA+ can help with this!
Properties in TLA+
• Always true
• In every state, x>0
• Eventually true
• At some point in time, x=2
• Eventually always
• Eventually becomes true and stays true from then on
• x eventually becomes 3 and then stays there
• Leads to
• If x ever becomes 2, then it will become 3 later
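The four kinds of property can be evaluated on an infinite behaviour if it is written as a finite prefix followed by a cycle that repeats forever. The sketch below assumes that representation; the function names are made up and this is not how TLC evaluates temporal formulas.

```python
def always(p, prefix, cycle):
    return all(p(s) for s in prefix + cycle)

def eventually(p, prefix, cycle):
    return any(p(s) for s in prefix + cycle)

def eventually_always(p, prefix, cycle):
    # After the prefix only the cycle states occur, each infinitely often.
    return all(p(s) for s in cycle)

def leads_to(p, q, prefix, cycle):
    q_in_cycle = any(q(s) for s in cycle)
    for i, s in enumerate(prefix):
        # p holds at position i: q must hold at i or somewhere later.
        if p(s) and not (q_in_cycle or any(q(t) for t in prefix[i:])):
            return False
    # A cycle state satisfying p recurs forever, so q must be in the cycle.
    return q_in_cycle or not any(p(s) for s in cycle)

good = ([1, 2], [3])   # 1, 2, 3, 3, 3, ...
stuck = ([], [1])      # 1, 1, 1, ...

print(always(lambda x: x > 0, *good))               # True
print(eventually(lambda x: x == 2, *good))          # True
print(eventually_always(lambda x: x == 3, *good))   # True
print(eventually_always(lambda x: x == 3, *stuck))  # False
```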
Properties for “count to three”
Adding properties to the script
Oh no! Model checker says we have errors!
Stuttering caused a loop!
Fixing the error
• Make sure every possible transition is followed
• Don’t get stuck in an infinite loop
Add fairness! TLA+ can model this
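A rough sketch of what weak fairness rules out, assuming the count-to-three transitions with stuttering. Ignoring how long each state repeats, every behaviour ends up stuttering forever in some state. The names `BEHAVIOURS`, `weakly_fair` and `reaches_three` are hypothetical.

```python
def real_steps(x):
    # The "interesting" transitions of count to three (no stuttering).
    return {1: [2], 2: [3]}.get(x, [])

# Each behaviour is (states visited before, state it stays in forever).
BEHAVIOURS = [([], 1), ([1], 2), ([1, 2], 3)]

def weakly_fair(behaviour):
    """Reject behaviours that stutter forever while a real step stays enabled."""
    _, final_state = behaviour
    return not real_steps(final_state)

def reaches_three(behaviour):
    visited, final_state = behaviour
    return 3 in visited or final_state == 3

print(all(reaches_three(b) for b in BEHAVIOURS))                    # False
print(all(reaches_three(b) for b in BEHAVIOURS if weakly_fair(b)))  # True
```

Without fairness the property "x eventually becomes 3" fails because of the behaviours stuck at 1 and 2. With fairness only the behaviour that reaches 3 remains.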
The complete spec with fairness
A more complicated example
• Very exciting: we can count to three!
• What about a more complicated problem?
• What about concurrency?
• Property checking is where TLA+ is powerful and it can help
Producer/consumer problem
Producer:
• Check if queue is not full
• If true, then write item to queue
Consumer:
• Check if queue is not empty
• If true, then read item from queue
Embedded concurrency!
Temporal properties for the producer/consumer
• 8 states, no errors
• BUT only for 1 producer and 1 consumer!
Concurrent version with multiple producers/consumers
• Use the Plus in TLA+
• We need
• A set of producers
• A set of consumers
• Use the set-description part of TLA+
Plus… set theory!
Running the script
• Run the model checker with 2 producers and 2 consumers
• Use the AlwaysWithinBounds property
• There are 38 states
• Error: Invariant AlwaysWithinBounds is violated!
• The design does not work
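The kind of error reported here can be reproduced with a small Python model. This is a sketch under assumptions, not the TLA+ spec from the slides: the queue is reduced to its length, the capacity `MAX_QUEUE` is assumed to be 1, and each process does "check" and "act" as two separate steps. The number of states explored will therefore differ from the counts on the slides.

```python
from collections import deque

MAX_QUEUE = 1  # assumed capacity for this sketch

def successors(state, n_producers):
    """state = (queue_len, pcs); pcs[i] is 'check' or 'act' for process i.
    Processes 0..n_producers-1 are producers, the rest are consumers."""
    q, pcs = state
    for i, pc in enumerate(pcs):
        producer = i < n_producers
        if pc == "check":
            ok = q < MAX_QUEUE if producer else q > 0
            if ok:  # the check passed; the queue is touched in a later step
                yield q, pcs[:i] + ("act",) + pcs[i + 1:]
        else:
            new_q = q + 1 if producer else q - 1
            yield new_q, pcs[:i] + ("check",) + pcs[i + 1:]

def check(n_producers, n_consumers):
    """Return (number of states seen, first state violating the bounds)."""
    init = (0, ("check",) * (n_producers + n_consumers))
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        if not 0 <= state[0] <= MAX_QUEUE:  # the bounds invariant
            return len(seen), state
        for nxt in successors(state, n_producers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen), None

print(check(1, 1)[1])  # None: one producer and one consumer stay in bounds
print(check(2, 2)[1])  # a state with queue length 2
```

The violation comes from two producers both passing the "not full" check before either of them writes.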
Fixing the error
• TLA+ won’t tell you how to fix the error
• You must fix the spec
• Easy to test the fixes
• Update the spec to use atomic operations (or locks)
• Re-run the model checker!
• You gain confidence in your design
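A sketch of the fixed design in Python, assuming the queue is modelled by its length with capacity 1. Check and update now happen in a single step, as if guarded by a lock, so processes have no intermediate state to model. The names `successors` and `within_bounds` are invented for the illustration.

```python
from collections import deque

MAX_QUEUE = 1  # assumed capacity for this sketch

def successors(q, n_producers, n_consumers):
    """Check and update happen in ONE step, as if guarded by a lock."""
    if n_producers and q < MAX_QUEUE:
        yield q + 1  # some producer checks and writes atomically
    if n_consumers and q > 0:
        yield q - 1  # some consumer checks and reads atomically

def within_bounds(n_producers, n_consumers):
    """Explore every reachable queue length and test the bounds invariant."""
    seen, queue = {0}, deque([0])
    while queue:
        q = queue.popleft()
        if not 0 <= q <= MAX_QUEUE:
            return False
        for nxt in successors(q, n_producers, n_consumers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

print(within_bounds(2, 2))  # True
```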
The power of TLA+
• TLA+ can be used to model large concurrent systems
• Such as distributed systems!
• Examples where TLA+ can help:
• https://hillelwayne.com/modeling-deployments/
• https://hillelwayne.com/talks/distributed-systems-tlaplus/
• Learn more: https://learntla.com/index.html
Why are concurrent programs important?
• They are everywhere nowadays because we all use distributed
systems
• Distributed systems use the most complex programs
• systems that span the world
• serve millions of users
• and are always available
• Incredibly relevant today as everything is a distributed system!
Definition of Distributed Systems
• A distributed system is a system where multiple processes located on
networked computers communicate via messages to achieve a
common goal
• "A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable."
(Leslie Lamport)
• Examples: client-server applications, MapReduce, grid computing,
peer-to-peer networks, Skype, cloud computing, email clients, music
streaming, FTP connections, Hadoop, web service compositions, video
streaming, etc.
Major Challenges (1)
• No global clock, so no global ordering of events
• Events happen at different times
• Different interleavings of events are possible
• The participants may observe the events interleaved in different ways
• Use a logical clock (happens-before)
~ sounds like a memory model is needed
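A minimal sketch of a Lamport logical clock, one common way to obtain a happens-before ordering without a global clock. The class and method names are made up for this illustration.

```python
class Process:
    """A process with a Lamport logical clock (a single counter)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event; the timestamp travels with the message.
        self.clock += 1
        return self.clock

    def receive(self, timestamp):
        # Jump past the sender's timestamp so that send happens-before receive.
        self.clock = max(self.clock, timestamp) + 1
        return self.clock

a, b = Process("A"), Process("B")
a.local_event()              # A: 1
stamp = a.send()             # A: 2
b.local_event()              # B: 1
received = b.receive(stamp)  # B: max(1, 2) + 1 = 3
print(stamp, received)       # 2 3
```

If one event happens before another, its timestamp is smaller. The converse does not hold: a smaller timestamp alone does not prove that one event happened before the other.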
Major Challenges (2)
Resilience – Consistency – Consensus
Possible errors due to the concurrent nature of the system:
• Deadlocks
• Livelock/starvation
• Lack of consensus
~ knowing classical synchronization problems might
help you solve some particular situation
2015: Formal Methods at AWS *
• Precise description of the system in TLA+ (PlusCal language, similar to C)
• In 6 large, complex real-world systems
• 7 teams
• Found subtle bugs
• Confidence to make aggressive optimizations without sacrificing
correctness
• Use formal specification to teach system to new engineers
* How Amazon Web Services Uses Formal Methods by Chris Newcombe et al. (Communications of the
ACM, 2015)
2015: Formal Methods at AWS
2021: Using Lightweight Formal Methods to Validate a Key-Value
Storage Node in Amazon S3
• S3’s new ShardStore storage node
• Built in Rust
• Crash consistency, concurrency, I/O, etc.
• Specs alongside the code
• Reference model spec
• Decompose correctness checks
• Sequential correctness
• Crashes
• Concurrency
• Accept weaker correctness guarantees than full formal verification
• Adding continuous validation
* Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 by James
Bornholt et al., SOSP2021. https://www.youtube.com/watch?v=YdxvOPenjWI
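The idea of checking an implementation against a reference model can be sketched as follows. Everything here is hypothetical and written in Python rather than Rust: `AppendLogStore` stands in for an implementation, a plain dict plays the reference model, and random operation sequences are compared against it. It only illustrates the sequential-correctness check, not crashes or concurrency.

```python
import random

class AppendLogStore:
    """Hypothetical implementation: a key-value store kept as an append-only log."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))  # tombstone marks the key as deleted

    def get(self, key):
        for k, v in reversed(self.log):  # the newest entry wins
            if k == key:
                return v
        return None

def check_against_reference(seed, steps=200):
    """Apply random operations to both stores and compare every read."""
    rng = random.Random(seed)
    store, model = AppendLogStore(), {}
    for _ in range(steps):
        key = rng.choice("abc")
        op = rng.choice(["put", "delete", "get"])
        if op == "put":
            value = rng.randint(0, 9)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            assert store.get(key) == model.get(key), (seed, key)
    return True

print(all(check_against_reference(seed) for seed in range(20)))  # True
```

Random testing of this kind gives weaker guarantees than a proof, but it is cheap to run on every change.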
Conclusion
• Formal verification (model checking) brings guarantees and allows us
to check properties
• Formally checking concurrent code is here to stay
• More engineers will need to write formal specs for their code
• Industry is adopting model checkers, especially for newly
developed systems
• References:
• https://www.youtube.com/watch?v=tqwcz-Yt9gQ