This is the artifact for the following paper:
Kyeongmin Cho, Seungmin Jeon, Azalea Raad, Jeehoon Kang. Memento: A Framework for Detectable Recoverability in Persistent Memory. PLDI 2023.
- In §2, we describe how to design programs that are deterministically replayed after a crash. We do so using two primitive operations, detectably recoverable checkpoint and CAS, by composing them with the usual control constructs such as sequential composition, conditionals, and loops (see the sketch after this list).
- In §3, we design a core language for persistent programming and its associated type system for deterministic replay, and prove that well-typed programs are detectably recoverable.
- In §4, we present an implementation of our core language in the Intel-x86 Optane DCPMM architecture. Our construction is not tightly coupled with Intel-x86, and we believe that our implementation can be straightforwardly adapted to other PM architectures.
- In §5, we adapt several volatile, lock-free data structures (DSs) to satisfy our type system, automatically deriving detectable, persistent lock-free DSs. These include a detectable, persistent linked-list [Harris 2001], Treiber stack [Treiber 1986], Michael-Scott queue [Michael and Scott 1996], a combining queue, and Clevel hash table [Chen et al. 2020]. In doing so, we capture the optimizations of hand-tuned persistent lock-free DSs with additional primitives and type derivation rules (§B and §C), and support safe memory reclamation even in the presence of crashes (§D).
- In §6, we evaluate the detectability and performance of our CAS and automatically derived persistent DSs. They recover from random thread crashes in stress tests (§6.1) and perform comparably with existing persistent DSs with and without detectability (§6.2).
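A minimal, volatile toy sketch of the §2 idea is below. It runs in DRAM with ordinary atomics, and all names (`Checkpoint`, `DetectableCas`, `IncrMemento`, `incr`) are hypothetical illustrations rather than the crate's actual API; it only shows how recording each primitive's result in a memento turns re-running the same code into a deterministic replay instead of a repeated effect.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for a detectably recoverable checkpoint:
// the first run records the value; a replay returns the recorded value.
#[derive(Default)]
struct Checkpoint<T: Copy> {
    saved: Option<T>,
}

impl<T: Copy> Checkpoint<T> {
    fn checkpoint(&mut self, v: T) -> T {
        *self.saved.get_or_insert(v)
    }
}

// Hypothetical stand-in for a detectable CAS: it records whether its CAS
// already succeeded, so a replay reports the outcome instead of retrying.
#[derive(Default)]
struct DetectableCas {
    done: bool,
}

impl DetectableCas {
    fn cas(&mut self, loc: &AtomicUsize, old: usize, new: usize) -> bool {
        if self.done {
            return true;
        }
        let ok = loc
            .compare_exchange(old, new, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok();
        self.done = ok;
        ok
    }
}

// A counter increment composed from the two primitives with a loop:
// checkpoint the read, then CAS; retry on failure, as in volatile code.
#[derive(Default)]
struct IncrMemento {
    read: Checkpoint<usize>,
    cas: DetectableCas,
}

fn incr(cnt: &AtomicUsize, mmt: &mut IncrMemento) {
    loop {
        let cur = mmt.read.checkpoint(cnt.load(Ordering::SeqCst));
        if mmt.cas.cas(cnt, cur, cur + 1) {
            return;
        }
        mmt.read = Checkpoint::default(); // reset the checkpoint before retrying
    }
}

fn main() {
    let cnt = AtomicUsize::new(0);
    let mut mmt = IncrMemento::default();
    incr(&cnt, &mut mmt);
    // Re-running with the same memento models a post-crash replay:
    // the recorded results are reused, so no second increment happens.
    incr(&cnt, &mut mmt);
    assert_eq!(cnt.load(Ordering::SeqCst), 1);
    println!("counter = {} (incremented exactly once)", cnt.load(Ordering::SeqCst));
}
```

Calling `incr` a second time with the same memento models a post-crash replay: the checkpointed read and the recorded CAS outcome are reused, so the counter is incremented exactly once. The real framework additionally persists mementos in PM and adds the flushes and timestamp handling described in §4.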
The artifact provides the following components:
- Implementation of the Memento framework and its primitives (§4: src/pmem/ and src/ploc/)
- Implementation of several detectable, persistent DSs based on Memento (§5: src/ds/)
- Evaluation programs (correctness and performance) (§6: evaluation/)
- Full result data of the benchmarks (§6: evaluation_data/ in Zenodo)
You can either reuse a pre-built docker image memento-image.tar from our Zenodo archive or manually build the framework.
- Ubuntu 20.04 or later
- Intel® Optane™ Persistent Memory 100 Series (mounted at /mnt/pmem0)
- If persistent memory is not mounted, you can still perform a limited evaluation on DRAM.
You can reuse a pre-built docker image by loading memento-image.tar:
docker load -i memento-image.tar
docker run -it -v /mnt/pmem0:/mnt/pmem0 --cap-add=SYS_NICE memento # Assuming persistent memory is mounted at /mnt/pmem0
Here, the -v /mnt/pmem0:/mnt/pmem0 option is required only for the full evaluation, to share the mounted persistent memory area with the container. The --cap-add=SYS_NICE option is needed to evaluate performance with all used cores pinned to a single NUMA node.
You can re-build the docker image with `docker build -t memento .` (this may take more than 30 minutes).
- Rust toolchain (via rustup):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Additional dependencies for evaluation:
apt install build-essential clang python3-pip numactl \
    libpmemobj-dev libvmem-dev libgflags-dev \
    libpmemobj1 libpmemobj-cpp-dev \
    libatomic1 libnuma1 libvmmalloc1 libvmem1 libpmem1
pip3 install --user pandas matplotlib gitpython
To build our framework including detectable operations, DSs and SMR libraries:
git submodule update --init --recursive
(cd ext/pmdk-rs; git apply ../pmdk-rs.patch)
cargo build --release
If persistent memory is not mounted on your machine, add the no_persist feature flag as follows:
cargo build --release --features no_persist
This artifact aims to achieve the following goals:
- G1: Locating our framework's core concepts (§4,5,B,D) in the development
- G2: Reproducing the detectability evaluation (§6.1)
- G3: Reproducing the performance evaluation (§6.2)
- src/ploc/: persistent memory (PM) infrastructure and primitive operations (§4, §B)
- src/ds/: Memento-based persistent, detectable DSs supporting exactly-once semantics (§5)
- crossbeam-persistency/: safe memory reclamation scheme (§D)
- src/pmem/ll.rs: Low-level PM instructions (§4.1)
- src/pmem/pool.rs: PM pool manager and crash handler (§4.1)
- src/ploc/common.rs: Timestamp calibration (§4.1) and Checkpoint (§4.2)
- src/ploc/detectable_cas.rs: Atomic Pointer Location supporting Detectable CAS (§4.3)
- src/ploc/insert_delete.rs: Insertion and Deletion (§B in Appendix)
- src/ds/comb.rs: A Memento-based detectable combining operation. We convert the original PBComb to one using mementos to support multi-time detectability. (Comb-mmt)
- src/ds/list.rs: A Memento-based lock-free list that uses `DetectableCas` and `Checkpoint`, based on Harris' ordered linked list. (List-mmt)
- src/ds/treiber_stack.rs: A Memento-based lock-free stack that uses `DetectableCas` and `Checkpoint`, based on Treiber's stack. (TreiberS-mmt)
- src/ds/queue_general.rs: A Memento-based lock-free queue that uses `DetectableCas` and `Checkpoint`, based on the Michael-Scott queue. (MSQ-mmt-O0)
- src/ds/queue_lp.rs: A Memento-based lock-free queue that uses `Insert`, `Delete`, and `Checkpoint`. The difference from queue.rs is that this queue uses the general link-persist technique rather than exploiting a DS-specific invariant to issue fewer flushes when loading a shared pointer. (MSQ-mmt-O1)
- src/ds/queue_comb.rs: A Memento-based combining queue that uses the `Combining` operation. (CombQ-mmt)
- src/ds/clevel.rs: A Memento-based Clevel extensible hash table. We convert the original Clevel to one using mementos. (Clevel-mmt)
- src/ds/queue.rs: A Memento-based lock-free queue that uses `Insert`, `Delete`, and `Checkpoint`, based on the Michael-Scott queue. (MSQ-mmt-O2)
- crossbeam-persistency/crossbeam-epoch/src/guard.rs: "Flushing Location before Retirement"
- crossbeam-persistency/crossbeam-epoch/src/internal.rs: "Allowing Double Retirement"
We evaluate detectability under thread crashes by randomly crashing an arbitrary thread while running the integration test. To crash a specific thread, we use the tgkill system call to send the SIGUSR1 signal to the thread and let its signal handler abort the thread's execution.
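Below is a minimal sketch of this mechanism, not the artifact's actual harness (that lives under evaluation/correctness/tcrash). It assumes the `libc` crate: a worker thread installs a SIGUSR1 handler and the main thread "crashes" it with tgkill, which delivers the signal to exactly that thread. In this sketch the handler terminates only the signaled thread via the raw exit syscall, standing in for aborting its execution, so the rest of the process keeps running.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::{thread, time::Duration};

// The worker publishes its kernel tid here so it can be targeted with tgkill.
static VICTIM_TID: AtomicI32 = AtomicI32::new(0);

extern "C" fn on_sigusr1(_sig: libc::c_int) {
    // Terminate only the signaled thread (raw exit syscall, not exit_group),
    // so the other threads of the process keep running.
    unsafe { libc::syscall(libc::SYS_exit, 0 as libc::c_long) };
}

fn main() {
    unsafe {
        // Install the SIGUSR1 handler (signal dispositions are process-wide).
        let handler = on_sigusr1 as extern "C" fn(libc::c_int);
        libc::signal(libc::SIGUSR1, handler as libc::sighandler_t);
    }

    thread::spawn(|| {
        let tid = unsafe { libc::syscall(libc::SYS_gettid) } as i32;
        VICTIM_TID.store(tid, Ordering::SeqCst);
        loop {
            std::hint::spin_loop(); // stands in for running the integration test
        }
    });

    thread::sleep(Duration::from_millis(100)); // let the worker publish its tid
    let tid = VICTIM_TID.load(Ordering::SeqCst);
    unsafe {
        // Crash exactly one thread of this process.
        libc::syscall(
            libc::SYS_tgkill,
            libc::getpid() as libc::c_long,
            tid as libc::c_long,
            libc::SIGUSR1 as libc::c_long,
        );
    }
    thread::sleep(Duration::from_millis(100));
    println!("worker thread crashed; the process is still alive");
}
```

The artifact's run.sh script repeats this kind of crash at random points while the integration test runs and checks that the test still passes.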
cd evaluation/correctness/tcrash
./build.sh # special build for the thread crash test
You can test each DS with the following command:
./run.sh [tested DS]
where [tested DS] should be replaced with one of the supported tests (listed below).
For example, the following command repeatedly checks that the test of MSQ-mmt-O0 in the paper always passes in case of an unexpected thread crash:
./run.sh queue_general
Then the output is printed like below:
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 1 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 1] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 2 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 2] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 3 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 3] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 4 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
^C
It also creates a short progress log and a full test log under ./out.
If a bug exists, the output looks like below (the following is just an example):
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 1 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
./run.sh: line 51: 855011 Aborted RUST_BACKTRACE=1 RUST_MIN_STACK=2000000000 numactl --cpunodebind=0 --membind=0 timeout $TIMEOUT $SCRIPT_DIR/../../target/x86_64-unknown-linux-gnu/release/deps/memento-* $target::test --nocapture &>> $log_tmp
fails with exit code 134
[Test 1] fails with exit code 134
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 2 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
^C
It then generates a bug directory containing a text file with the specific error log (info.txt) and the PM pool files (queue_general.pool_*) of the buggy execution, so that we can debug the DS using them.
For each primitive and DS, we observe no test failures for 1M runs with thread crashes.
Supported tests:
- checkpoint
- detectable_cas
- queue_general: MSQ-mmt-O0 (in the paper)
- queue_lp: MSQ-mmt-O1
- queue: MSQ-mmt-O2
- queue_comb: CombQ-mmt
- treiber_stack: TreiberS-mmt
- list: List-mmt
- clevel: Clevel-mmt
We evaluate the correctness of our primitives and DSs using the existing bug-finding tools Yashme and PSan, which find persistency bugs such as persistency races and missing flushes on top of the model checking framework Jaaru.
cd evaluation/correctness/pmcheck
./scripts/build_pmcpass.sh # may take more than 10 minutes to build LLVM
./build.sh
You can test each DS with the following command:
./run.sh [tested DS] [tool] [mode]
where
- [tested DS] should be replaced with one of the supported tests (listed below).
- [tool]: yashme or psan
- [mode]: random or model (random testing mode or model checking mode, respectively)
For example, the following command tests MSQ-mmt-O0 using PSan in random mode:
./run.sh queue_O0 psan random
Then the output is printed like below:
Jaaru
Copyright (c) 2021 Regents of the University of California. All rights reserved.
Written by Hamed Gorjiara, Brian Demsky, Peizhao Ou, Brian Norris, and Weiyu Luo
Execution 1 at sequence number 198
nextCrashPoint = 83987 max execution seqeuence number: 88289
nextCrashPoint = 2876 max execution seqeuence number: 4161
Execution 2 at sequence number 4161
nextCrashPoint = 1106 max execution seqeuence number: 4171
nextCrashPoint = 1583 max execution seqeuence number: 4181
Execution 3 at sequence number 4181
nextCrashPoint = 3756 max execution seqeuence number: 4166
nextCrashPoint = 31 max execution seqeuence number: 4176
Execution 4 at sequence number 4176
nextCrashPoint = 2400 max execution seqeuence number: 4181
...
******* Model-checking complete: *******
Number of complete, bug-free executions: 10
Number of buggy executions: 0
Total executions: 10
For each primitive and DS, we observe no buggy executions for 1K runs with random mode.
Supported tests:
- checkpoint
- detectable_cas
- queue_O0: MSQ-mmt-O0 (in the paper)
- queue_O1: MSQ-mmt-O1
- queue_O2: MSQ-mmt-O2
- queue_comb: CombQ-mmt
- treiber_stack: TreiberS-mmt
- list: List-mmt
- clevel: Clevel-mmt
We evaluate the performance of CASes with our benchmark. Each implementation of the comparison targets is in evaluation/performance/cas/src/.
cd evaluation/performance/cas
./build.sh
./run.sh # This may take about 3 hours
This creates CSV data and plots under ./out/.
You can run a single benchmark:
./target/release/cas_bench -f <filepath> -a <target> -c <locations> -t <threads> -o <output>
where
- <target>: mcas (CAS-mmt in the paper), pmwcas, nrlcas
- <locations>: number of locations
For example, the following command measures the throughput and memory usage of mcas when using 1000 locations and 16 threads:
./target/release/cas_bench -f /mnt/pmem0/mcas.pool -a mcas -c 1000 -t 16 -o ./out/cas-mmt.csv
- This creates raw CSV data under ./out/cas-mmt.csv.
- To pin the benchmark to NUMA node 0, prefix the command with numactl --cpunodebind=0 --membind=0.
For detailed usage information,
./target/release/cas_bench -h
We evaluate the performance of the Memento-based list compared to other detectable lists. Each implementation of the comparison targets is in evaluation/performance/list/src/. To evaluate the performance of the detectable lists based on Tracking, Capsule, and Capsule-Opt, we use the implementations published by the authors of Detectable Recovery of Lock-Free Data Structures (PPoPP '22).
cd evaluation/performance/list
./build.sh
./run.sh # This may take about 7 hours
This creates CSV data and plots under ./out/.
You can run a single benchmark for list-mmt:
./target/release/bench -f <filepath> -a list-mmt -t <threads> -k <key-range> --insert-ratio <insert-ratio> --delete-ratio <delete-ratio> --read-ratio <read-ratio> -o <outpath>
For example, the following command measures the throughput of list-mmt with a read-intensive workload, when using 16 threads and a key range of 500:
./target/release/bench -f /mnt/pmem0/list-mmt.pool -a list-mmt -t 16 -k 500 --insert-ratio 0.15 --delete-ratio 0.15 --read-ratio 0.7 -o ./out/list-mmt.csv
- This creates raw CSV data under ./out/list-mmt.csv.
- To pin the benchmark to NUMA node 0, prefix the command with numactl --cpunodebind=0 --membind=0.
For detailed usage,
./target/release/bench -h
We refer to https://github.com/ConcurrentDistributedLab/Tracking.
We evaluate the performance of Memento-based queues against other queues. Each implementation of the comparison targets is in evaluation/performance/queue/src/.
cd evaluation/performance/queue
./build.sh
./run.sh # This may take more than 14 hours
This creates CSV data and plots under ./out/.
You can run a single benchmark:
./target/release/bench -f <filepath> -a <target> -k <kind> -t <threads> -i <init_nodes> -o <output>
where
- <target>: memento_queue (MSQ-mmt-O2 in the paper), memento_queue_lp (MSQ-mmt-O1 in the paper), memento_queue_general (MSQ-mmt-O0 in the paper), memento_queue_comb (CombQ-mmt in the paper), durable_queue, log_queue, dss_queue, pbcomb_queue, crndm_queue
- <kind>: pair (enq-deq pair), prob{n} (n% probability enq or (100-n)% deq)
For example, the following command measures the throughput of memento_queue with the pair workload, when using 16 threads:
./target/release/bench -f /mnt/pmem0/mmt.pool -a memento_queue -k pair -t 16 -i 0 -o ./out/mmt.csv
- This creates raw CSV data under ./out/mmt.csv.
- To pin the benchmark to NUMA node 0, prefix the command with numactl --cpunodebind=0 --membind=0.
For detailed usage information,
./target/release/bench -h
To run a single benchmark for PMDK and Clobber-NVM queues, you should use separate executables with the following commands.
PMDK queue:
./target/release/bench_cpp <filepath> <target> <kind> <threads> <duration> <init_nodes> <output> # <target> should be "pmdk_queue"
Clobber-NVM queue:
PMEM_IS_PMEM_FORCE=1 ./src/clobber-nvm/apps/queue/benchmark-clobber -k <kind> -t <threads> -d 8 -s <duration> -i <init_nodes> -o <output>
We use the same benchmark as Persistent Memory Hash Indexes: An Experimental Evaluation (VLDB '21) to evaluate our hash table. Each implementation of the comparison targets is in evaluation/performance/hash/hash/.
ulimit -s 8192000
cd evaluation/performance/hash
./build.sh
./run.sh # This may take about 30 hours
This creates raw txt files containing the measurement results and plots under ./out/.
You can run a single benchmark with the PiBench executable:
cd bin
./PiBench [lib.so] [args...]
where
- [lib.so]: clevel.so, clevel_rust.so (Clevel-mmt in the paper)
- [args...]: please see the Persistent Memory Hash Indexes repo.
For example, the following command measures the search throughput of clevel_rust when using 32 threads with a uniform distribution:
./bin/PiBench ./bin/clevel_rust.so \
    -S 16777216 \
    -p 200000000 \
    -r 1 -i 0 -d 0 \
    -M THROUGHPUT --distribution UNIFORM \
    -t 32
Here, -S is the initial capacity, -p is the number of operations, and -r 1 -i 0 -d 0 means 100% reads, 0% inserts, and 0% deletes.
You can evaluate clevel_rust on top of the PMDK allocator (instead of Ralloc) by appending pmdk to the build command.
For example:
./build.sh pmdk # This builds clevel_rust on top of the PMDK allocator