AisLSM: Revolutionizing the Compaction with Asynchronous I/Os for LSM-tree
Y. Hu and L. Zhu contribute equally to this work. C. Wang is the corresponding author (cd [email protected]).

Abstract—The log-structured merge tree (LSM-tree) is widely employed to build key-value (KV) stores. LSM-tree organizes multiple levels in memory and on disk. The compaction of LSM-tree, which is used to redeploy KV pairs between on-disk levels in the form of SST files, severely stalls its foreground service. We overhaul and analyze the procedure of compaction. Writing and persisting files with fsyncs for compacted KV pairs are time-consuming and, more important, occur synchronously on the critical path of compaction. The user-space compaction thread of LSM-tree stays waiting for completion signals from a kernel-space thread that is processing file write and fsync I/Os. We accordingly design a new LSM-tree variant named AisLSM with an asynchronous I/O model. In short, AisLSM conducts asynchronous writes and fsyncs for SST files generated in a compaction and overlaps CPU computations with disk I/Os for consecutive compactions. AisLSM tracks the generation dependency between input and output files for each compaction and utilizes a deferred check-up strategy to ensure the durability of compacted KV pairs. We prototype AisLSM with RocksDB and io_uring. Experiments show that AisLSM boosts the performance of RocksDB by up to 2.14×, without losing data accessibility or consistency. It also outperforms state-of-the-art LSM-tree variants with significantly higher throughput and lower tail latency.

Index Terms—LSM-tree, Asynchronous I/O, Compaction

I. INTRODUCTION

The log-structured merge tree (LSM-tree) gains wide popularity in building key-value (KV) stores [1]–[11]. LSM-tree appends arriving KV pairs to an on-disk log and inserts them into in-memory memtables, each of which is a structure (e.g., a skiplist) ordered by keys. Once a memtable becomes full according to a preset size limit, LSM-tree makes it immutable. LSM-tree transforms an immutable memtable into a sorted string table (SST) file and puts it onto the tree's top level on disk, i.e., L0. This is referred to as flush¹. LSM-tree defines a capacity limit for each on-disk Ln (n ≥ 0) to hold a number of SST files. The limit of Ln+1 is usually ten times that of Ln. When Ln is full, LSM-tree initiates a compaction, in which LSM-tree merge-sorts KV pairs residing in selected Ln and Ln+1 SST files that have key ranges overlapped (①), writes sorted KV pairs into a new Ln+1 SST file (②), and persists the file with fsync (③). LSM-tree repeats ① to ③ until all KV pairs are persisted in output SST files. Then it deletes input SST files and completes the compaction.

¹Researchers also use 'flush' to describe a program calling fsync to write down a file, which we refer to as 'persist' for distinguishing in this paper.

The foreground operations of logging and insertion with memtable make LSM-tree appealing for write-intensive workloads [12]–[14]. LSM-tree intentionally does flushes and compactions in the background. However, if a few memtables are waiting for flush or many SST files are pending compaction, LSM-tree stalls foreground service [15]–[18]. Such stalls incur significant performance penalty [12, 16, 19]. We have taken RocksDB [3] for a quantitative study. We conduct experiments by running it on an NVMe solid-state drive (SSD). It spends overall 1,399.9 seconds in finishing Put requests for 80GB of data with 16B keys and 1KB values and four foreground threads. However, it stalls for 1,179.1 seconds, i.e., 84.2% of the total time. By forcefully disabling compactions, the throughput of RocksDB increases by 5.7×. This substantial leap motivates us to shorten the critical path of compaction for LSM-tree.

As mentioned, a compaction is composed of three repeated actions, i.e., CPU computation (mainly for merge-sort), file write, and fsync. LSM-tree performs them synchronously [14, 17]. Our study shows that CPU computations, file writes, and fsyncs contribute 47.7%, 6.3%, and 46.0% of the time cost per compaction on average, respectively. In each compaction, RocksDB's user thread runs on a CPU core for computations to prepare sorted KV pairs and then keeps waiting for the completion of file write and fsync, which, however, are conducted by a kernel thread. If we avoid waiting on the critical path of compaction but asynchronously handle I/Os, the performance of LSM-tree should be accelerated. Assuming that a kernel thread is processing I/Os for the current compaction job, LSM-tree's user thread can simultaneously compute for the next compaction job. This summarizes our aim in this paper, i.e., orchestrating CPU computations (resp. user thread) and disk I/Os (resp. kernel thread) to revolutionize the compaction and optimize LSM-tree.

Today's hardware and software jointly provide a promising opportunity for us to do so. For hardware, compared to a conventional hard disk drive (HDD) or SATA SSD, NVMe SSD enables higher processing speed [20, 21]. The aforementioned percentages for file write and fsync I/Os with NVMe SSD roughly match that of CPU computations (6.3% + 46.0% ≈ 47.7%), such that forthcoming computations are unlikely to be blocked by uncompleted asynchronous I/Os that have been scheduled but not finished yet. As to software, researchers have subsumed legacy Linux native AIO with the io_uring framework [22]–[24]. io_uring works in the kernel space with high efficiency and capacious interfaces for asynchronous I/Os.

Not much attention has been paid to the impact of synchronous I/Os on LSM-tree. Kim et al. [25] noticed that persisting data in a batched fsync is more efficient than doing so for multiple batches. They designed BoLT, which aggregates compacted KV pairs in a huge SST file for one fsync. However, BoLT still retains fsyncs on the critical path of compaction. Dang et al. [26] proposed NobLSM, which partly replaces fsyncs with the periodical commits of Ext4. However, NobLSM lacks portability as it relies on Ext4 mounted in the data=ordered mode. Worse, it demands handcrafted customization in the kernel of the operating system (OS).
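To make the io_uring interface concrete before moving on, the snippet below is a minimal, self-contained liburing round-trip of our own making (an illustration, not code from AisLSM or RocksDB; the file name and buffer are placeholders). A write is chained to an fsync and both are submitted in one call, the submitting thread is then free to compute, and the completions are reaped later.

```cpp
// Minimal liburing round-trip (our own illustration, not AisLSM/RocksDB code).
// Build with: g++ -O2 uring_demo.cc -luring
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main() {
  struct io_uring ring;
  if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

  int fd = open("demo.sst", O_CREAT | O_WRONLY | O_TRUNC, 0644);
  static char buf[4096];
  memset(buf, 'k', sizeof(buf));

  // Queue an asynchronous write, chained (IOSQE_IO_LINK) to an asynchronous
  // fsync so the fsync only runs after the write finishes.
  struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
  io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_fsync(sqe, fd, 0);

  io_uring_submit(&ring);  // returns immediately; a kernel worker now owns
                           // both the write and the fsync

  /* ... the submitting thread is free to compute here, e.g., merge-sort ... */

  // Reap the two completions whenever convenient (io_uring_peek_cqe offers a
  // non-blocking variant of this check).
  for (int i = 0; i < 2; i++) {
    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("completion %d: res=%d\n", i, cqe->res);
    io_uring_cqe_seen(&ring, cqe);
  }
  close(fd);
  io_uring_queue_exit(&ring);
  return 0;
}
```

This decoupling of submission from completion is exactly what the design in this paper builds on: submission stays on the compaction thread, while completion is checked later.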
When leveraging asynchronous I/Os to revolutionize the compaction, we shall neither keep fsync on the critical path nor incur changes to system software. In addition, as conventional LSM-tree employs synchronous I/Os, all compacted KV pairs become both visible for reading and durable for recovery at the end of a compaction. In other words, these KV pairs simultaneously gain visibility and durability. In contrast, asynchronous I/Os introduce uncertainty to such properties.

With the foregoing observations and concerns, we propose an LSM-tree variant named AisLSM. AisLSM employs asynchronous file write and fsync for each new output SST file that a compaction generates from existing input SST files, thereby removing synchronous I/Os from the critical path. It calls io_uring interfaces to do so, without changing the OS's kernel, file system, or storage device. The completion of asynchronously writing an output SST file makes the file's KV pairs steadily accessible in the OS's page cache or the device's disk cache, so the visibility of KV pairs is enabled. The output file may not be durable yet. However, provided that any input file in which the KV pairs have stayed is durable, the durability of the KV pairs is still guaranteed. AisLSM retains durable input files to protect the durability of compacted KV pairs until it perceives the durability of output files. Concretely, AisLSM decouples the durability from visibility for compacted KV pairs. The main contributions of this paper are as follows.

• We analytically overhaul the compaction procedure for LSM-tree. We quantitatively reveal the significant impact of synchronous writes and fsyncs employed in each compaction on the performance of LSM-tree.

• We revolutionize the compaction procedure with asynchronous file writes and fsyncs. With a kernel-space thread simultaneously doing asynchronous disk I/Os in the background, AisLSM's user-space thread swiftly initiates the next compaction job and starts computations. The critical path of compaction is substantially shortened.

• We guarantee the durability of KV pairs. We retain the fsync on every L0 SST file flushed from a memtable to build a solid foundation for durability, as SST files placed at lower levels than L0 can be viewed as descendants of L0 SST files. For each compaction, we track the generation dependency between input and output SST files. Input ones are not instantly deleted. We defer the check-up of durability for output SST files of a past compaction until any one of them participates as input in the current compaction. If they are durable, we delete the input files from which they were generated.

AisLSM is orthogonal to techniques like hot/cold data handling [19, 27], key-value separation [19, 28, 29], and concurrent or pipelined compactions [14, 17, 30] proposed in previous works. The shortened critical path of compaction that AisLSM brings about complements those techniques. We prototype AisLSM by modifying RocksDB with io_uring. Experiments confirm that AisLSM dramatically boosts the performance of RocksDB, with up to 2.14× throughput. It also significantly outperforms state-of-the-art designs, including ADOC [16], TRIAD [19], Rocks-bu [31], SILK [12], PhotonDB [32], and NobLSM [26]. For example, in a write-intensive test, the tail latency of AisLSM is 48.8%, 51.9%, 99.0%, 59.8%, 16.9%, 50.4%, and 61.4% less than that of RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM, respectively. Such a substantial gap justifies the efficacy of AisLSM's asynchronous I/O model for compaction. We also verify that AisLSM has no loss of accessibility or recoverability for data.

The remainder of this paper is organized as follows. In Section II we present the background of LSM-tree and asynchronous I/Os. We brief our motivational study in Section III. We detail the design and implementation of AisLSM in Sections IV and V, respectively. We quantitatively evaluate AisLSM in Section VI. We compare AisLSM to related works in Section VII and conclude the paper in Section VIII.

II. BACKGROUND

A. LSM-tree

Fig. 1: The Architecture of RocksDB

RocksDB is a typical LSM-tree variant [3]. We take it to illustrate the architecture and operations of LSM-tree. As shown by Figure 1, RocksDB is made of in-memory and on-disk components, resembling a tiered tree-like structure. RocksDB uses a skiplist ordered by keys as the in-memory memtable. The memtable functions as a buffer. Once a Put request arrives with a KV pair, RocksDB inserts the KV pair into a mutable memtable after appending it to the tail of the on-disk write-ahead log (WAL). RocksDB sets a size limit for the memtable (64MB by default). A fully filled memtable is made immutable to serve search requests only, and RocksDB creates a new mutable one. By default, RocksDB maintains one mutable memtable and one immutable memtable at runtime. It keeps a background user-space thread that transforms and flushes the immutable memtable to be an SST file.
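The write path just described can be condensed into the following simplified sketch. It is our own illustration rather than RocksDB source: std::map stands in for the skiplist, and the class and function names are hypothetical.

```cpp
// Simplified sketch of the write path described above (our own illustration,
// not RocksDB source). std::map stands in for the skiplist.
#include <fstream>
#include <map>
#include <string>

class TinyLSM {
 public:
  explicit TinyLSM(const std::string& wal_path)
      : wal_(wal_path, std::ios::app) {}

  void Put(const std::string& key, const std::string& value) {
    // 1. Append the update to the write-ahead log (WAL). RocksDB syncs the
    //    WAL only when the caller requests it.
    wal_ << key << '\t' << value << '\n';
    wal_.flush();
    // 2. Insert into the mutable memtable, which is ordered by key.
    mem_[key] = value;
    bytes_ += key.size() + value.size();
    // 3. When the memtable reaches its size limit (64MB by default in
    //    RocksDB), freeze it and let a background thread flush it to an
    //    L0 SST file.
    if (bytes_ >= kMemtableLimit) {
      imm_.swap(mem_);   // the immutable memtable still serves reads
      bytes_ = 0;
      FlushToL0();       // performed by a background thread in RocksDB
    }
  }

 private:
  void FlushToL0() {
    // Write imm_ out as a sorted SST file on L0, persist it, then drop it.
    imm_.clear();
  }

  static constexpr size_t kMemtableLimit = 64ull << 20;
  std::map<std::string, std::string> mem_, imm_;
  size_t bytes_ = 0;
  std::ofstream wal_;
};
```

The compaction machinery discussed in the rest of the paper operates on the SST files that such flushes keep producing at L0.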
On the completion of flush, RocksDB persists the SST file on the top on-disk level, i.e., L0, via fsync and deletes

with conventional compaction on NVMe SSD (see Figure 4). As we reschedule and overlap file write I/Os alongside CPU computations (②, ③, and ⑤ in Figure 6), the compaction thread is unlikely to wait a long time for asynchronous file writes.

Asynchronous fsyncs for deferred durability. In contrast to waiting for asynchronous writes, AisLSM does not stall to pend the completion of the asynchronous fsync (⑧ in Figure 6). It also does not immediately remove old input SST files like conventional compaction does. AisLSM retains input SST files to back the durability of compacted KV pairs since new output SST files are not synchronously persisted. AisLSM defers the completion check-up of persisting new output SST files until they are chosen as input for a future compaction. At that moment, the old SST files from which they were generated can be safely discarded (see Section IV-D).

Fig. 7: An example of AisLSM's compaction

C. Inter-compaction Pipelining

Because AisLSM leaves computations only on the critical path of compaction, the compaction thread swiftly finishes the current job and is soon ready to take the next compaction job. When the next compaction's computations are ongoing on the CPU, the storage device is handling fsync for the previous compaction. In this way, AisLSM pipelines CPU computation and disk I/Os for consecutive compactions. A conventional compaction thread arranges computations and I/Os in a strictly serial sequence; hence, when I/Os are being processed, the CPU core stays idle in the meantime, and vice versa. AisLSM, however, neatly engages the CPU core in computing for a newer compaction while a kernel thread is simultaneously dealing with storage I/Os for a prior compaction. As a result, AisLSM embraces high utilization of both CPU and storage.

D. Deferred Deletion upon Asynchronous fsyncs

For a flush that transforms an immutable memtable to an L0 SST file, AisLSM synchronously calls fsync to persist the file. This fsync builds a solid foundation for the durability of KV pairs. AisLSM views L0 SST files as the ancestors of all SST files staying at lower levels to be generated in afterward compactions. Each compaction can be viewed as a process of generating offspring output Ln+1 SST files from parental input Ln and Ln+1 SST files (n ≥ 0). With regard to asynchronous fsyncs, AisLSM needs a time at which it checks up whether offspring SST files have been concretely persisted and parental SST files can be accordingly deleted. As LSM-tree steadily grows to more and more levels by compactions and each SST file has a high likelihood of participating in a future compaction, AisLSM does the check-up when every compaction is about to load KV pairs from input SST files.

AisLSM does the deferred check-up and deletion as follows. A compaction takes in a set of p input SST files as parents (p ≥ 1). Let us denote the set as P̂ (|P̂| = p). All members of P̂ used to be offspring SST files generated in previous q flushes or compactions (1 ≤ q ≤ p). AisLSM synchronously makes L0 SST files durable. For any other file that stays at Ln (n ≥ 1) and is to participate in the current compaction, AisLSM has tracked in which past compaction, say ζ, the file was submitted for asynchronous fsync. As AisLSM calls asynchronous fsync for a compound batch of all SST files per compaction, it checks whether the entire batch for ζ is already persisted or not. If so, AisLSM safely deletes the SST files that had been used as input parents for ζ. Otherwise, AisLSM synchronously waits for the completion of the asynchronous fsync, which, as observed in our empirical tests, is very rare in practice. Then AisLSM deletes the parental SST files for ζ. Those input SST files for ζ thus can be viewed as the grandparents of the output SST files that the current compaction job is going to generate.

Let us reuse Figure 5 for illustration. At T1, L1^(3), L1^(4), and L1^(5) are not durable yet and AisLSM keeps L0^(0), L0^(1), and L1^(2) until T2. At T2, as L1^(4) and L1^(5) participate in Compaction 2 as input, AisLSM checks whether the asynchronous fsync performed on the three output files that Compaction 1 generated is completed or not. If so, it safely deletes L0^(0), L0^(1), and L1^(2).

V. IMPLEMENTATION

We leverage io_uring to implement AisLSM with RocksDB (Section V-A). We also comprehensively consider multiple aspects to optimize and enhance AisLSM (Section V-B).

A. Implementation of AisLSM

Overview. We take RocksDB to prototype AisLSM while the ideas of AisLSM can be applied to other LSM-tree variants. Doing asynchronous I/Os to revolutionize the procedure of compaction is orthogonal to other optimization techniques proposed to enhance LSM-tree. We mainly make use of io_uring to implement AisLSM's asynchronous writes and fsyncs. Overall, the core functions of AisLSM add or change about 1,624 lines of code (LOC) in RocksDB version 7.10.0.

Compaction procedure. AisLSM follows RocksDB to 1) flush an immutable memtable as an L0 SST file, 2) maintain background threads for flush and compaction jobs, and 3) calculate scores to choose an overfilled level and input SST files with key ranges overlapped for compaction. Figure 7 illustrates the eight main steps with which AisLSM handles a compaction. In these steps, AisLSM uses io_uring's structures such as uring_queue to collect data for each SST file. It calls io_uring's interfaces such as io_uring_prep_fsync and io_uring_submit to prepare an asynchronous fsync and submit an I/O request, respectively.

Deferred check-up and deletion. At the beginning of a compaction, AisLSM checks whether the input parental SST files are already durable (⑧). If so, it removes grandparental SST files by inserting them into a collection vector that RocksDB has managed for the purpose of deleting files. The check-up does not cost much time. For example, in dealing with the aforementioned test of putting 80GB of KV pairs, AisLSM spent overall 749.1 seconds, out of which all check-up actions cost about 0.01 ms. Such a time cost is negligible.

Version and state tracking. RocksDB has a Manifest file with an in-memory Version to record the change of SST files (see Figure 1). As AisLSM decouples the visibility and durability for Ln SST files (n ≥ 1), it tracks and updates the state of each Ln SST file in the Manifest file and Version.

the check-up. Searching KV pairs in this file is also unaffected, since the file system accommodates the KV pairs in the OS's buffer cache or the storage device's disk cache. AisLSM explicitly calls fsync for a retry to fix the I/O error. In the worst case, it regenerates and replaces that problematic file.

Outlier SST files. In unusual cases, some SST files, once generated in a compaction, hardly participate in subsequent compactions, because the key ranges they cover might not be frequently used (i.e., outliers). AisLSM still ensures the durability of such inactive outlier SST files. This is the other reason why AisLSM submits one request for all SST files generated in a compaction to schedule a compound asynchronous fsync. As long as any one of them is to be involved in a future compaction, AisLSM checks whether the asynchronous fsync has been done for all relevant SST files. By doing so, AisLSM avoids overlooking outliers. This also helps to delete their parental SST files. In addition, there might be a very low likelihood that outliers form a batch and have no opportunity to be compacted again. AisLSM has tracked all SST files asynchronously persisted with io_uring. It schedules a specific check-up in off-peak hours for such outliers.
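Putting Sections IV-D and V-A together, the sketch below shows one way the compound asynchronous fsync and the deferred check-up could be expressed with liburing. It reflects our reading of the described steps rather than AisLSM's actual code: the helper names are our own, and error handling of cqe->res, the uring_queue bookkeeping, and RocksDB's Version updates are left out.

```cpp
// Hedged sketch of the compound asynchronous fsync and the deferred check-up
// (our reading of Sections IV-D and V-A, not AisLSM's code). Assumes
// io_uring_queue_init(256, &g_ring, 0) was called at start-up.
#include <liburing.h>
#include <unistd.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct CompactionBatch {                  // one record per past compaction zeta
  std::vector<int> out_fds;               // its output SSTs (writes already awaited)
  std::vector<std::string> parents;       // its input SSTs, kept until outputs are durable
  unsigned pending = 0;                   // fsync completions still outstanding
};

static struct io_uring g_ring;
static std::unordered_map<uint64_t, CompactionBatch> g_batches;

// Steps 6 and 7 in Figure 7: one asynchronous fsync per output file, submitted
// as a compound batch; the compaction thread does not wait here.
void SubmitBatchedFsync(uint64_t zeta, CompactionBatch batch) {
  for (int fd : batch.out_fds) {
    struct io_uring_sqe* sqe = io_uring_get_sqe(&g_ring);
    io_uring_prep_fsync(sqe, fd, 0);
    io_uring_sqe_set_data64(sqe, zeta);   // tag each CQE with its compaction
    batch.pending++;
  }
  io_uring_submit(&g_ring);
  g_batches.emplace(zeta, std::move(batch));
}

// Step 8: when a later compaction takes one of zeta's outputs as input, make
// sure the whole batch is durable, then delete zeta's parental SST files.
void DeferredCheckupAndDelete(uint64_t zeta) {
  auto it = g_batches.find(zeta);
  if (it == g_batches.end()) return;      // batch settled earlier
  while (it->second.pending > 0) {
    struct io_uring_cqe* cqe;
    if (io_uring_peek_cqe(&g_ring, &cqe) != 0)   // non-blocking check first;
      io_uring_wait_cqe(&g_ring, &cqe);          // waiting is rare in practice
    auto done = g_batches.find(io_uring_cqe_get_data64(cqe));
    if (done != g_batches.end()) done->second.pending--;
    io_uring_cqe_seen(&g_ring, cqe);
  }
  for (const std::string& p : it->second.parents)
    unlink(p.c_str());                    // grandparents can now be dropped
  g_batches.erase(it);
}
```

A real implementation would additionally retry or regenerate any output file whose fsync completes with an error, as described above.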
Fig. 8: A comparison between LSM-tree variants on db_bench's fillrandom, overwrite, readseq, and readrandom (throughput in MB/s against value size; (a) fillrandom, (b) overwrite, (c) readseq, (d) readrandom; RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, NobLSM, AisLSM)
RocksDB by 14.1% higher throughput. For Rocks-bu, the use of io_uring's batch I/O alone even degrades performance by 4.2% compared against original RocksDB.

NobLSM is inferior to AisLSM with both the fillrandom and overwrite workloads. The most significant gap between them is 1.53× with fillrandom and 4KB values. The reason is twofold. Firstly, NobLSM does not consider scheduling file write I/Os but still conducts them synchronously, whereas AisLSM asynchronously deals with both file write and fsync I/Os. Secondly, the time cost of checking whether SST files are asynchronously committed is non-trivial, particularly with a fast NVMe SSD. For each compaction, NobLSM submits all output SST files for tracking with one customized system call and later asks Ext4 to check every file, resulting in multiple system calls. NobLSM employs a global kernel-space table to record and track SST files. It is time-consuming to insert and query each file with the table, especially when many SST files gradually accumulate due to continuous compactions. Comparatively, for one asynchronous fsync, AisLSM submits the request and collects the result by calling the respective io_uring interfaces only once. In addition, AisLSM does not rely on any particular file system or handcrafted Linux kernel, which is a stark contrast to NobLSM. To sum up, AisLSM is much more effectual and portable than NobLSM.

Thirdly, although LSM-tree is generally used to serve write-intensive workloads, we have tested AisLSM's capability in serving read requests. As shown in Figure 8c and Figure 8d, AisLSM is comparable to RocksDB and NobLSM, while some LSM-tree variants exhibit dramatically low performance. For example, the throughputs of TRIAD and PhotonDB are just 15.2% and 13.5% of AisLSM's, respectively, with readrandom and 1KB values. The change AisLSM incurs to the read procedure is just to load KV pairs from a transiently non-durable SST file. This, however, does not affect the visibility of data or the actions of locating a specific key. AisLSM thus achieves a good balance between write and read performance while, for instance, TRIAD keeps too many files at L0 with overlapped key ranges that are not friendly to searches [19, 51]–[53].

C. Deep Dissection with AisLSM

We have done various experiments to deeply evaluate AisLSM. We validate whether it guarantees crash consistency (Section VI-C1). We test whether it ensures the accessibility of KV pairs (Section VI-C2). As AisLSM shortens the critical path of compaction, we measure how much it reduces the user-facing tail latency (Section VI-C3). Regarding the implementation and optimization techniques AisLSM contains, we further analyze the contribution from each of them and figure out the root cause of the performance boost for AisLSM (Section VI-C4). We next test whether AisLSM works on another platform (Section VI-C5), with multiple compaction threads (Section VI-C6), and with multiple instances (Section VI-C7).

1) Crash Consistency Test: To test the crash consistency of AisLSM, we use the command 'halt -f -p -n' to suddenly power off Linux when writing KV pairs with db_bench's fillrandom [26]. We repeat this test five times successively with RocksDB and AisLSM. We find that, besides the KV pairs being appended to the WAL, those residing in SST files are recoverable and retrievable for both RocksDB and AisLSM. By default, they do not persist WALs with fsyncs. The main difference between them is that AisLSM does not wait for the durability of output SST files per compaction. It also does not immediately delete input SST files. AisLSM flushes L0 SST files to persist all KV pairs received from users. Only after check-up will it delete SST files used as input for past compactions. By tracking the generation dependency between SST files, AisLSM guarantees that any KV pair sinking down from Ln to Ln+1 (n ≥ 0) is traceable and durable. All these jointly enable AisLSM's crash recoverability.

2) Data Accessibility Test: We measure whether AisLSM manages to find all KV pairs under search. To do so, we first run fillrandom by engaging a foreground thread in putting down 20GB of KV pairs with various value sizes and then search keys with readrandom. We note that newer RocksDB since version 6.2 no longer guarantees that db_bench always searches for a stored key. Instead, db_bench randomly generates target keys and some of them might not exist in the LSM-tree. The randomization is based on a seed related to the search time by default. RocksDB provides an option to fix the seed so that we can repeat the same readrandom test case. Under equivalent search conditions, AisLSM and RocksDB locate the same number of KV pairs, with about 63.2% of all searched keys found. This justifies that AisLSM's data accessibility is identical to that of RocksDB.

3) Tail Latency: In addition to throughput, the tail latency is another critical performance metric, especially for latency-
TABLE I: The tail latency (99P) for LSM-tree variants

Value size   RocksDB   ADOC      TRIAD     Rocks-bu   SILK   PhotonDB   NobLSM    AisLSM
64B          3.9       3.7       9.9       3.8        5.3    4.0        5.0       3.4
256B         8.5       8.9       11.3      8.4        5.0    10.2       8.1       4.0
1KB          20.1      21.4      1,030.3   25.6       12.4   20.7       26.6      10.3
4KB          1,932.0   2,033.7   1,225.0   1,167.0    9.3    1,584.0    2,049.0   21.0

Fig. 10: The impacts of platform and compaction threads (throughput in MB/s for RocksDB and AisLSM against value size and against 1, 2, 4, and 8 compaction threads)

Fig. 12: Service time of LSM-tree variants under YCSB: (a) Load-A, (b) Workload A, (c) Workload B, (d) Workload C, (e) Workload F, (f) Workload D, (g) Load-E, (h) Workload E (RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, NobLSM, AisLSM-fsync, AisLSM-interrupt, AisLSM)
approach for high throughput and scalability. p2KVS [30] is one representative using RocksDB as its instance. We also configure AisLSM as its instance. We set four instances for both p2KVS (RocksDB) and AisLSM and engage each instance in serving 20 million requests with the fillrandom and overwrite workloads. Figure 11a and Figure 11b comparatively present the throughputs of AisLSM and p2KVS on handling the two workloads, respectively. AisLSM has an evident advantage over p2KVS on both workloads. For example, with 1KB values, AisLSM yields 1.5× higher throughput than p2KVS. AisLSM differs from p2KVS in that an instance of the former is more performant than an instance of RocksDB used by the latter. Given identical strategies for sharding and scheduling KV pairs among multiple instances, an instance of AisLSM processes requests at a much more prompt pace than p2KVS. AisLSM is hence more efficient than p2KVS. As to overwrite, AisLSM outperforms p2KVS at all value sizes. The highest gap between them is 2.0× with 4KB values.

D. Macro-benchmark

The Yahoo! Cloud Serving Benchmark (YCSB) [43] is a comprehensive, open-source tool that is widely used to evaluate the performance of LSM-tree-based KV stores. YCSB provides six core workloads that emulate access patterns found in typical production environments. They are A (update heavy, 50%/50% read/write), B (read mostly, 95%/5% read/write), C (read only, 100% read), D (read latest, 95%/5% read/insert), E (short ranges, 95%/5% range query/insert), and F (read-modify-write, 50%/50% read-modify-write/read). These workloads are either write- or read-dominant, or a mix of write and read requests. YCSB's default size per KV pair is about 1KB. We run YCSB workloads in the order of Load-A, A, B, C, F, D, Load-E, and E by referring to previous works [25, 26, 56]. We make Load-A and Load-E remove existing data in each LSM-tree variant and put down 50 million KV pairs. They hence store roughly 50GB of data as the base. Every workload carries ten million requests to be served. Figure 12a to Figure 12h capture the service time for each LSM-tree in such an order. Note that for AisLSM we show the results for all its three variants mentioned in Section VI-C4. From these diagrams we can obtain four observations.

Firstly, with write-dominant workloads, such as Load-A and Load-E, AisLSM variants consistently yield higher performance than state-of-the-art LSM-tree variants. For example, with Load-A, the time RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM spent is 1.8×, 1.8×, 2.5×, 1.8×, 2.3×, 2.2×, and 1.1× that of AisLSM, respectively. This improvement is again accredited to the novel compaction procedure revolutionized by AisLSM with the asynchronous I/O model. The shortened time cost of compaction entails less stall time, thereby processing workloads with less service time. Secondly, as to read-dominant workloads, including workloads B, C, D, and E, the AisLSM family yields comparable or higher performance than state-of-the-art LSM-tree variants. For example, with workload B, the service time of RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM is 1.4×, 1.1×, 3.5×, 1.2×, 1.2×, 1.9×, and 1.2× that of AisLSM, respectively. These results align with what we have obtained with db_bench's readrandom. As illustrated by Figure 8d, TRIAD and PhotonDB have shown the lowest throughputs. Thirdly, for a workload mixed with write and read requests, such as workloads A and F, AisLSM still performs better than other LSM-tree variants. For example, to finish these two workloads, original RocksDB demands 19.1% and 17.2% more time than AisLSM, respectively. Last but not least, AisLSM-interrupt, which is a variant of AisLSM with interrupt-driven I/Os, is a bit faster than AisLSM in handling workloads with read requests. For example, with workload E for range queries, AisLSM-interrupt working in the buffered I/O mode costs 4.9% less time than AisLSM working in the direct I/O mode. As mentioned, although the access speed of NVMe SSD is higher than legacy storage devices, the use of the OS's or LSM-tree's buffers would be helpful to serve read requests.

E. The Impact of Key's Distribution

To observe the impact of the distribution of keys on the performance of AisLSM, we conduct experiments with YCSB's
workloads under three distributions, i.e., Zipfian, uniform, and latest. Because of space limitation, we present the execution time with workloads A and F, as both of them contain a 50%/50% mix of write and read requests. As indicated by Figure 13, AisLSM outperforms other LSM-tree variants under the Zipfian and uniform distributions. With the latest distribution, AisLSM achieves performance comparable to RocksDB. The reason is that the latest distribution always chooses the most recent data for operation. Write and read requests are hence mostly satisfied at LSM-tree's memtables and block cache. As updates to the same keys are repeatedly merged at the memtable level, compactions are not largely triggered. The performance gain of the revolutionized compaction is consequently marginal for AisLSM.

Fig. 13: A comparison between LSM-tree variants with different distributions ((a) YCSB workload A, (b) YCSB workload F; Zipfian, uniform, and latest distributions for each)

VII. RELATED WORKS

We have quantitatively discussed and evaluated a few prior works in Section VI. Some of their techniques have been proved to be useful to reduce the performance penalty caused by compactions. For example, the way TRIAD separates hot and cold KV pairs was also considered by other works. Huang et al. [27] found that even a small number of frequently updated hot KV pairs would quickly fill up SST files and cause more compaction jobs over time. They accordingly install an auxiliary log to distinguish and handle hot data. Decoupling values from keys is another technique that can effectively lower the frequency of compactions, since a pointer (location) to each actual value, instead of the entire value, is stored in the SST file [19, 28, 29]. Some researchers proposed concurrent compactions [14, 30, 57]. For example, p2KVS [30] mentioned in Section VI-C7 partitions the KV space into independent spaces and manages multiple instances correspondingly. Such instances concurrently schedule and perform compaction jobs. There are also research works that leverage buffers to accelerate search performance for LSM-tree [51, 55]. AisLSM's revolutionized compaction is complementary to these techniques and they can collaboratively take effect for high performance.

Foregoing designs mainly work at the granularity of compaction jobs. Some researchers considered dissecting the internals of a compaction. For example, Zhang et al. [17] tried to make use of the parallelism between CPU and I/O device, like what we have done in this paper. However, they decomposed a compaction job in the granularity of blocks (4KB by default). Then they tried to pipeline CPU computation and synchronous I/Os for consecutive blocks of every SST file on multiple parallel storage devices for high I/O bandwidth to catch up with the computing speed of the CPU. However, this demands changes across user- and kernel-spaces for storage management. Additionally, their pipeline might not be stable over time, since small data in one or few blocks is difficult for CPU and storage to process at a steadily stable speed. Comparatively, AisLSM takes effect at the granularity of an SST file in scores of megabytes and exploits the existing storage stack (e.g., io_uring and NVMe SSD) for implementation. It shall have higher performance, viability, and stability.

Not many researchers have considered the impact of fsyncs used in compactions on the performance of LSM-tree [25, 26]. Prior to the aforementioned NobLSM, Kim et al. [25] proposed BoLT. BoLT produces one huge output SST file for each compaction and persists all compacted KV pairs in one aggregated fsync. This reduces the performance penalty caused by multiple fsync calls on individual SST files, but the eventual large fsync still synchronously occurs on the critical path.

Researchers also studied the processing speeds of computations and I/Os for compaction. Some of them used FPGAs to accelerate computations [58, 59]. Emerging storage devices like NVMe SSD and persistent memory (pmem) also attracted wide attention [20, 47, 60]. For example, Chen et al. [20] proposed SpanDB, which jointly makes use of a faster NVMe SSD and an ordinary slower SSD to suit the characteristics of WAL and SST files for storage. It also uses I/O polling with NVMe SSD. In addition, SpanDB and p2KVS share similarity in dedicating separate foreground and background threads to serve user requests and to do flush or compaction jobs, respectively. By doing so, SpanDB aims to overlap foreground services with background jobs at runtime. Meanwhile, researchers developed LSM-tree variants [6, 7, 13, 14, 29, 61] to leverage the non-volatility and byte-addressability of pmem. However, the winding down of Intel's Optane DC memory business [62, 63] may impact their deployment.

VIII. CONCLUSION

In this paper, we overhaul the compaction procedure of LSM-tree. The critical path of a compaction job is composed of CPU computations and disk I/Os. At runtime, LSM-tree's compaction thread synchronously waits for the completion of file write and fsync I/Os that a kernel thread is handling. We accordingly develop AisLSM, which overlaps CPU computations (resp. user thread) with disk I/Os (resp. kernel thread) for consecutive compactions and, particularly, performs disk I/Os with an asynchronous model. AisLSM also decouples the visibility from durability for compacted KV pairs. With a deferred check-up and deletion strategy, AisLSM ensures that data stored in SST files is visible and durable. We thoroughly evaluate AisLSM. Experiments show that, by shortening the critical path of compaction, AisLSM highly boosts the performance of LSM-tree and outperforms state-of-the-art designs.
REFERENCES

[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," in 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2006, pp. 205–218.
[2] S. Ghemawat and J. Dean, "LevelDB," March 2011, https://github.com/google/leveldb.
[3] F. D. E. Team, "RocksDB," October 2017, https://rocksdb.org/.
[4] T. A. S. Foundation, "Apache HBase," January 2009, https://hbase.apache.org/.
[5] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010. https://doi.org/10.1145/1773912.1773922
[6] S. Kannan, N. Bhat, A. Gavrilovska, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Redesigning LSMs for nonvolatile memory with NoveLSM," in 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, Jul. 2018, pp. 993–1005. https://www.usenix.org/conference/atc18/presentation/kannan
[7] O. Kaiyrakhmet, S. Lee, B. Nam, S. H. Noh, and Y. ri Choi, "SLM-DB: Single-level key-value store with persistent memory," in 17th USENIX Conference on File and Storage Technologies (FAST 19). Boston, MA: USENIX Association, Feb. 2019, pp. 191–205. https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet
[8] B. Lepers, O. Balmau, K. Gupta, and W. Zwaenepoel, "KVell: The design and implementation of a fast persistent key-value store," in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 447–461. https://doi.org/10.1145/3341301.3359628
[9] K. Ren, Q. Zheng, J. Arulraj, and G. Gibson, "SlimDB: A space-efficient key-value storage engine for semi-sorted data," Proc. VLDB Endow., vol. 10, no. 13, pp. 2037–2048, Sep. 2017. https://doi.org/10.14778/3151106.3151108
[10] R. Wang, J. Wang, P. Kadam, M. Tamer Özsu, and W. G. Aref, "dLSM: An LSM-based index for memory disaggregation," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023, pp. 2835–2849.
[11] H. Saxena, L. Golab, S. Idreos, and I. F. Ilyas, "Real-time LSM-trees for HTAP workloads," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023, pp. 1208–1220.
[12] O. Balmau, F. Dinu, W. Zwaenepoel, K. Gupta, R. Chandhiramoorthi, and D. Didona, "SILK: Preventing latency spikes in log-structured merge key-value stores," in 2019 USENIX Annual Technical Conference (USENIX ATC 19). Renton, WA: USENIX Association, Jul. 2019, pp. 753–766. https://www.usenix.org/conference/atc19/presentation/balmau
[13] T. Yao, Y. Zhang, J. Wan, Q. Cui, L. Tang, H. Jiang, C. Xie, and X. He, "MatrixKV: Reducing write stalls and write amplification in LSM-tree based KV stores with matrix container in NVM," in 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, July 2020, pp. 17–31. https://www.usenix.org/conference/atc20/presentation/yao
[14] Y. Chen, Y. Lu, F. Yang, Q. Wang, Y. Wang, and J. Shu, "FlatStore: An efficient log-structured key-value storage engine for persistent memory," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1077–1091. https://doi.org/10.1145/3373376.3378515
[15] A. Mahajan, "Write Stalls for RocksDB," October 2021, https://github.com/facebook/rocksdb/wiki/Write-Stalls.
[16] J. Yu, S. H. Noh, Y. ri Choi, and C. J. Xue, "ADOC: Automatically harmonizing dataflow between components in log-structured key-value stores for improved performance," in 21st USENIX Conference on File and Storage Technologies (FAST 23). Santa Clara, CA: USENIX Association, Feb. 2023, pp. 65–80. https://www.usenix.org/conference/fast23/presentation/yu
[17] Z. Zhang, Y. Yue, B. He, J. Xiong, M. Chen, L. Zhang, and N. Sun, "Pipelined compaction for the LSM-tree," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 777–786.
[18] Y. Chai, Y. Chai, X. Wang, H. Wei, N. Bao, and Y. Liang, "LDC: A lower-level driven compaction method to optimize SSD-oriented key-value stores," in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019, pp. 722–733.
[19] O. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan, A. Arora, K. Gupta, and P. Konka, "TRIAD: Creating synergies between memory, disk and log in log structured key-value stores," in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, July 2017, pp. 363–375. https://www.usenix.org/conference/atc17/technical-sessions/presentation/balmau
[20] H. Chen, C. Ruan, C. Li, X. Ma, and Y. Xu, "SpanDB: A fast, cost-effective LSM-tree based KV store on hybrid storage," in 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, Feb. 2021, pp. 17–32. https://www.usenix.org/conference/fast21/presentation/chen-hao
[21] J. Chu, Y. Tu, Y. Zhang, and C. Weng, "Latte: A native table engine on NVMe storage," in 2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, pp. 1225–1236.
[22] J. Axboe, "Efficient IO with io_uring," October 2019, https://kernel.dk/io_uring.pdf.
[23] J. Corbet, "The rapid growth of io_uring," January 2020, https://lwn.net/Articles/810414/.
[24] B. Mottahedeh, "An introduction to the io_uring asynchronous I/O framework," May 2020, https://blogs.oracle.com/linux/post/an-introduction-to-the-io-uring-asynchronous-io-framework.
[25] D. Kim, C. Park, S.-W. Lee, and B. Nam, "BoLT: Barrier-optimized LSM-tree," in Proceedings of the 21st International Middleware Conference, ser. Middleware '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 119–133. https://doi.org/10.1145/3423211.3425676
[26] H. Dang, C. Ye, Y. Hu, and C. Wang, "NobLSM: An LSM-tree with non-blocking writes for SSDs," in Proceedings of the 59th ACM/IEEE Design Automation Conference, ser. DAC '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 403–408. https://doi.org/10.1145/3489517.3530470
[27] K. Huang, Z. Jia, Z. Shen, Z. Shao, and F. Chen, "Less is more: De-amplifying I/Os for key-value stores with a log-assisted LSM-tree," in 2021 IEEE 37th International Conference on Data Engineering (ICDE), April 2021, pp. 612–623.
[28] L. Lu, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "WiscKey: Separating keys from values in SSD-conscious storage," in 14th USENIX Conference on File and Storage Technologies (FAST 16). Santa Clara, CA: USENIX Association, February 2016, pp. 133–148. https://www.usenix.org/conference/fast16/technical-sessions/presentation/lu
[29] W. Kim, C. Park, D. Kim, H. Park, Y. ri Choi, A. Sussman, and B. Nam, "ListDB: Union of write-ahead logs and persistent SkipLists for incremental checkpointing on persistent memory," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 161–177. https://www.usenix.org/conference/osdi22/presentation/kim
[30] Z. Lu, Q. Cao, H. Jiang, S. Wang, and Y. Dong, "p2KVS: A portable 2-dimensional parallelizing framework to improve scalability of key-value stores on SSDs," in Proceedings of the Seventeenth European Conference on Computer Systems, ser. EuroSys '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 575–591. https://doi.org/10.1145/3492321.3519567
[31] PingCAP-Hackthon2019-Team17, "Io-uring speed the rocksdb & tikv," October 2019, https://openinx.github.io/ppt/io-uring.pdf.
[32] A. Cloud, "PhotonLibOS," July 2022, https://github.com/alibaba/PhotonLibOS.
[33] L. Torvalds, "Re: [patch 09/13] aio: add support for async openat()," January 2016, https://lwn.net/Articles/671657/.
[34] Intel, "Storage performance development kit," January 2023, https://spdk.io/.
[35] J. Yang, D. B. Minturn, and F. Hady, "When poll is better than interrupt," in Proceedings of the 10th USENIX Conference on File and Storage Technologies, ser. FAST'12. USA: USENIX Association, Feb. 2012, p. 3.
[36] H.-J. Kim, Y.-S. Lee, and J.-S. Kim, "NVMeDirect: A user-space I/O framework for application-specific optimization on NVMe SSDs," in Proceedings of the 8th USENIX Conference on Hot Topics in Storage and File Systems, ser. HotStorage'16. USA: USENIX Association, 2016, pp. 41–45.
[37] B. Peng, H. Zhang, J. Yao, Y. Dong, Y. Xu, and H. Guan, "MDev-NVMe: A NVMe storage virtualization solution with mediated pass-through," in 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, Jul. 2018, pp. 665–676. https://www.usenix.org/conference/atc18/presentation/peng
[38] L. of the io_uring, "io_uring_setup," June 2020, https://unixism.net/loti/ref-iouring/io_uring_setup.html.
[39] Y. Won, J. Jung, G. Choi, J. Oh, S. Son, J. Hwang, and S. Cho, "Barrier-enabled IO stack for flash storage," in Proceedings of the 16th USENIX Conference on File and Storage Technologies, ser. FAST'18. USA: USENIX Association, 2018, pp. 211–226.
[40] T. kernel development community, "Multi-queue block IO queueing mechanism (blk-mq)," https://www.kernel.org/doc/html/latest/block/blk-mq.html#multi-queue-block-io-queueing-mechanism-blk-mq.
[41] S. Roesch, "Re: [PATCH v7 00/15] io-uring/xfs: support async buffered writes," June 2022, https://lore.kernel.org/linux-mm/[email protected]/.
[42] P. Administrator, "Linux 5.20 to support async buffered writes for XFS + io_uring for big performance boost," June 2022, https://www.phoronix.com/forums/forum/software/general-linux-open-source/1330236-linux-5-20-to-support-async-buffered-writes-for-xfs-io-uring-for-big-performance-boost.
[43] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC '10. New York, NY, USA: ACM, 2010, pp. 143–154.
[44] B. Chen, "200 lines of code to rewrite the 600'000 lines RocksDB into a coroutine program," December 2022, https://github.com/facebook/rocksdb/issues/11017.
[45] Y. Kang, X. Huang, S. Song, L. Zhang, J. Qiao, C. Wang, J. Wang, and J. Feinauer, "Separation or not: On handing out-of-order time-series data in leveled LSM-tree," in 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022, pp. 3340–3352.
[46] X. Wang, P. Jin, B. Hua, H. Long, and W. Huang, "Reducing write amplification of LSM-tree with block-grained compaction," in 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022, pp. 3119–3131.
[47] Y. Zhong, Z. Shen, Z. Yu, and J. Shu, "Redesigning high-performance LSM-based key-value stores with persistent CPU caches," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023, pp. 1098–1111.
[48] Y.-S. Chang, Y. Hsiao, T.-C. Lin, C.-W. Tsao, C.-F. Wu, Y.-H. Chang, H.-S. Ko, and Y.-F. Chen, "Determinizing crash behavior with a verified snapshot-consistent flash translation layer," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, Nov. 2020, pp. 81–97. https://www.usenix.org/conference/osdi20/presentation/chang
[49] H. Li, M. L. Putra, R. Shi, X. Lin, G. R. Ganger, and H. S. Gunawi, "LODA: A host/device co-design for strong predictability contract on modern flash storage," in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, ser. SOSP '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 263–279. https://doi.org/10.1145/3477132.3483573
[50] J. Park and Y. I. Eom, "FragPicker: A new defragmentation tool for modern storage devices," in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, ser. SOSP '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 280–294. https://doi.org/10.1145/3477132.3483593
[51] F. Wu, M.-H. Yang, B. Zhang, and D. H. Du, "AC-Key: Adaptive caching for LSM-based key-value stores," in 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, July 2020, pp. 603–615. https://www.usenix.org/conference/atc20/presentation/wu-fenggang
[52] W. Zhong, C. Chen, X. Wu, and S. Jiang, "REMIX: Efficient range query for LSM-trees," in 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, Feb. 2021, pp. 51–64. https://www.usenix.org/conference/fast21/presentation/zhong
[53] S. Sarkar, N. Dayan, and M. Athanassoulis, "The LSM design space and its read optimizations," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023, pp. 3578–3584.
[54] J. Liang and Y. Chai, "CruiseDB: An LSM-tree key-value store with both better tail throughput and tail latency," in 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, pp. 1032–1043.
[55] D. Teng, L. Guo, R. Lee, F. Chen, S. Ma, Y. Zhang, and X. Zhang, "LSbM-tree: Re-enabling buffer caching in data management for mixed reads and writes," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 68–79.
[56] P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham, "PebblesDB: Building key-value stores using fragmented log-structured merge trees," in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 497–514. https://doi.org/10.1145/3132747.3132765
[57] H. Huang and S. Ghandeharizadeh, "Nova-LSM: A distributed, component-based LSM-tree key-value store," in Proceedings of the 2021 International Conference on Management of Data, ser. SIGMOD/PODS '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 749–763. https://doi.org/10.1145/3448016.3457297
[58] T. Zhang, J. Wang, X. Cheng, H. Xu, N. Yu, G. Huang, T. Zhang, D. He, F. Li, W. Cao, Z. Huang, and J. Sun, "FPGA-accelerated compactions for LSM-based key-value store," in 18th USENIX Conference on File and Storage Technologies (FAST 20). Santa Clara, CA: USENIX Association, February 2020, pp. 225–237. https://www.usenix.org/conference/fast20/presentation/zhang-teng
[59] X. Sun, J. Yu, Z. Zhou, and C. J. Xue, "FPGA-based compaction engine for accelerating LSM-tree key-value stores," in 2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, pp. 1261–1272.
[60] Y. Zhang, H. Hu, X. Zhou, E. Xie, H. Ren, and L. Jin, "PM-Blade: A persistent memory augmented LSM-tree storage for database," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023, pp. 3363–3375.
[61] L. Benson, H. Makait, and T. Rabl, "Viper: An efficient hybrid PMem-DRAM key-value store," Proc. VLDB Endow., vol. 14, no. 9, pp. 1544–1556, May 2021. https://doi.org/10.14778/3461535.3461543
[62] P. Alcorn, "Intel kills Optane memory business, pays $559 million inventory write-off," August 2022, https://www.tomshardware.com/news/intel-kills-optane-memory-business-for-good.
[63] S. Zhong, C. Ye, G. Hu, S. Qu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, and M. Swift, "MadFS: Per-file virtualization for userspace persistent memory filesystems," in Proceedings of the 21st USENIX Conference on File and Storage Technologies, ser. FAST'23. USA: USENIX Association, Feb. 2023, pp. 1–15.