AisLSM: Revolutionizing the Compaction with Asynchronous I/Os for LSM-tree

Yanpeng Hu, Li Zhu, Lei Jia, and Chundong Wang
ShanghaiTech University, Shanghai, China

arXiv:2307.16693v1 [cs.DB] 31 Jul 2023

Abstract—The log-structured merge tree (LSM-tree) is widely employed to build key-value (KV) stores. LSM-tree organizes multiple levels in memory and on disk. The compaction of LSM-tree, which is used to redeploy KV pairs between on-disk levels in the form of SST files, severely stalls its foreground service. We overhaul and analyze the procedure of compaction. Writing and persisting files with fsyncs for compacted KV pairs are time-consuming and, more important, occur synchronously on the critical path of compaction. The user-space compaction thread of LSM-tree stays waiting for completion signals from a kernel-space thread that is processing file write and fsync I/Os. We accordingly design a new LSM-tree variant named AisLSM with an asynchronous I/O model. In short, AisLSM conducts asynchronous writes and fsyncs for SST files generated in a compaction and overlaps CPU computations with disk I/Os for consecutive compactions. AisLSM tracks the generation dependency between input and output files for each compaction and utilizes a deferred check-up strategy to ensure the durability of compacted KV pairs. We prototype AisLSM with RocksDB and io_uring. Experiments show that AisLSM boosts the performance of RocksDB by up to 2.14×, without losing data accessibility and consistency. It also outperforms state-of-the-art LSM-tree variants with significantly higher throughput and lower tail latency.

Index Terms—LSM-tree, Asynchronous I/O, Compaction

Y. Hu and L. Zhu contribute equally to this work. C. Wang is the corresponding author (cd [email protected]).
¹ Researchers also use 'flush' to describe a program calling fsync to write down a file, which we refer to as 'persist' for distinguishing in this paper.

I. INTRODUCTION

The log-structured merge tree (LSM-tree) gains wide popularity in building key-value (KV) stores [1]–[11]. LSM-tree appends arriving KV pairs to an on-disk log and inserts them into in-memory memtables, each of which is a structure (e.g., skiplist) ordered by keys. Once a memtable becomes full according to a preset size limit, LSM-tree makes it immutable. LSM-tree transforms an immutable memtable to a sorted string table (SST) file and puts it onto the tree's top level on disk, i.e., L0. This is referred to as flush¹. LSM-tree defines a capacity limit for each on-disk Ln (n ≥ 0) to hold a number of SST files. The limit of Ln+1 is usually ten times that of Ln. When Ln is full, LSM-tree initiates a compaction, in which LSM-tree merge-sorts KV pairs residing in selected Ln and Ln+1 SST files that have key ranges overlapped (①), writes sorted KV pairs into a new Ln+1 SST file (②), and persists the file with fsync (③). LSM-tree repeats ① to ③ until all KV pairs are persisted in output SST files. Then it deletes input SST files and completes the compaction.

The foreground operations of logging and insertion with the memtable make LSM-tree appealing for write-intensive workloads [12]–[14]. LSM-tree intentionally does flushes and compactions in the background. However, if a few memtables are waiting for flush or many SST files are pending compaction, LSM-tree stalls its foreground service [15]–[18]. Such stalls incur a significant performance penalty [12, 16, 19]. We have taken RocksDB [3] for a quantitative study. We conduct experiments by running it on an NVMe solid-state drive (SSD). It spends overall 1,399.9 seconds in finishing Put requests for 80GB with 16B/1KB per KV pair and four foreground threads. However, it stalls for 1,179.1 seconds, i.e., 84.2% of the total time. By forcefully disabling compactions, the throughput of RocksDB increases by 5.7×. This substantial leap motivates us to shorten the critical path of compaction for LSM-tree.

As mentioned, a compaction is composed of three repeated actions, i.e., CPU computation (mainly for merge-sort), file write, and fsync. LSM-tree synchronously proceeds with them [14, 17]. Our study shows that CPU computations, file writes, and fsyncs contribute 47.7%, 6.3%, and 46.0% of the time cost per compaction on average, respectively. In each compaction, RocksDB's user thread runs on a CPU core for computations to prepare sorted KV pairs and then keeps waiting for the completion of file write and fsync which, however, are conducted by a kernel thread. If we avoid waiting on the critical path of compaction but asynchronously handle I/Os, the performance of LSM-tree should be accelerated. Assuming that a kernel thread is processing I/Os for the current compaction job, LSM-tree's user thread can simultaneously compute for the next compaction job. This summarizes our aim in this paper, i.e., orchestrating CPU computations (resp. user thread) and disk I/Os (resp. kernel thread) to revolutionize compaction and optimize LSM-tree.

Today's hardware and software jointly provide a promising opportunity for us to do so. For hardware, compared to a conventional hard disk drive (HDD) or SATA SSD, NVMe SSD enables higher processing speed [20, 21]. The aforementioned percentages for file write and fsync I/Os with NVMe SSD (6.3% + 46.0%) roughly match that of CPU computations (47.7%), such that forthcoming computations are unlikely to be blocked by uncompleted asynchronous I/Os that have been scheduled but not finished yet. As to software, researchers have subsumed legacy Linux native AIO with the io_uring framework [22]–[24]. The io_uring works in the kernel space with high efficiency and capacious interfaces for asynchronous I/Os.

Not much attention has been paid to the impact of synchronous I/Os on LSM-tree. Kim et al. [25] noticed that persisting data in a batched fsync is more efficient than doing so for multiple batches.
so for multiple batches. They designed BoLT that aggregates PUT
SSTableholding
compacted KV pairs in a huge SST file for one fsync. X, Y
keys range in X, Y.
However, BoLT still retains fsyncs on the critical path of
compaction. Dang et al. [26] proposed NobLSM that partly re- Immutable
Memtable
Flush Memtable
places fsyncs with the periodical commits of Ext4. Though, Memory
NobLSM lacks portability as it relies on Ext4 mounted in Persistent Storage
the data=ordered mode. Worse, it demands handcrafted WAL L0 4, 69 23, 102
Compaction
customization in the kernel of operating system (OS). CURRENT
L1 5, 89 97, 201
When leveraging asynchronous I/Os to revolutionize the MANIFEST
LOG … … Compaction
compaction, we shall neither keep fsync on the critical path
Ln 13, 59 76, 102 123, 256
nor incur changes to system software. In addition, as conven-
tional LSM-tree employs synchronous I/Os, all compacted KV Fig. 1: The Architecture of RocksDB
pairs become both visible for reading and durable for recovery
at the end of a compaction. In other words, these KV pairs
dling [19, 27], key-value separation [19, 28, 29], and concur-
simultaneously gain the visibility and durability. Whereas,
rent or pipelined compactions [14, 17, 30] proposed in previous
asynchronous I/Os introduce uncertainty to such properties.
works. The shortened critical path of compaction AisLSM
With foregoing observations and concerns, we propose an brings about complements those techniques. We prototype
LSM-tree variant named AisLSM. AisLSM employs asyn- AisLSM by modifying RocksDB with io uring. Experiments
chronous file write and fsync for each new output SST file confirm that AisLSM dramatically boosts the performance of
that a compaction generates from existing input SST files, RocksDB, with up to 2.14× throughput. It also significantly
thereby removing synchronous I/Os from the critical path. outperforms state-of-the-art designs, including ADOC [16],
It calls io uring interfaces to do so, without changing the TRIAD [19], Rocks-bu [31], SILK [12], PhotonDB [32],
OS’s kernel, file system, or storage device. The completion and NobLSM [26]. For example, in a write-intensive test,
of asynchronously writing an output SST file makes the file’s the tail latency of AisLSM is 48.8%, 51.9%, 99.0%, 59.8%,
KV pairs steadily accessible in the OS’s page cache or device’s 16.9%, 50.4%, and 61.4% less than that of RocksDB, ADOC,
disk cache, so the visibility of KV pairs is enabled. The output TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM, respec-
file may not be durable yet. However, provided that any input tively. Such a substantial gap justifies the efficacy of AisLSM’s
file in which KV pairs have stayed is durable, the durability asynchronous I/O model for compaction. We also verify that
of KV pairs is still guaranteed. AisLSM retains durable input AisLSM has no loss of accessibility or recoverability for data.
files to protect the durability of compacted KV pairs until it
The remainder of this paper is organized as follows. In
perceives the durability of output files. Concretely, AisLSM
Section II we present the background of LSM-tree and asyn-
decouples the durability from visibility for compacted KV
chronous I/Os. We brief our motivational study in Section III.
pairs. The main contributions of this paper are as follows.
We detail the design and implementation of AisLSM in
• We analytically overhaul the compaction procedure for Sections IV and V, respectively. We quantitatively evaluate
LSM-tree. We quantitatively reveal the significant impact AisLSM in Section VI. We compare AisLSM to related works
of synchronous writes and fsyncs employed in each in Section VII and conclude the paper in Section VIII.
compaction on the performance of LSM-tree.
• We revolutionize the compaction procedure with asyn- II. BACKGROUND
chronous file writes and fsyncs. With a kernel-space
thread simultaneously doing asynchronous disk I/Os in A. LSM-tree
the background, AisLSM’s user-space thread swiftly ini- RocksDB is a typical LSM-tree variant [3]. We take it
tiates the next compaction job and starts computations. to illustrate the architecture and operations of LSM-tree. As
The critical path of compaction is substantially shortened. shown by Figure 1, RocksDB is made of in-memory and
• We guarantee the durability of KV pairs. We retain the on-disk components, resembling a tiered tree-like structure.
fsync on every L0 SST file flushed from a memtable to RocksDB uses the skiplist ordered by keys as in-memory
build a solid foundation for durability, as SST files placed memtable. The memtable functions as a buffer. Once a Put
at lower levels than L0 can be viewed as descendants request arrives with KV pair, RocksDB inserts the KV pair
of L0 SST files. For each compaction, we track the into a mutable memtable after appending it to the tail of
generation dependency between input and output SST on-disk write-ahead-log (WAL). RocksDB sets a size limit
files. Input ones are not instantly deleted. We defer the for memtable (64MB by default). A fully filled memtable is
check-up of durability for output SST files of a past made immutable to serve search requests only and RocksDB
compaction until any one of them participates as input creates a new mutable one. By default, RocksDB maintains a
in the current compaction. If they are durable, we delete mutable memtable and an immutable one at runtime. It keeps
input files from which they were generated. a background user-space thread that transforms and flushes the
AisLSM is orthogonal to techniques like hot/cold data han- immutable memtable to be an SST file.
On the completion of flush, RocksDB persists the SST file on the top on-disk level, i.e., L0, via fsync and deletes the corresponding WAL. RocksDB defines that on-disk levels have exponentially increasing capacity limits. The limit of Ln+1 is ten times that of Ln (n ≥ 0). RocksDB employs compactions to control each level's capacity. Among all levels, RocksDB selects one Ln that maximally exceeds the level's capacity limit for compaction. It firstly loads KV pairs from selected Ln and Ln+1 SST files that have key ranges overlapped. It merge-sorts, writes, and persists them in new Ln+1 SST files. After deleting input parental SST files, KV pairs are redeployed into output offspring SST files at Ln+1. We overhaul the compaction with quantitative analysis in Section III. As shown in Figure 1, since L0 SST files are transformed from memtables that have directly received users' KV pairs in the foreground over time, L0 SST files naturally have key ranges overlapped in between. Compactions consequently make SST files at any lower level below L0 have no such kind of overlaps.

Fig. 2: A study with RocksDB's compaction and flush. (a) RocksDB with and without compaction as well as flush; (b) RocksDB and NobLSM running on XFS and Ext4.

B. Asynchronous I/Os

The io_uring. The Linux kernel has had native support for asynchronous I/O (AIO) for years. However, the AIO framework exhibits a few defects. Linus Torvalds once claimed that it is a horrible ad-hoc design [33]. For example, it only works with the direct I/O mode. It may also show non-deterministic behavior that ends up blocking under some circumstances [22]. In Linux kernel 5.1, Jens Axboe positioned the io_uring framework to subsume AIO [22]–[24]. The io_uring provides low-latency and feature-rich interfaces for programmers who need asynchronous I/Os and prefer the kernel to do so. This is a stark contrast to SPDK, which functions as a user-mode library driver with a user-space file system [20, 21, 24, 34]. Using the io_uring, applications benefit from running on top of a mature kernel-space file system in both buffered and direct I/O modes, which entitles io_uring higher flexibility and viability.

RocksDB and io_uring. RocksDB's developers have already considered io_uring to speed up its MultiGet function. When RocksDB receives a request to read multiple KV pairs for a user, it can use io_uring to submit the request to the Linux kernel. The Linux kernel asynchronously loads data from multiple SST files where necessary. Once loading is finished, the Linux kernel sends a completion signal to RocksDB. RocksDB composes and returns the result to the user. Though, the mainline of RocksDB makes no use of io_uring to reshape write I/Os. Yet some practitioners have tried to modify RocksDB with io_uring [31, 32]. Nonetheless, their modifications marginally boost performance for RocksDB, because the way they utilize io_uring does not locate or resolve the actual bottleneck on the critical path of compaction (see Section VI).

The use of NVMe SSD. NVMe SSDs are increasingly deployed for storage. NVMe SSD can be used differently from legacy devices. For example, the conventional interrupt-driven I/O model is generally less efficient than polling-driven I/Os for NVMe SSD, because the latency induced by interrupts has become prohibitive compared to NVMe SSD's raw write/read latency [21, 35]–[37]. I/O polling eliminates the need for interrupt handling, thereby minimizing the overhead of context switches on the I/O path. As the io_uring framework has been poised in the Linux kernel, there are explorations on how to efficiently apply io_uring with NVMe SSD for high performance [38].
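To make the io_uring usage pattern concrete, the following minimal sketch (our illustration in C++ against liburing, not code from RocksDB or AisLSM; the file name is a placeholder) queues an asynchronous write, reaps its completion, and then queues an asynchronous fsync for the same file.

// Minimal liburing sketch: asynchronous write followed by asynchronous fsync.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main() {
  io_uring ring;
  if (io_uring_queue_init(8, &ring, 0) != 0) return 1;   // 8-entry submission/completion queue

  int fd = open("demo.sst", O_CREAT | O_WRONLY | O_TRUNC, 0644);  // placeholder file name
  if (fd < 0) return 1;
  static char buf[4096];
  memset(buf, 'k', sizeof(buf));

  // Queue an asynchronous write; the kernel carries it out in the background.
  io_uring_sqe* sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
  io_uring_submit(&ring);

  // Reap the write completion before asking for persistence.
  io_uring_cqe* cqe;
  io_uring_wait_cqe(&ring, &cqe);
  if (cqe->res < 0) fprintf(stderr, "write failed: %d\n", cqe->res);
  io_uring_cqe_seen(&ring, cqe);

  // Queue an asynchronous fsync; its completion can be collected later.
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_fsync(sqe, fd, 0);
  io_uring_submit(&ring);

  io_uring_wait_cqe(&ring, &cqe);   // this demo simply waits; AisLSM instead defers the check-up
  io_uring_cqe_seen(&ring, cqe);

  close(fd);
  io_uring_queue_exit(&ring);
  return 0;
}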
III. A MOTIVATIONAL STUDY

We take RocksDB for a quantitative study. We set it up on a machine with an Intel Core i9-9900K CPU, 64GB DRAM as main memory, and three storage devices (HDD, SATA SSD, and NVMe SSD). More details of the machine can be found in Section VI. With the study, we aim to figure out the impact of compactions on LSM-tree's performance and further locate the realistic bottleneck on the critical path of compaction.

O1: Compactions cause severe stalls to suspend the foreground service of LSM-tree. By default, RocksDB may stall due to a lot of SST files waiting for compaction or many memtables pending flushes. We have done a test by leveraging RocksDB's embedded db_bench to put down overall 80GB of data issued by four foreground threads emulating four users, with 1KB per KV pair under the fillrandom workload. On NVMe SSD, this test ran for 1,399.9 seconds while stalls happened for 1,179.1 seconds. We have measured the time spent to fill up a memtable without and with compactions. The presence of compactions largely increased the user-facing latency of inserting KV pairs into the memtable, by 12.4×. In order to separately analyze the impact of compaction from that of flush, we 1) forcefully disabled compactions by setting a configuration parameter called disable_auto_compactions to 'True' for RocksDB and 2) kept an exceptionally large number of memtables at runtime to avoid the occurrence of flushes. As shown by Figure 2a, without compaction, the throughput of RocksDB increased by 5.1×, 5.5×, and 5.7× on HDD, SATA and NVMe SSDs, respectively. With flush also disabled, the throughput of RocksDB further increased by 15.3%, 21.9%, and 11.7%, respectively. As a result, we focus on revolutionizing the procedure of compaction while keeping the original flush mechanism.
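The study above drives RocksDB through db_bench, but the two knobs it mentions are also exposed by RocksDB's C++ options. The sketch below is our own rough approximation of such a configuration, with an illustrative path and sizes rather than the study's exact settings.

// Sketch (our approximation, not the study's exact setup): disable automatic
// compactions and enlarge the memtable budget so that flushes rarely block
// the foreground path.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.disable_auto_compactions = true;  // parameter named in the study
  options.max_write_buffer_number = 64;     // keep many memtables resident
  options.write_buffer_size = 64 << 20;     // 64MB memtable, RocksDB's default

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);  // illustrative path
  if (!s.ok()) return 1;
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  delete db;
  return s.ok() ? 0 : 1;
}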
O2: In a compaction, RocksDB spends a long time waiting for the completion of synchronous write and fsync I/Os that a kernel thread is working on. Figure 3 illustrates the procedure of compaction we overhaul with RocksDB. RocksDB employs a background user-space thread to proceed with a compaction job. RocksDB firstly preprocesses involved KV pairs by reading them from input SST files (① in Figure 3).
It builds an iterator over KV pairs to locate the one with the smallest key (②). It places this KV pair in a buffer (1MB by default) and moves to the next smallest key. Once it fills up the buffer (③), RocksDB writes those KV pairs to an output SST file (④). RocksDB reuses this buffer and keeps writing the file until the file's size reaches the limit of an SST file (64MB by default). Next, RocksDB calls fsync to persist this output SST file (⑤). It repeats the foregoing actions ② to ⑤ with the next output SST files until all involved KV pairs are written and persisted into files. Then it deletes input SST files (⑥). Actions ② to ⑤ are synchronously and repeatedly occurring on the critical path of compaction. Though, file writes and fsyncs are done by a kernel thread. In the meantime, RocksDB's compaction thread stays waiting for the completion signals of these I/Os without doing anything meaningful. We tracked the respective percentages for computation actions (①②③⑥), file writes (④), and fsyncs (⑤) in the time cost per compaction. As shown in Figure 4, with NVMe SSD, they take 47.7%, 6.3%, and 46.0%, respectively, on average.

Fig. 3: An illustration of overhauled compaction that generates N SST files. (① Preprocess; ② Iterate KV pairs; ③ Place a KV pair into the buffer; ④ Write buffered data to the file; ⑤ fsync and close the file; ⑥ Postprocess.)

Fig. 4: The breakdown of compaction (computation / write / fsync): HDD 13.5% / 3.6% / 82.9%; SATA SSD 36.4% / 6.1% / 57.5%; NVMe SSD 47.7% / 6.3% / 46.0%.

Fig. 5: An illustration of compaction flow that generates descendant SST files.

O3: The state-of-the-art LSM-tree variant that targets reducing fsyncs is suboptimal, while it is exploitable to reschedule computations and I/Os for efficient and smooth compactions. Recently, Dang et al. [26] have considered removing synchronous fsyncs for LSM-tree and proposed NobLSM. In short, as a journaling file system, Ext4 asynchronously persists file data in a periodical commit fashion, which NobLSM leverages to replace fsync in a compaction. However, NobLSM particularly relies on the Ext4 file system and demands handcrafted modifications in the Linux kernel to track the completion of asynchronous commit for SST files [26]. Moreover, by implementing NobLSM with RocksDB, we find that NobLSM does not significantly boost the performance of RocksDB. Figure 2b compares NobLSM against RocksDB running on Ext4 and XFS upon serving the aforementioned fillrandom workload with three devices. The performance of RocksDB remains stable on Ext4 or XFS. However, leveraging the asynchronous commit of Ext4 journaling, NobLSM only yields 1.48× the throughput of RocksDB on NVMe SSD. With Figure 4 and NVMe SSD, assuming that fsync were removed from compaction, the expected boost would be 1.85× (i.e., 1/(1 − 46.0%)). Our analysis shows that the inferior achievement of NobLSM is mainly due to the cost incurred by maintaining kernel-space structures and handling the customized system calls added to utilize the asynchronous commit of Ext4 journaling. In all, the state-of-the-art NobLSM is suboptimal.
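The 1.85× expectation follows from the NVMe SSD breakdown in Figure 4 by a simple Amdahl-style bound: if the fsync share of per-compaction time were taken off the critical path, the achievable speedup is limited by the remaining fraction.

\[
\text{Expected boost} \;=\; \frac{1}{1 - f_{\mathrm{fsync}}} \;=\; \frac{1}{1 - 0.460} \;\approx\; 1.85\times,
\]

where f_fsync = 46.0% is the fsync share of per-compaction time on NVMe SSD, to be compared with NobLSM's measured 1.48×.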
With vanilla RocksDB, CPU computations are firstly done by RocksDB's user-space thread while later file writes and fsyncs are conducted by a kernel thread with storage. Besides NVMe SSD, we have also measured the percentages of the aforementioned three parts with HDD and SATA SSD. As shown in Figure 4, the speeds of the two legacy devices are slower than the CPU's computing speed and I/Os contribute 86.5% and 63.6%, respectively. Assuming that we reschedule I/Os to be asynchronous with the slower HDD or SATA SSD, neither has an access speed that is comparable against the computation power of the CPU. A mass of I/Os are likely to aggregate and in turn block subsequent compactions. However, on our platform, the speeds of NVMe SSD and the CPU approximately achieve a balance. RocksDB does compactions in a best-effort fashion and timely moves to the next compaction job after finishing the current one. When the NVMe SSD is serving the prior compaction's I/Os with a kernel thread, the CPU can work on the current compaction's computation with RocksDB's user thread. Concretely, balanced CPU and storage are promising to make a pipelined compaction stream.

O4: The durability of output SST files generated in a compaction can be backed by input ones and they shall be made durable before being used as input in a later compaction. A compaction only redeploys KV pairs between input and output SST files, without producing new ones. As long as input SST files remain durable, the durability of compacted KV pairs is not impaired. When output SST files a compaction generates are to be used as input for a later compaction, they must be durable to back the durability of newer output SST files. Figure 5 shows two related compactions over time. At T2, L1^(4) and L1^(5) shall be durable. Otherwise, L2^(7), L2^(8), and L2^(9) may have flawed durability. If we look back at T2, given durable L1^(3), L1^(4), and L1^(5), the input SST files used to generate them through Compaction 1 can be safely deleted.
By referring to O1 to O3, we aim to revolutionize compaction and optimize LSM-tree with asynchronous I/Os. The io_uring and NVMe SSD jointly provide us an opportunity to do so with software and hardware supports, respectively. Nonetheless, a few challenges emerge for us to consider. One is how to schedule CPU computations (resp. user threads) and disk I/Os (resp. kernel threads) at the inter- and intra-compaction dimensions, without incurring any loss to the rationality of LSM-tree. The other one is how to gain both visibility and durability for KV pairs regarding a revolutionized compaction procedure. With synchronous file writes and fsyncs, compacted KV pairs are both visible for access and durable on disk at the end of a compaction. Asynchronous fsyncs bring non-deterministic durability to them. O4 implies that we can retain the durability of compacted KV pairs by transiently keeping input SST files and postpone enforcing the durability of output SST files until they participate as input in a future compaction. This helps to ensure the durability and visibility for KV pairs.

IV. DESIGN OF AISLSM

A. Overview

AisLSM separates CPU computations from disk I/Os in each compaction. It conducts I/Os asynchronously in the background, thereby reducing the latency of the intra-compaction critical path (Section IV-B). Its compaction thread mainly handles CPU computations and no longer keeps waiting for the I/O completion that a kernel thread is working on. This entails much shorter user-facing latency for serving foreground requests. At the inter-compaction level, the shortened compaction enables AisLSM to simultaneously proceed with CPU computations and disk I/Os that respectively belong to consecutive compactions at runtime (Section IV-C). Overall, AisLSM gains both high throughput and short latency.

AisLSM follows RocksDB's policy to select overfilled levels and SST files with overlapped key ranges for compaction. Moreover, it does not radically change the tiered structure of LSM-tree or underlying system software. Because of asynchronous writes and fsyncs, AisLSM decouples the durability from visibility for compacted KV pairs. It tracks the generation dependency between parental input SST files and offspring output ones in order to persist and delete them, respectively, in a deferred fashion (Section IV-D). By doing so, AisLSM incurs no loss of data visibility and consistency.

B. AisLSM's Asynchronous I/O Model

The procedure of revolutionized compaction. Figure 6 shows the flowchart of AisLSM's steps in processing a compaction. AisLSM follows RocksDB to select an overfilled level as well as input SST files for compaction. It chooses one level Ln that maximally exceeds Ln's respective capacity limit as the victim (n ≥ 0). AisLSM's compaction thread preprocesses input SST files by loading KV pairs stored in them (① in Figure 6). It makes an iterator over input files to sort KV pairs (②). It places sorted KV pairs one by one in ascending order of keys into a user-space buffer (③). When the buffer is fully filled, AisLSM transfers data to a kernel thread. It initiates an asynchronous write targeting an output SST file with the pointer of the buffer (④). Once the size of transferred data approaches the preset size limit for an SST file, AisLSM submits a compound asynchronous write request for the entire file and then moves to the next output SST file for filling. The kernel thread handles file write I/Os (⑤) while the user-space compaction thread continues without blocking. AisLSM repeats these actions until all KV pairs are transferred and submitted for asynchronous writes. Then it waits for the completion of all asynchronous file write I/Os (⑥). As AisLSM overlaps CPU computations and disk I/Os for consecutive SST files, it can receive timely completion signals from the underlying file system. The completion signals mean that all files have been written to the OS's buffer cache or the storage device's disk cache [39]. In spite of being not steadily durable on disk, the file system ensures that these output SST files are visible.

Fig. 6: The flowchart of AisLSM's compaction. (① Preprocess; ② Iterate KV pairs; ③ Place a KV pair into the buffer; ④ Call async write to write buffered data to the file; ⑤ Kernel thread writes data to the file; ⑥ Compaction thread waits for completion of async writes; ⑦ Call async fsync to persist files; ⑧ Kernel thread persists files; ⑨ Postprocess.)

AisLSM then launches a compound asynchronous fsync request for all of them (⑦⑧). Such a grouped submission for persisting multiple files in a batch differs from conventional compaction, which calls fsync every time a new SST file is fully filled. More important, AisLSM does not synchronously wait for the completion of the asynchronous fsync but returns after a short postprocess that concludes the compaction (⑨), e.g., recording the generation dependency between input and output SST files. As shown in Figure 6, I/Os are offloaded to a kernel thread while AisLSM's compaction thread mainly focuses on CPU computations. A comparison with Figure 3 conveys that the critical path of AisLSM's compaction is significantly shortened.
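To illustrate the I/O pattern just described, the sketch below (a hypothetical helper of ours, not AisLSM's source code) submits the buffered writes of the output files, waits only for the write completions so that the files become visible, and then fires one grouped asynchronous fsync for the whole batch without waiting for it.

// Sketch of the compaction-side I/O pattern: async writes per output SST
// file, wait for write completions only, then one batched async fsync.
// Assumes the ring was created with enough SQ entries for the whole batch
// (io_uring_get_sqe() would otherwise return nullptr).
#include <liburing.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <vector>

struct OutputFile { int fd; std::vector<iovec> chunks; };  // filled user-space buffers

void FinishCompactionIO(io_uring* ring, std::vector<OutputFile>& outputs) {
  unsigned writes = 0;
  for (auto& f : outputs) {
    off_t off = 0;
    for (auto& iov : f.chunks) {                    // one SQE per buffer
      io_uring_sqe* sqe = io_uring_get_sqe(ring);
      io_uring_prep_writev(sqe, f.fd, &iov, 1, off);
      off += iov.iov_len;
      ++writes;
    }
  }
  io_uring_submit(ring);                            // kernel handles the writes

  // Wait for the write completions: the output files become visible in the
  // page cache (or device cache) even though they are not yet durable.
  for (unsigned i = 0; i < writes; ++i) {
    io_uring_cqe* cqe;
    io_uring_wait_cqe(ring, &cqe);
    io_uring_cqe_seen(ring, cqe);
  }

  // Grouped asynchronous fsync for the whole batch; do NOT wait here.
  for (auto& f : outputs) {
    io_uring_sqe* sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, f.fd, 0);
  }
  io_uring_submit(ring);
  // The completions are collected lazily by the deferred check-up
  // (Section IV-D) before these files serve as input to a later compaction.
}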
Asynchronous writes for synchronous accessibility. With ⑥ and ⑦, AisLSM synchronously waits for the completion of asynchronous write I/Os and submits a request for asynchronous fsync, respectively. The reason AisLSM does so is threefold. Firstly, the completion signals of write I/Os are essential and critical, as only on receiving completed file writes can AisLSM initiate the asynchronous fsync onto those files. Secondly, one goal of compaction is to sort and reorganize KV pairs that have been distributed across SST files, which is for ease of locating and accessing data. As mentioned, asynchronous file writes performed by the io_uring put data into the OS's buffer cache in the buffered I/O mode or the storage device's disk cache in the direct I/O mode. A completion signal returned by the file system in either mode makes written KV pairs visible and accessible but without deterministic durability.
Thus the synchronous wait enforces a deterministic visibility for compacted KV pairs. Thirdly, according to our study in Section III, all file writes cost about 6.3% of the total time with conventional compaction on NVMe SSD (see Figure 4). As we reschedule and overlap file write I/Os alongside CPU computations (②③ and ⑤ in Figure 6), the compaction thread is unlikely to wait a long time for asynchronous file writes.

Asynchronous fsyncs for deferred durability. In contrast to waiting for asynchronous writes, AisLSM does not stall to pend the completion of the asynchronous fsync (⑧ in Figure 6). It also does not immediately remove old input SST files like conventional compaction. AisLSM retains input SST files to back the durability of compacted KV pairs since new output SST files are not synchronously persisted. AisLSM defers the completion check-up of persisting new output SST files until they are chosen as input for a future compaction. At that moment, the old SST files from which they were generated could be safely discarded (see Section IV-D).

C. Inter-compaction Pipelining

Because AisLSM leaves computations only on the critical path of compaction, the compaction thread swiftly finishes the current job and is soon ready to take the next compaction job. When the next compaction's computations are ongoing on the CPU, the storage device is handling fsync for the previous compaction. In this way, AisLSM pipelines CPU computation and disk I/Os for consecutive compactions. The conventional compaction thread arranges computations and I/Os in a strictly serial sequence; hence, when I/Os are being processed, the CPU core stays idle in the meantime, and vice versa. AisLSM, however, neatly engages the CPU core in computing for a newer compaction while a kernel thread is simultaneously dealing with storage I/Os for the prior compaction. As a result, AisLSM embraces high utilizations for both CPU and storage.

Fig. 7: An example of AisLSM's compaction.

D. Deferred Deletion upon Asynchronous fsyncs

For a flush that transforms an immutable memtable to an L0 SST file, AisLSM synchronously calls fsync to persist the file. This fsync builds a solid foundation for the durability of KV pairs. AisLSM views L0 SST files as the ancestors of all SST files staying at lower levels to be generated in afterward compactions. Each compaction can be viewed as a process of generating offspring output Ln+1 SST files from parental input Ln and Ln+1 SST files (n ≥ 0). With regard to asynchronous fsyncs, AisLSM needs a time at which it checks up whether offspring SST files have been concretely persisted and parental SST files can be accordingly deleted. As LSM-tree steadily grows to more and more levels by compactions and each SST file has a high likelihood of participating in a future compaction, AisLSM does the check-up when every compaction is about to load KV pairs from input SST files.

AisLSM does the deferred check-up and deletion as follows. A compaction takes in a set of p input SST files as parents (p ≥ 1). Let us denote the set as P̂ (|P̂| = p). All members of P̂ used to be offspring SST files generated in previous q flushes or compactions (1 ≤ q ≤ p). AisLSM synchronously makes L0 SST files durable. For any other file that stays at Ln (n ≥ 1) and is to participate in the current compaction, AisLSM has tracked in which past compaction, say, ζ, the file was submitted for asynchronous fsync. As AisLSM calls asynchronous fsync for a compound batch of all SST files per compaction, it checks whether the entire batch for ζ is already persisted or not. If so, AisLSM safely deletes the SST files that had been used as input parents for ζ. Otherwise, AisLSM synchronously waits for the completion of the asynchronous fsync, which, as observed in our empirical tests, is very rare in practice. Then AisLSM deletes the parental SST files for ζ. Those input SST files for ζ thus can be viewed as the grandparents of the output SST files that the current compaction job is going to generate.

Let us reuse Figure 5 for illustration. At T1, L1^(3), L1^(4), and L1^(5) are not durable yet and AisLSM keeps L0^(0), L0^(1), and L1^(2) until T2. At T2, as L1^(4) and L1^(5) participate in Compaction 2 as input, AisLSM checks if the asynchronous fsync performed to the three output files Compaction 1 generated is completed or not. If so, it safely deletes L0^(0), L0^(1), and L1^(2).
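The deferred check-up only needs a small per-compaction record that ties the grouped fsync batch to the parental input files it protects. The sketch below is our simplified rendering of that idea, not AisLSM's data structures; it assumes, for brevity, that each compaction's output batch uses its own io_uring instance, whereas a real implementation would match completions by user data.

// Our simplified sketch of the deferred check-up: each finished compaction ζ
// remembers its parental input files and how many fsync completions of its
// output batch are still outstanding.
#include <liburing.h>
#include <unistd.h>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct CompactionRecord {
  std::vector<std::string> parents;   // input SST files kept on disk for now
  unsigned pending_fsyncs;            // outstanding CQEs of the output batch
};

std::unordered_map<uint64_t, CompactionRecord> g_records;  // keyed by ζ's id

// Called at the beginning of a new compaction for every input SST file that
// was produced by past compaction `zeta` (ring is ζ's dedicated ring here).
void DeferredCheckup(io_uring* ring, uint64_t zeta) {
  auto it = g_records.find(zeta);
  if (it == g_records.end()) return;            // already settled earlier

  // Drain any fsync completions that have arrived in the meantime.
  io_uring_cqe* cqe;
  while (it->second.pending_fsyncs > 0 && io_uring_peek_cqe(ring, &cqe) == 0) {
    io_uring_cqe_seen(ring, cqe);
    --it->second.pending_fsyncs;
  }
  // Rarely, the batch is not durable yet: fall back to a synchronous wait.
  while (it->second.pending_fsyncs > 0) {
    io_uring_wait_cqe(ring, &cqe);
    io_uring_cqe_seen(ring, cqe);
    --it->second.pending_fsyncs;
  }
  // ζ's output batch is durable, so its parents can be deleted (in RocksDB
  // terms, handed to the obsolete-file collection vector).
  for (const auto& parent : it->second.parents) {
    unlink(parent.c_str());
  }
  g_records.erase(it);
}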
V. IMPLEMENTATION

We leverage io_uring to implement AisLSM with RocksDB (Section V-A). We also comprehensively consider multiple aspects to optimize and enhance AisLSM (Section V-B).

A. Implementation of AisLSM

Overview. We take RocksDB to prototype AisLSM while the ideas of AisLSM can be applied to other LSM-tree variants. Doing asynchronous I/Os to revolutionize the procedure of compaction is orthogonal to other optimization techniques proposed to enhance LSM-tree. We mainly make use of the io_uring to implement AisLSM's asynchronous writes and fsyncs. Overall, the core functions of AisLSM add or change about 1,624 lines of code (LOC) in RocksDB version 7.10.0.

Compaction procedure. AisLSM follows RocksDB to 1) flush an immutable memtable as an L0 SST file, 2) maintain background threads for flush and compaction jobs, and 3) calculate scores to choose an overfilled level and input SST files with key ranges overlapped for compaction.
Figure 7 illustrates the eight main steps with which AisLSM handles a compaction. In these steps, AisLSM uses the io_uring's structures such as uring_queue to collect data for each SST file. It calls io_uring's interfaces such as io_uring_prep_fsync and io_uring_submit to prepare an asynchronous fsync and submit an I/O request, respectively.

Deferred check-up and deletion. At the beginning of a compaction, AisLSM checks if input parental SST files are already durable (⑧). If so, it removes grandparental SST files by inserting them into a collection vector that RocksDB has managed for the purpose of deleting files. The check-up does not cost much time. For example, in dealing with the aforementioned test of putting 80GB of KV pairs, AisLSM spent overall 749.1 seconds, out of which all check-up actions cost about 0.01 ms. Such a time cost is negligible.

Version and state tracking. RocksDB has a Manifest file with an in-memory Version to record the change of SST files (see Figure 1). As AisLSM decouples the visibility and durability for Ln SST files (n ≥ 1), it tracks and updates the state of each Ln SST file in the Manifest file and Version.

B. Optimizations and Complements

Concurrent compactions. AisLSM maintains background threads to do flush and compaction jobs. It makes the computations of the current compaction thread execute on a CPU core while a kernel thread of io_uring is simultaneously handling I/Os with storage for the prior compaction, without blocking the user-space compaction thread. Today multi-core and many-core CPUs have gathered momentum. NVMe SSD also contains numerous hardware queues for parallel I/O streams [21, 36] while the Linux kernel has blk-mq with multiple software queues [20, 40]. AisLSM supports multiple threads or instances concurrently conducting compaction jobs. Its asynchronous I/O model, when deployed on multiple compaction threads, can exploit the parallelism capabilities of both CPU and storage for effectual concurrent executions.

I/O polling. NVMe SSD embraces much shorter access latency than SATA SSDs. Many researchers have used the I/O polling mechanism, instead of conventional I/O interrupts, to interact with NVMe SSD [20, 21, 35]–[37]. In implementing AisLSM, we also consider I/O polling with NVMe SSD and io_uring. Note that the joint setup of I/O polling and io_uring currently works only when a file is opened with the O_DIRECT flag, i.e., in the direct I/O mode [38].
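A sketch of the polling-mode setup referred to above: IORING_SETUP_IOPOLL and O_DIRECT are the actual io_uring and Linux flags, while the surrounding code, file name, and block size are our own illustrative choices.

// Sketch: enable io_uring completion polling, which requires files opened
// with O_DIRECT and block-aligned buffers.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

bool SetupPolledRing(io_uring* ring, int* fd_out, const char* path) {
  // IORING_SETUP_IOPOLL asks the kernel to poll for completions instead of
  // relying on device interrupts.
  if (io_uring_queue_init(64, ring, IORING_SETUP_IOPOLL) < 0) return false;
  int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
  if (fd < 0) { io_uring_queue_exit(ring); return false; }
  *fd_out = fd;
  return true;
}

int main() {
  io_uring ring;
  int fd;
  if (!SetupPolledRing(&ring, &fd, "polled.sst")) return 1;   // placeholder file name

  // Direct I/O needs buffers aligned to the logical block size (4KB here).
  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
  memset(buf, 0, 4096);

  io_uring_sqe* sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, fd, buf, 4096, 0);
  io_uring_submit(&ring);

  io_uring_cqe* cqe;
  io_uring_wait_cqe(&ring, &cqe);   // with IOPOLL this polls the device for completion
  io_uring_cqe_seen(&ring, cqe);

  free(buf);
  close(fd);
  io_uring_queue_exit(&ring);
  return 0;
}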
Failed I/Os. I/O errors might take place over time. When an I/O operation fails, the conventional synchronous I/O model helps LSM-tree handle the error in a timely fashion. As AisLSM waits for the completion signals of all asynchronous file writes, any I/O error occurring at these writes can be swiftly detected and processed like with conventional LSM-tree. AisLSM defers the check-up of asynchronous fsync, so detecting and handling I/O errors for fsync are also postponed. However, even if an I/O error happens in persisting an SST file, the durability of KV pairs stored in the file is not impaired since AisLSM has retained parental SST files until the check-up. Searching KV pairs in this file is also unaffected since the file system accommodates KV pairs in the OS's buffer cache or the storage device's disk cache. AisLSM explicitly calls fsync for a retry to fix the I/O error. In the worst case, it regenerates and replaces that problematic file.

Outlier SST files. In unusual cases, some SST files, once generated in a compaction, hardly participate in subsequent compactions, because the key ranges they cover might not be frequently used (i.e., outliers). AisLSM still ensures the durability of such inactive outlier SST files. This is the other reason why AisLSM submits one request for all SST files generated in a compaction to schedule a compound asynchronous fsync. As long as any one of them is to be involved in a future compaction, AisLSM checks if the asynchronous fsync has been done for all relevant SST files. By doing so, AisLSM avoids overlooking outliers. This also helps to delete their parental SST files. In addition, there might be a very low likelihood that outliers form a batch and have no opportunity to be compacted again. AisLSM has tracked all SST files asynchronously persisted with io_uring. It schedules a specific check-up in off-peak hours for such outliers.

VI. EVALUATION

A. Evaluation Setup

Platform. The machine used for evaluation is an HP Z2 G4 workstation. It is with an Intel Core i9-9900K CPU and 64GB DRAM as main memory. The OS is Ubuntu 22.04.1 with Linux kernel 6.2.7 installed on an HDD (Western WD20EZWX-60F5KA0 in 2TB). There are two additional SSDs in the machine. One is a SATA SSD (Samsung 870 EVO in 1TB) and the other one is an NVMe SSD (SK Hynix HFS512GD9TNGL2A0A in 480GB). The latter is used as the main storage device to hold data for all LSM-tree variants throughout the following experiments. The compiler is GCC/G++ version 9.5.0. The version of io_uring (liburing) is 2.3. Because XFS and io_uring have been jointly optimized for efficient use of asynchronous I/Os [41, 42], we mainly use XFS except for NobLSM that needs a customized Ext4.

Benchmarks. One benchmark we use is the db_bench micro-benchmark built in RocksDB. It can synthesize continuous Put or Get requests in typical access patterns. The other one is the YCSB macro-benchmark emulating a suite of real-world workloads [43]. On finishing a workload, they report the throughput (MB/s) or execution time that we utilize as the metrics to measure and compare performances for LSM-trees.

Competitors. Besides the vanilla RocksDB [3], we choose a few state-of-the-art LSM-tree variants that represent different approaches researchers have explored to improve performance for LSM-tree. They are ADOC [16], TRIAD [19], Rocks-bu [31], SILK [12], PhotonDB [32] and NobLSM [26]. All of them except NobLSM are open-source, implemented with RocksDB. As to NobLSM built atop LevelDB, we implement it by modifying RocksDB and Ext4 for a fair comparison. Below we briefly summarize the characteristics of these variants.

ADOC. ADOC monitors the data flow among multiple components of LSM-tree. Accordingly, it adjusts the number of threads and the size of SSTable to control the processing rate and schedule the frequency of background jobs, thereby in turn controlling the data flow within LSM-tree to reduce stalls.
TRIAD. Firstly, TRIAD tries to separate hot KV pairs that are frequently updated from cold ones at the memtable. Secondly, TRIAD postpones a compaction until the overlap between key ranges of SST files aggregates to some extent. Thirdly, it makes use of the WAL to play the role of the L0 SST file, instead of writing the same KV pairs again. TRIAD attempts to alleviate write amplification with these techniques.

Rocks-bu. Rocks-bu is the other RocksDB variant that we have found using io_uring. A team of three practitioners made use of the batch I/O feature of io_uring to group I/O requests for RocksDB.

SILK. SILK focuses on I/O scheduling between insertions with the memtable, flushes, and compactions, mainly for shorter tail latency. It allocates more bandwidth to internal operations, i.e., flushes and compactions, when foreground service is not heavy. It gives higher priority to flushes and compactions at lower levels (e.g., L0 → L1). Moreover, it allows compactions at lower levels to preempt ones at higher levels.

PhotonDB. A group of Alibaba's researchers engineered PhotonDB by using io_uring and coroutines to subsume threads in RocksDB [44]. Although they have been aware of the asynchronous I/Os provided by io_uring, they focus on applying coroutines and io_uring to serve multiple clients in a network database. With a single client carrying one or multiple foreground threads, PhotonDB still waits for the completion of I/Os with io_uring in the conventional synchronous manner.

NobLSM. NobLSM employs the periodical asynchronous commit of Ext4 journaling to persist files and thus removes fsyncs from the critical path of compaction. It relies on Ext4 and demands handcrafted changes to the Linux kernel.

B. Micro-benchmark

We employ db_bench to issue four typical workloads. In particular, with each LSM-tree variant, we perform the following workloads in order: fillrandom (random insertion of KV pairs), overwrite (random update of KV pairs), readseq (sequential retrieval of KV pairs), and readrandom (random retrieval of KV pairs). For each workload, we fix an overall quantity of data volume and follow the default uniform distribution to generate requests. We designate four foreground threads. Each thread puts (resp. gets) 20GB of KV pairs for write (resp. read) requests. We set the key size as 16B while varying the value size to be 64B, 128B, 256B, 512B, 1KB, 2KB, and 4KB. The reason we use 80GB of data is twofold. Firstly, such a volume of data concretely entails continual compactions with massive CPU computations and disk I/Os over time. Secondly, the NVMe SSD we are running on has 480GB raw capacity. We must consider LSM-tree's write amplifications that can occupy multiple times of 80GB [13, 19, 27, 45]–[47]. We also take into account that, due to a composition of LSM-tree variants, workloads, value sizes, and rounds, overwhelming data may frequently trigger the SSD's internal modules like garbage collection and wear leveling that impact performance results [48]–[50]. We choose a volume of 80GB to alleviate such impact.

Figure 8a to Figure 8d capture the bandwidth of each LSM-tree variant with the four workloads. Let us analyze the results in three aspects. Firstly, using its asynchronous I/O model to revolutionize compaction, AisLSM significantly boosts the performance of RocksDB. For example, with fillrandom and seven increasing value sizes, the throughput of AisLSM is 1.3×, 1.4×, 1.6×, 1.8×, 2.0×, 2.1×, and 2.14× that of RocksDB. As AisLSM offloads file write and fsync I/Os to kernel threads and makes the compaction thread focus on CPU computations, it is able to quickly complete a compaction job and soon initiate the next one, thereby gaining high throughput. We have further recorded the stall time for AisLSM. As mentioned in Section III, RocksDB spent 1,399.9 seconds in finishing fillrandom with 1KB values while stalls lasted for 1,179.1 seconds. For AisLSM, the total execution time and stall time are 749.1 and 618.1 seconds, respectively. This comparison further justifies that the removal of synchronous I/Os from the critical path of compaction effectually reduces the stall time and in turn boosts performance for LSM-tree. In addition, given a fixed size of SST file (64MB), AisLSM achieves higher performance with larger values. A larger value size means fewer KV pairs to be fitted in an SST file. LSM-tree starts off a compaction regarding the capacity of each level, or, measured in the practical unit, the number of SST files. With the same number of SST files for compaction, fewer KV pairs evidently need less time on merge-sorting keys and other CPU computations. As a result, AisLSM finishes a compaction at a more prompt pace for larger KV pairs.

Secondly, AisLSM significantly outperforms state-of-the-art prior works. For example, with the overwrite workload and 1KB values, the throughput of AisLSM is 1.8×, 2.0×, 2.9×, 1.9×, 2.3×, 1.6×, and 1.4× that of original RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB and NobLSM, respectively. These LSM-tree variants have undertaken different approaches to optimize LSM-tree, mainly on reducing the performance penalty caused by compactions. ADOC and SILK take into account the processing capability of the storage device and attempt to schedule compaction jobs with threads. Similarly, one main technique of TRIAD is to postpone scheduling compactions until it has to do so due to too many overlapped keys. Whereas, they place emphasis on scheduling at the granularity of compaction jobs but fail to realize the critical serial order of CPU computations and storage I/Os within a compaction. Consequently, their compaction thread stays waiting for I/O completion signals from the kernel thread while the CPU core does not simultaneously work on meaningful computations. Given a mass of KV pairs that are continuously arriving to stress LSM-tree, the effect of scheduling compaction jobs is inferior and unsatisfactory. As to PhotonDB and Rocks-bu, a replacement of conventional concepts or interfaces without an in-depth study to locate concrete performance bottlenecks is unlikely to bring about substantial gains. Let us still take overwrite and the 1KB value size for example. With coroutines and io_uring, PhotonDB produces a marginal improvement over RocksDB by 14.1% higher throughput. For Rocks-bu, the use of io_uring's batch I/O only even degrades performance by 4.2% compared against original RocksDB.
Fig. 8: A comparison between LSM-tree variants on db_bench's fillrandom, overwrite, readseq, and readrandom ((a) fillrandom, (b) overwrite, (c) readseq, (d) readrandom; throughput in MB/s vs. value size).
NobLSM is inferior to AisLSM with both fillrandom and overwrite workloads. The most significant gap between them is 1.53× with fillrandom and 4KB values. The reason is twofold. Firstly, NobLSM does not consider scheduling file write I/Os but still conducts them synchronously. Whereas, AisLSM asynchronously deals with both file write and fsync I/Os. Secondly, the time cost of checking if SST files are asynchronously committed is non-trivial, particularly with a fast NVMe SSD. For each compaction, NobLSM submits all output SST files for tracking with one customized system call and later asks Ext4 to check for every file, resulting in multiple system calls. NobLSM employs a global kernel-space table to record and track SST files. It is time-consuming to insert and query each file with the table, especially when many SST files gradually accumulate due to continuous compactions. Comparatively, for one asynchronous fsync, AisLSM both submits a request and collects the result by calling the respective io_uring interfaces only once. In addition, AisLSM does not rely on any particular file system or handcrafted Linux kernel, which is a stark contrast to NobLSM. To sum up, AisLSM is much more effectual and portable than NobLSM.

Thirdly, although LSM-tree is generally used to serve write-intensive workloads, we have tested AisLSM's capability in serving read requests. As shown in Figure 8c and Figure 8d, AisLSM is comparable to RocksDB and NobLSM, while some LSM-tree variants exhibit dramatically low performance. For example, the throughputs of TRIAD and PhotonDB are just 15.2% and 13.5% of AisLSM's, respectively, with readrandom and 1KB values. The change AisLSM incurs to the read procedure is just to load KV pairs from a transiently non-durable SST file. This, however, does not affect the visibility of data or the actions of locating a specific key. AisLSM thus achieves a good balance between write and read performances while, for instance, TRIAD keeps too many files at L0 with overlapped key ranges that are not friendly to searches [19, 51]–[53].

C. Deep Dissection with AisLSM

We have done various experiments to deeply evaluate AisLSM. We validate if it guarantees crash consistency (Section VI-C1). We test if it ensures the accessibility of KV pairs (Section VI-C2). As AisLSM shortens the critical path of compaction, we measure how much it reduces the user-facing tail latency (Section VI-C3). Regarding the implementation and optimization techniques AisLSM contains, we further analyze the contribution from each of them and figure out the root cause of the performance boost for AisLSM (Section VI-C4). We next test if AisLSM works on another platform (Section VI-C5), with multiple compaction threads (Section VI-C6), and with multiple instances (Section VI-C7).

1) Crash Consistency Test: To test the crash consistency of AisLSM, we use the command 'halt -f -p -n' to suddenly power off Linux when writing KV pairs with db_bench's fillrandom [26]. We repeat this test five times successively with RocksDB and AisLSM. We find that, besides KV pairs being appended to the WAL, ones residing in SST files are recoverable and retrievable for both RocksDB and AisLSM. By default, they do not persist WALs with fsyncs. The main difference between them is that AisLSM does not wait for the durability of output SST files per compaction. It also does not immediately delete input SST files. AisLSM flushes L0 SST files to persist all KV pairs received from users. Only after the check-up will it delete SST files used as input for past compactions. By tracking the generation dependency between SST files, AisLSM guarantees any KV pair sinking down from Ln to Ln+1 (n ≥ 0) is traceable and durable. All these jointly enable AisLSM's crash recoverability.

2) Data Accessibility Test: We measure whether AisLSM manages to find out all KV pairs under search. To do so, we first run fillrandom by engaging a foreground thread in putting down KV pairs in 20GB with various value sizes and then search keys with readrandom. We note that newer RocksDB since version 6.2 no longer guarantees that db_bench always searches for a stored key. Instead, db_bench randomly generates target keys and some of them might not be existent in the LSM-tree. The randomization is based on a seed related to the search time by default. RocksDB provides an option to fix the seed so that we can repeat the same readrandom test case. Under equivalent search conditions, AisLSM and RocksDB locate the same number of KV pairs, with about 63.2% of all searched keys found. This justifies that AisLSM's data accessibility is identical to that of RocksDB.

3) Tail Latency: In addition to throughput, the tail latency is another critical performance metric, especially for latency-sensitive applications [12, 18, 54].
TABLE I: The tail latency (99P) for LSM-tree variants

Value size | RocksDB | ADOC    | TRIAD   | Rocks-bu | SILK | PhotonDB | NobLSM  | AisLSM
64B        | 3.9     | 3.7     | 9.9     | 3.8      | 5.3  | 4.0      | 5.0     | 3.4
256B       | 8.5     | 8.9     | 11.3    | 8.4      | 5.0  | 10.2     | 8.1     | 4.0
1KB        | 20.1    | 21.4    | 1,030.3 | 25.6     | 12.4 | 20.7     | 26.6    | 10.3
4KB        | 1,932.0 | 2,033.7 | 1,225.0 | 1,167.0  | 9.3  | 1,584.0  | 2,049.0 | 21.0

Fig. 9: A comparison between AisLSM's components ((a) fillrandom, (b) readrandom).
Fig. 10: The impacts of platform and compaction threads ((a) fillrandom on the other platform, (b) fillrandom with multiple compaction threads).
Fig. 11: A comparison between p2KVS and AisLSM ((a) fillrandom, (b) overwrite).
3) Tail Latency: In addition to throughput, the tail latency is another critical performance metric, especially for latency-sensitive applications [12, 18, 54]. For instance, SILK was designed to resolve the issue of high tail latency for LSM-tree [12]. We record the 99th percentile (99P) tail latency while each LSM-tree variant serves the fillrandom workload. Table I shows their tail latencies with four value sizes. Removing synchronous I/Os from the critical path of compaction enables AisLSM to significantly reduce the tail latency. It generally outperforms the other LSM-tree variants, including SILK, in most cases. With 1KB values, the 99P tail latency of AisLSM is 48.8%, 51.9%, 99.0%, 59.8%, 16.9%, 50.4%, and 61.4% lower than that of RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM, respectively. These results complement our observations on AisLSM's high throughput. Given a shorter latency to finish a compaction, AisLSM is able to process more compaction jobs and incur fewer stalls.
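For clarity, the 99P figures in Table I are plain order statistics over per-request latencies. A minimal sketch of the computation (our own helper, not db_bench's reporting code):

#include <algorithm>
#include <vector>

// Return the p-quantile (e.g., p = 0.99) of a set of latency samples.
double percentile(std::vector<double> samples, double p) {
  if (samples.empty()) return 0.0;
  size_t idx = static_cast<size_t>(p * (samples.size() - 1));
  std::nth_element(samples.begin(), samples.begin() + idx, samples.end());
  return samples[idx];        // value below which ~p of the samples fall
}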
4) Impacts of asynchronous writes, fsyncs, and I/O polling: We have configured and tested three variants of AisLSM in order to thoroughly identify the root cause of its performance gain. The first one relies on io_uring only to conduct asynchronous fsyncs for each compaction. The second one performs asynchronous I/Os with io_uring for both file writes and fsyncs, but all I/Os are interrupt-driven. The third one is the full version used for the comparisons in Section VI-B; it is similar to the second one except that it is tuned with I/O polling and the direct I/O mode for the NVMe SSD. The three variants are hereafter referred to as AisLSM-fsync, AisLSM-interrupt, and AisLSM, respectively.
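To make the difference between the variants concrete, the sketch below shows how a compaction's file write and its fsync can be queued on io_uring without blocking the submitting thread. It is only our minimal illustration, not AisLSM's code: a single write stands in for the batched writes of a whole SST file, linking the two operations with IOSQE_IO_LINK is our assumption, and error handling is omitted. The interrupt-driven and polled variants are assumed to differ in the flags passed to io_uring_queue_init (0 versus IORING_SETUP_IOPOLL for the data writes) and, for polling, in opening files with O_DIRECT, since polling only works with direct I/O as noted in Section V-B.

#include <liburing.h>
#include <cstdint>

// Queue an SST file's write and the following fsync; the user-space
// compaction thread returns immediately and reaps CQEs later.
void submit_write_then_fsync(struct io_uring* ring, int fd,
                             const void* buf, unsigned len, uint64_t sst_id) {
  struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
  io_uring_prep_write(sqe, fd, buf, len, /*offset=*/0);
  sqe->flags |= IOSQE_IO_LINK;                   // fsync starts after the write
  sqe = io_uring_get_sqe(ring);
  io_uring_prep_fsync(sqe, fd, /*fsync_flags=*/0);
  io_uring_sqe_set_data(sqe, (void*)(uintptr_t)sst_id);  // tag for the check-up
  io_uring_submit(ring);                         // non-blocking submission
}

Completions later surface as CQEs tagged with sst_id, which is what the deferred check-up sketched earlier consumes.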
We still let db_bench operate on 80GB of KV pairs with four foreground threads. Figure 9a and Figure 9b present all variants' throughputs with different value sizes upon processing fillrandom and readrandom, respectively. We obtain three main observations from these two diagrams. Firstly, removing fsyncs alone from the critical path of compaction is able to substantially boost performance compared to RocksDB. This aligns with our observation in Section III, as synchronous fsyncs cost more time than file writes (see Figure 4). Secondly, the impact of I/O polling is more significant with larger KV pairs. As mentioned, given an SST file of a fixed size holding larger values, disk I/Os become more time-consuming than CPU computations per compaction. Therefore, I/O polling is more efficient than the interrupt-driven mode at delivering data into the NVMe SSD. With 4KB values, the gap between AisLSM-interrupt and AisLSM is as wide as 31.2%. This justifies that the optimization with I/O polling is necessary and gainful for AisLSM. Thirdly, AisLSM yields slightly lower read performance than AisLSM-interrupt, which handles readrandom requests with interrupt-driven I/Os, e.g., by 5.2% with 1KB values. As mentioned in Section V-B, io_uring currently supports I/O polling only in the direct I/O mode. Compared to AisLSM-interrupt, AisLSM thus loads KV pairs from SST files directly, without using the OS's buffer cache. This explains the marginal difference between AisLSM-interrupt and AisLSM when reading data. To enhance AisLSM's capability in serving read requests, we can consider incorporating user-space buffers to effectively cache KV pairs [51, 55].

5) The Impact of Platform: We test AisLSM on the other machine with an Intel Xeon Gold 6342 CPU and a 960GB SAMSUNG MZ7LH960 SATA SSD. Figure 10a compares AisLSM against RocksDB when serving the fillrandom workload issued by four foreground threads with 80GB of data in total. AisLSM is still more performant. However, due to the changes in both CPU computation power and SSD access speed, the highest leap (1.5×) over RocksDB occurs at the 512B value size.

6) The Impact of Compaction Threads: RocksDB employs one compaction thread by default. We vary the number of compaction threads as 1, 2, 4, and 8. Without loss of generality, we run fillrandom with 1KB values and engage 16 foreground threads, each issuing 20 million requests to generate more pressure. As shown by Figure 10b, AisLSM yields 23.9% to 60.2% higher throughput than RocksDB. This confirms that AisLSM supports multi-threaded concurrent compactions well.
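The thread counts above are plain RocksDB tuning knobs. A hedged sketch of how such a configuration can be expressed with RocksDB's public options (the path and count are placeholders, not our exact benchmark settings):

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <string>

rocksdb::DB* open_with_compaction_threads(const std::string& path, int n) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.max_background_jobs = n;          // caps concurrent flush/compaction jobs
  options.env->SetBackgroundThreads(n, rocksdb::Env::LOW);  // compaction thread pool
  rocksdb::DB* db = nullptr;
  rocksdb::DB::Open(options, path, &db);    // error handling omitted in this sketch
  return db;
}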
Fig. 12: A comparison between LSM-tree variants on YCSB's workloads. Panels: (a) Load-A, (b) Workload A, (c) Workload B, (d) Workload C, (e) Workload F, (f) Workload D, (g) Load-E, (h) Workload E; y-axis: time (s). Bars: RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, NobLSM, AisLSM-fsync, AisLSM-interrupt, AisLSM.
7) Multiple Instances with AisLSM: Using multiple instances to partition and serve KV ranges is a promising approach to high throughput and scalability. p2KVS [30] is one representative that uses RocksDB as its instance. We also configure AisLSM as its instance. We set up four instances for both p2KVS (RocksDB) and AisLSM and engage each instance in serving 20 million requests with the fillrandom and overwrite workloads. Figure 11a and Figure 11b comparatively present the throughputs of AisLSM and p2KVS on handling the two workloads, respectively. AisLSM has an evident advantage over p2KVS on both workloads. For example, with 1KB values, AisLSM yields 1.5× higher throughput than p2KVS. AisLSM differs from p2KVS in that an instance of the former is more performant than an instance of RocksDB used by the latter. Given identical strategies for sharding and scheduling KV pairs among multiple instances, an instance of AisLSM processes requests at a much more prompt pace than one of p2KVS. AisLSM is hence more efficient than p2KVS. As to overwrite, AisLSM outperforms p2KVS at all value sizes. The widest gap between them is 2.0× with 4KB values.
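The sharding strategy itself is orthogonal to the engine inside each instance. A minimal sketch of the idea (our own illustration; p2KVS's actual partitioning and scheduling are more elaborate) routes each key to one of N independent stores:

#include <functional>
#include <memory>
#include <string>
#include <vector>

// Store stands for any single-instance engine (RocksDB, AisLSM, ...).
template <typename Store>
class ShardedKV {
 public:
  explicit ShardedKV(std::vector<std::unique_ptr<Store>> shards)
      : shards_(std::move(shards)) {}

  void Put(const std::string& key, const std::string& value) {
    pick(key).Put(key, value);          // each shard flushes and compacts on its own
  }
  bool Get(const std::string& key, std::string* value) {
    return pick(key).Get(key, value);
  }

 private:
  Store& pick(const std::string& key) {
    return *shards_[std::hash<std::string>{}(key) % shards_.size()];
  }
  std::vector<std::unique_ptr<Store>> shards_;
};

With identical routing, the end-to-end throughput is bounded by how fast each instance absorbs its share of requests, which is why a faster per-instance engine lifts the whole framework.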
D. Macro-benchmark

The Yahoo! Cloud Serving Benchmark (YCSB) [43] is a comprehensive, open-source tool that is widely used to evaluate the performance of LSM-tree-based KV stores. YCSB provides six core workloads that emulate access patterns found in typical production environments. They are A (update heavy, 50%/50% read/write), B (read mostly, 95%/5% read/write), C (read only, 100% read), D (read latest, 95%/5% read/insert), E (short ranges, 95%/5% range query/insert), and F (read-modify-write, 50%/50% read-modify-write/read). These workloads are either write- or read-dominant, or a mix of write and read requests. YCSB's default size per KV pair is about 1KB. We run the YCSB workloads in the order of Load-A, A, B, C, F, D, Load-E, and E, by referring to previous works [25, 26, 56]. We make Load-A and Load-E remove existing data in each LSM-tree variant and put down 50 million KV pairs. They hence store roughly 50GB of data as the base. Every workload carries ten million requests to be served. Figure 12a to Figure 12h capture the service time of each LSM-tree variant in this order. Note that for AisLSM we show the results for all three of its variants mentioned in Section VI-C4. From these diagrams we can obtain four observations.

Firstly, with write-dominant workloads, such as Load-A and Load-E, the AisLSM variants consistently yield higher performance than state-of-the-art LSM-tree variants. For example, with Load-A, the time that RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM spent is 1.8×, 1.8×, 2.5×, 1.8×, 2.3×, 2.2×, and 1.1× that of AisLSM, respectively. This improvement is again credited to the novel compaction procedure revolutionized by AisLSM with the asynchronous I/O model. The shortened time cost of compaction entails less stall time, so workloads are processed in less service time. Secondly, as to read-dominant workloads, including workloads B, C, D, and E, the AisLSM family yields comparable or higher performance than state-of-the-art LSM-tree variants. For example, with workload B, the service time of RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, and NobLSM is 1.4×, 1.1×, 3.5×, 1.2×, 1.2×, 1.9×, and 1.2× that of AisLSM, respectively. These results align with what we have obtained with db_bench's readrandom. As illustrated by Figure 8d, TRIAD and PhotonDB show the lowest throughputs. Thirdly, for workloads mixed with write and read requests, such as workloads A and F, AisLSM is still performant over the other LSM-tree variants. For example, to finish these two workloads, the original RocksDB demands 19.1% and 17.2% more time than AisLSM, respectively. Last but not least, AisLSM-interrupt, the variant of AisLSM with interrupt-driven I/Os, is a bit faster than AisLSM in handling workloads with read requests. For example, with workload E for range queries, AisLSM-interrupt working in the buffered I/O mode costs 4.9% less time than AisLSM working in the direct I/O mode. As mentioned, although the access speed of NVMe SSD is higher than that of legacy storage devices, the use of the OS's or LSM-tree's buffers would be helpful for serving read requests.
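The buffered-versus-direct distinction behind the last observation comes down to a single open flag. The sketch below only illustrates the two read paths and is not AisLSM's file-handling code:

// O_DIRECT is Linux-specific; compile with -D_GNU_SOURCE so it is visible.
#include <fcntl.h>

int open_sst_buffered(const char* path) {
  return open(path, O_RDONLY);            // reads are served via the OS page cache
}
int open_sst_direct(const char* path) {
  // Bypasses the page cache, as required for io_uring's polling mode (Section V-B);
  // read buffers must then be aligned to the device's logical block size.
  return open(path, O_RDONLY | O_DIRECT);
}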
Fig. 13: A comparison between LSM-tree variants with different distributions. (a) YCSB workload A; (b) YCSB workload F; x-axis groups: A-Zipf, A-Unif, A-Latest and F-Zipf, F-Unif, F-Latest; y-axis: time (s). Bars: RocksDB, ADOC, TRIAD, Rocks-bu, SILK, PhotonDB, NobLSM, AisLSM-fsync, AisLSM-interrupt, AisLSM.
E. The Impact of Key's Distribution

To observe the impact of the distribution of keys on the performance of AisLSM, we conduct experiments with YCSB's workloads under three distributions, i.e., Zipfian, uniform, and latest. Because of the space limitation, we present the execution time with workloads A and F, as both of them contain a 50%/50% mix of write and read requests. As indicated by Figure 13, AisLSM outperforms the other LSM-tree variants under the Zipfian and uniform distributions. With the latest distribution, AisLSM achieves comparable performance to RocksDB. The reason is that the latest distribution always chooses the most recent data for operation. Write and read requests are hence mostly satisfied at LSM-tree's memtables and block cache. With the same keys being repeatedly updated and merged at the memtable level, compactions are not largely triggered. The performance gain of the revolutionized compaction is consequently marginal for AisLSM.

VII. RELATED WORKS

We have quantitatively discussed and evaluated a few prior works in Section VI. Some of their techniques have been proved useful in reducing the performance penalty caused by compactions. For example, the way TRIAD separates hot and cold KV pairs was also considered by other works. Huang et al. [27] found that even a small number of frequently updated hot KV pairs would quickly fill up SST files and cause more compaction jobs over time. They accordingly install an auxiliary log to distinguish and handle hot data. Decoupling values from keys is another technique that can effectively lower the frequency of compactions, since a pointer (location) to each actual value, instead of the entire value, is stored in the SST file [19, 28, 29]. Some researchers proposed concurrent compactions [14, 30, 57]. For example, p2KVS [30], mentioned in Section VI-C7, partitions the KV space into independent spaces and manages multiple instances correspondingly. Such instances concurrently schedule and perform compaction jobs. There are also research works that leverage buffers to accelerate search performance for LSM-tree [51, 55]. AisLSM's revolutionized compaction is complementary to these techniques, and they can collaboratively take effect for high performance.

The foregoing designs mainly operate at the granularity of compaction jobs. Some researchers considered dissecting the internals of a compaction. For example, Zhang et al. [17] tried to make use of the parallelism between CPU and I/O device, like what we have done in this paper. However, they decomposed a compaction job at the granularity of blocks (4KB by default). Then they tried to pipeline CPU computation and synchronous I/Os for consecutive blocks of every SST file. Limited by the perspective at the block-level granularity and the obliviousness of asynchronous I/Os, they had to rely on multiple parallel storage devices for high I/O bandwidth to catch up with the computing speed of the CPU. Whereas, this demands changes across user and kernel spaces for storage management. Additionally, their pipeline might not be stable over time, since small data in one or a few blocks is difficult for CPU and storage to process at a steadily stable speed. Comparatively, AisLSM takes effect at the granularity of an SST file in scores of megabytes and exploits the existing storage stack (e.g., io_uring and NVMe SSD) for implementation. It shall have higher performance, viability, and stability.

Not many researchers have considered the impact of the fsyncs used in compactions on the performance of LSM-tree [25, 26]. Prior to the aforementioned NobLSM, Kim et al. [25] proposed BoLT. BoLT produces one huge output SST file for each compaction and persists all compacted KV pairs in one aggregated fsync. This reduces the performance penalty caused by multiple fsync calls on individual SST files, but the eventual large fsync still occurs synchronously on the critical path.

Researchers also studied the processing speeds of computations and I/Os for compaction. Some of them used FPGAs to accelerate computations [58, 59]. Emerging storage devices like NVMe SSD and persistent memory (pmem) also attracted wide attention [20, 47, 60]. For example, Chen et al. [20] proposed SpanDB, which jointly makes use of a faster NVMe SSD and an ordinary slower SSD to suit the characteristics of WAL and SST files for storage. It also uses I/O polling with the NVMe SSD. In addition, SpanDB and p2KVS share similarity in dedicating separate foreground and background threads to serve user requests and to do flush or compaction jobs, respectively. By doing so, SpanDB aims to overlap foreground services with background jobs at runtime. Meanwhile, researchers developed LSM-tree variants [6, 7, 13, 14, 29, 61] to leverage the non-volatility and byte-addressability of pmem. Though, the winding down of Intel's Optane DC memory business [62, 63] may impact their deployment.

VIII. CONCLUSION

In this paper, we overhaul the compaction procedure of LSM-tree. The critical path of a compaction job is composed of CPU computations and disk I/Os. At runtime, LSM-tree's compaction thread synchronously waits for the completion of file write and fsync I/Os that a kernel thread is handling. We accordingly develop AisLSM, which overlaps CPU computations (resp. user thread) with disk I/Os (resp. kernel thread) for consecutive compactions and, in particular, performs disk I/Os with an asynchronous model. AisLSM also decouples visibility from durability for compacted KV pairs. With a deferred check-up and deletion strategy, AisLSM ensures that data stored in SST files is both visible and durable. We thoroughly evaluate AisLSM. Experiments show that, by shortening the critical path of compaction, AisLSM highly boosts the performance of LSM-tree and outperforms state-of-the-art designs.
R EFERENCES [18] Y. Chai, Y. Chai, X. Wang, H. Wei, N. Bao, and Y. Liang, “LDC: A
lower-level driven compaction method to optimize SSD-oriented key-
[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Bur- value stores,” in 2019 IEEE 35th International Conference on Data
rows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed Engineering (ICDE), 2019, pp. 722–733.
storage system for structured data,” in 7th USENIX Symposium on [19] O. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan,
Operating Systems Design and Implementation (OSDI), November 2006, A. Arora, K. Gupta, and P. Konka, “TRIAD: Creating synergies
pp. 205–218. between memory, disk and log in log structured key-value stores,” in
[2] S. Ghemawat and J. Dean, “LevelDB,” March 2011, https://github.com/ 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa
google/leveldb. Clara, CA: USENIX Association, July 2017, pp. 363–375. [Online].
[3] F. D. E. Team, “RocksDB,” October 2017, https://rocksdb.org/. Available: https://www.usenix.org/conference/atc17/technical-sessions/
presentation/balmau
[4] T. A. S. Foundation, “Apache HBase,” January 2009, https://hbase.
[20] H. Chen, C. Ruan, C. Li, X. Ma, and Y. Xu, “SpanDB: A fast,
apache.org/.
Cost-Effective LSM-tree based KV store on hybrid storage,” in 19th
[5] A. Lakshman and P. Malik, “Cassandra: A decentralized structured
USENIX Conference on File and Storage Technologies (FAST 21).
storage system,” SIGOPS Oper. Syst. Rev., vol. 44, no. 2, p. 35–40, Apr.
USENIX Association, Feb. 2021, pp. 17–32. [Online]. Available:
2010. [Online]. Available: https://doi.org/10.1145/1773912.1773922
https://www.usenix.org/conference/fast21/presentation/chen-hao
[6] S. Kannan, N. Bhat, A. Gavrilovska, A. Arpaci-Dusseau, and R. Arpaci- [21] J. Chu, Y. Tu, Y. Zhang, and C. Weng, “Latte: A native table engine on
Dusseau, “Redesigning LSMs for nonvolatile memory with NoveLSM,” NVMe storage,” in 2020 IEEE 36th International Conference on Data
in 2018 USENIX Annual Technical Conference (USENIX ATC 18). Engineering (ICDE), 2020, pp. 1225–1236.
Boston, MA: USENIX Association, Jul. 2018, pp. 993–1005. [Online]. [22] J. Axboe, “Efficient io with io uring,” October 2019, https://kernel.dk/
Available: https://www.usenix.org/conference/atc18/presentation/kannan io uring.pdf.
[7] O. Kaiyrakhmet, S. Lee, B. Nam, S. H. Noh, and Y. ri Choi, “SLM-DB: [23] J. Corbet, “The rapid growth of io uring,” January 2020, https://lwn.net/
Single-level key-value store with persistent memory,” in 17th USENIX Articles/810414/.
Conference on File and Storage Technologies (FAST 19). Boston, MA: [24] B. Mottahedeh, “An introduction to the io uring asyn-
USENIX Association, Feb. 2019, pp. 191–205. [Online]. Available: chronous i/o framework,” https://blogs.oracle.com/linux/post/
https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet an-introduction-to-the-io-uring-asynchronous-io-framework, May
[8] B. Lepers, O. Balmau, K. Gupta, and W. Zwaenepoel, “KVell: The 2020.
design and implementation of a fast persistent key-value store,” in [25] D. Kim, C. Park, S.-W. Lee, and B. Nam, “BoLT: Barrier-optimized
Proceedings of the 27th ACM Symposium on Operating Systems LSM-tree,” in Proceedings of the 21st International Middleware
Principles, ser. SOSP ’19. New York, NY, USA: Association Conference, ser. Middleware ’20. New York, NY, USA: Association
for Computing Machinery, 2019, pp. 447–461. [Online]. Available: for Computing Machinery, 2020, p. 119–133. [Online]. Available:
https://doi.org/10.1145/3341301.3359628 https://doi.org/10.1145/3423211.3425676
[9] K. Ren, Q. Zheng, J. Arulraj, and G. Gibson, “SlimDB: A space- [26] H. Dang, C. Ye, Y. Hu, and C. Wang, “NobLSM: An LSM-tree with
efficient key-value storage engine for semi-sorted data,” Proc. VLDB non-blocking writes for SSDs,” in Proceedings of the 59th ACM/IEEE
Endow., vol. 10, no. 13, pp. 2037–2048, sep 2017. [Online]. Available: Design Automation Conference, ser. DAC ’22. New York, NY, USA:
https://doi.org/10.14778/3151106.3151108 Association for Computing Machinery, 2022, p. 403–408. [Online].
[10] R. Wang, J. Wang, P. Kadam, M. Tamer Özsu, and W. G. Aref, “dLSM: Available: https://doi.org/10.1145/3489517.3530470
An LSM-based index for memory disaggregation,” in 2023 IEEE 39th [27] K. Huang, Z. Jia, Z. Shen, Z. Shao, and F. Chen, “Less is more: De-
International Conference on Data Engineering (ICDE), April 2023, pp. amplifying I/Os for key-value stores with a log-assisted lsm-tree,” in
2835–2849. 2021 IEEE 37th International Conference on Data Engineering (ICDE),
[11] H. Saxena, L. Golab, S. Idreos, and I. F. Ilyas, “Real-time LSM-trees April 2021, pp. 612–623.
for HTAP workloads,” in 2023 IEEE 39th International Conference on [28] L. Lu, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau,
Data Engineering (ICDE), April 2023, pp. 1208–1220. “WiscKey: Separating keys from values in SSD-conscious storage,”
[12] O. Balmau, F. Dinu, W. Zwaenepoel, K. Gupta, R. Chandhiramoorthi, in 14th USENIX Conference on File and Storage Technologies (FAST
and D. Didona, “SILK: Preventing latency spikes in Log-Structured 16). Santa Clara, CA: USENIX Association, February 2016, pp.
merge Key-Value stores,” in 2019 USENIX Annual Technical Conference 133–148. [Online]. Available: https://www.usenix.org/conference/fast16/
(USENIX ATC 19). Renton, WA: USENIX Association, Jul. 2019, pp. technical-sessions/presentation/lu
753–766. [Online]. Available: https://www.usenix.org/conference/atc19/ [29] W. Kim, C. Park, D. Kim, H. Park, Y. ri Choi, A. Sussman,
presentation/balmau and B. Nam, “ListDB: Union of Write-Ahead logs and persistent
[13] T. Yao, Y. Zhang, J. Wan, Q. Cui, L. Tang, H. Jiang, C. Xie, SkipLists for incremental checkpointing on persistent memory,”
and X. He, “MatrixKV: Reducing write stalls and write amplification in 16th USENIX Symposium on Operating Systems Design and
in LSM-tree based KV stores with matrix container in NVM,” Implementation (OSDI 22). Carlsbad, CA: USENIX Association,
in 2020 USENIX Annual Technical Conference (USENIX ATC 20). Jul. 2022, pp. 161–177. [Online]. Available: https://www.usenix.org/
USENIX Association, July 2020, pp. 17–31. [Online]. Available: conference/osdi22/presentation/kim
https://www.usenix.org/conference/atc20/presentation/yao [30] Z. Lu, Q. Cao, H. Jiang, S. Wang, and Y. Dong, “p2 KVS: A
[14] Y. Chen, Y. Lu, F. Yang, Q. Wang, Y. Wang, and J. Shu, “FlatStore: portable 2-dimensional parallelizing framework to improve scalability
An efficient log-structured key-value storage engine for persistent of key-value stores on SSDs,” in Proceedings of the Seventeenth
memory,” in Proceedings of the Twenty-Fifth International Conference European Conference on Computer Systems, ser. EuroSys ’22. New
on Architectural Support for Programming Languages and Operating York, NY, USA: Association for Computing Machinery, 2022, pp.
Systems, ser. ASPLOS ’20. New York, NY, USA: Association 575–591. [Online]. Available: https://doi.org/10.1145/3492321.3519567
for Computing Machinery, 2020, p. 1077–1091. [Online]. Available: [31] PingCAP-Hackthon2019-Team17, “Io-uring speed the rocksdb & tikv,”
https://doi.org/10.1145/3373376.3378515 October 2019, https://openinx.github.io/ppt/io-uring.pdf.
[15] A. Mahajan, “Write Stalls for RocksDB,” October 2021, https://github. [32] A. Cloud, “PhotonLibOS,” July 2022, https://github.com/alibaba/
com/facebook/rocksdb/wiki/Write-Stalls. PhotonLibOS.
[16] J. Yu, S. H. Noh, Y. ri Choi, and C. J. Xue, “ADOC: Automatically [33] L. Torvalds, “Re: [patch 09/13] aio: add support for async openat(),”
harmonizing dataflow between components in Log-Structured Key- January 2016, https://lwn.net/Articles/671657/.
Value stores for improved performance,” in 21st USENIX Conference [34] Intel, “Storage performance development kit,” January 2023, https:
on File and Storage Technologies (FAST 23). Santa Clara, CA: //spdk.io/.
USENIX Association, Feb. 2023, pp. 65–80. [Online]. Available: [35] J. Yang, D. B. Minturn, and F. Hady, “When poll is better than interrupt,”
https://www.usenix.org/conference/fast23/presentation/yu in Proceedings of the 10th USENIX Conference on File and Storage
[17] Z. Zhang, Y. Yue, B. He, J. Xiong, M. Chen, L. Zhang, and N. Sun, Technologies, ser. FAST’12. USA: USENIX Association, Feb 2012,
“Pipelined compaction for the LSM-tree,” in 2014 IEEE 28th Interna- p. 3.
tional Parallel and Distributed Processing Symposium, 2014, pp. 777– [36] H.-J. Kim, Y.-S. Lee, and J.-S. Kim, “NVMeDirect: A user-space I/O
786. framework for application-specific optimization on NVMe SSDs,” in
Proceedings of the 8th USENIX Conference on Hot Topics in Storage York, NY, USA: Association for Computing Machinery, 2021, p.
and File Systems, ser. HotStorage’16. USA: USENIX Association, 280–294. [Online]. Available: https://doi.org/10.1145/3477132.3483593
2016, p. 41–45. [51] F. Wu, M.-H. Yang, B. Zhang, and D. H. Du, “AC-Key: Adaptive
[37] B. Peng, H. Zhang, J. Yao, Y. Dong, Y. Xu, and H. Guan, caching for LSM-based key-value stores,” in 2020 USENIX Annual
“MDev-NVMe: A NVMe storage virtualization solution with mediated Technical Conference (USENIX ATC 20). USENIX Association,
Pass-Through,” in 2018 USENIX Annual Technical Conference July 2020, pp. 603–615. [Online]. Available: https://www.usenix.org/
(USENIX ATC 18). Boston, MA: USENIX Association, Jul. 2018, pp. conference/atc20/presentation/wu-fenggang
665–676. [Online]. Available: https://www.usenix.org/conference/atc18/ [52] W. Zhong, C. Chen, X. Wu, and S. Jiang, “REMIX: Efficient range
presentation/peng query for LSM-trees,” in 19th USENIX Conference on File and
[38] L. of the io uring, “io uring setup,” https://unixism.net/loti/ref-iouring/ Storage Technologies (FAST 21). USENIX Association, Feb. 2021, pp.
io uring setup.html, June 2020. 51–64. [Online]. Available: https://www.usenix.org/conference/fast21/
[39] Y. Won, J. Jung, G. Choi, J. Oh, S. Son, J. Hwang, and S. Cho, “Barrier- presentation/zhong
enabled IO stack for flash storage,” in Proceedings of the 16th USENIX [53] S. Sarkar, N. Dayan, and M. Athanassoulis, “The LSM design space and
Conference on File and Storage Technologies, ser. FAST’18. USA: its read optimizations,” in 2023 IEEE 39th International Conference on
USENIX Association, 2018, p. 211–226. Data Engineering (ICDE), April 2023, pp. 3578–3584.
[40] T. kernel development community, “Multi-queue block IO queueing [54] J. Liang and Y. Chai, “CruiseDB: An LSM-tree key-value store with both
mechanism (blk-mq),” https://www.kernel.org/doc/html/latest/block/ better tail throughput and tail latency,” in 2021 IEEE 37th International
blk-mq.html#multi-queue-block-io-queueing-mechanism-blk-mq. Conference on Data Engineering (ICDE), 2021, pp. 1032–1043.
[41] S. Roesch, “Re: [PATCH v7 00/15] io-uring/xfs: support [55] D. Teng, L. Guo, R. Lee, F. Chen, S. Ma, Y. Zhang, and X. Zhang,
async buffered writes,” https://lore.kernel.org/linux-mm/ “LSbM-tree: Re-enabling buffer caching in data management for mixed
[email protected]/, June 2022. reads and writes,” in 2017 IEEE 37th International Conference on
[42] P. Administrator, “Linux 5.20 to support async buffered writes Distributed Computing Systems (ICDCS), 2017, pp. 68–79.
for XFS + io uring for big performance boost,” https://www. [56] P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham, “PebblesDB:
phoronix.com/forums/forum/software/general-linux-open-source/ Building key-value stores using fragmented log-structured merge
1330236-linux-5-20-to-support-async-buffered-writes-for-xfs-io trees,” in Proceedings of the 26th Symposium on Operating Systems
uring-for-big-performance-boost, June 2022. Principles, ser. SOSP ’17. New York, NY, USA: Association
[43] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, for Computing Machinery, 2017, p. 497–514. [Online]. Available:
“Benchmarking cloud serving systems with YCSB,” in Proceedings of https://doi.org/10.1145/3132747.3132765
the 1st ACM Symposium on Cloud Computing, ser. SoCC ’10. New [57] H. Huang and S. Ghandeharizadeh, “Nova-LSM: A distributed,
York, NY, USA: ACM, 2010, pp. 143–154. component-based LSM-tree key-value store,” in Proceedings of
[44] B. Chen, “200 lines of code to rewrite the 600’000 lines RocksDB the 2021 International Conference on Management of Data, ser.
into a coroutine program,” December 2022, https://github.com/facebook/ SIGMOD/PODS ’21. New York, NY, USA: Association for
rocksdb/issues/11017. Computing Machinery, 2021, p. 749–763. [Online]. Available: https:
[45] Y. Kang, X. Huang, S. Song, L. Zhang, J. Qiao, C. Wang, J. Wang, and //doi.org/10.1145/3448016.3457297
J. Feinauer, “Separation or not: On handing out-of-order time-series data [58] T. Zhang, J. Wang, X. Cheng, H. Xu, N. Yu, G. Huang, T. Zhang, D. He,
in leveled LSM-tree,” in 2022 IEEE 38th International Conference on F. Li, W. Cao, Z. Huang, and J. Sun, “FPGA-accelerated compactions
Data Engineering (ICDE), 2022, pp. 3340–3352. for LSM-based key-value store,” in 18th USENIX Conference on File
[46] X. Wang, P. Jin, B. Hua, H. Long, and W. Huang, “Reducing write and Storage Technologies (FAST 20). Santa Clara, CA: USENIX
amplification of LSM-tree with block-grained compaction,” in 2022 Association, February 2020, pp. 225–237. [Online]. Available:
IEEE 38th International Conference on Data Engineering (ICDE), 2022, https://www.usenix.org/conference/fast20/presentation/zhang-teng
pp. 3119–3131. [59] X. Sun, J. Yu, Z. Zhou, and C. J. Xue, “FPGA-based compaction
[47] Y. Zhong, Z. Shen, Z. Yu, and J. Shu, “Redesigning high-performance engine for accelerating LSM-tree key-value stores,” in 2020 IEEE 36th
LSM-based key-value stores with persistent CPU caches,” in 2023 IEEE International Conference on Data Engineering (ICDE), 2020, pp. 1261–
39th International Conference on Data Engineering (ICDE), 2023, pp. 1272.
1098–1111. [60] Y. Zhang, H. Hu, X. Zhou, E. Xie, H. Ren, and L. Jin, “PM-Blade: A
[48] Y.-S. Chang, Y. Hsiao, T.-C. Lin, C.-W. Tsao, C.-F. Wu, Y.-H. Chang, persistent memory augmented LSM-tree storage for database,” in 2023
H.-S. Ko, and Y.-F. Chen, “Determinizing crash behavior with a IEEE 39th International Conference on Data Engineering (ICDE), April
verified Snapshot-Consistent flash translation layer,” in 14th USENIX 2023, pp. 3363–3375.
Symposium on Operating Systems Design and Implementation (OSDI [61] L. Benson, H. Makait, and T. Rabl, “Viper: An efficient hybrid
20). USENIX Association, Nov. 2020, pp. 81–97. [Online]. Available: PMem-DRAM key-value store,” Proc. VLDB Endow., vol. 14,
https://www.usenix.org/conference/osdi20/presentation/chang no. 9, pp. 1544–1556, may 2021. [Online]. Available: https:
[49] H. Li, M. L. Putra, R. Shi, X. Lin, G. R. Ganger, and H. S. Gunawi, //doi.org/10.14778/3461535.3461543
“LODA: A host/device co-design for strong predictability contract [62] P. Alcorn, “Intel kills Optane memory business, pays $559
on modern flash storage,” in Proceedings of the ACM SIGOPS 28th million inventory write-off,” https://www.tomshardware.com/news/
Symposium on Operating Systems Principles, ser. SOSP ’21. New intel-kills-optane-memory-business-for-good, August 2022.
York, NY, USA: Association for Computing Machinery, 2021, p. [63] S. Zhong, C. Ye, G. Hu, S. Qu, A. Arpaci-Dusseau, R. Arpaci-Dusseau,
263–279. [Online]. Available: https://doi.org/10.1145/3477132.3483573 and M. Swift, “MadFS: Per-file virtualization for userspace persistent
[50] J. Park and Y. I. Eom, “FragPicker: A new defragmentation tool for memory filesystems,” in Proceedings of the 21st USENIX Conference
modern storage devices,” in Proceedings of the ACM SIGOPS 28th on File and Storage Technologies, ser. FAST’23. USA: USENIX
Symposium on Operating Systems Principles, ser. SOSP ’21. New Association, Feb. 2023, p. 1–15.
