RocksDB

RocksDB is a high-performance key-value store developed by Facebook, designed for low latency
and high throughput in large-scale applications. It utilizes a Log-Structured Merge-Tree (LSM
tree) architecture that optimizes write performance and data integrity through features like
write-ahead logging, configurable compaction strategies, and pluggable components. The
database's modular architecture allows for extensive customization, supporting various data
structures and concurrency controls to cater to diverse application requirements.

Introduction – A Brief Overview

Evolution and design goals of RocksDB

RocksDB is a high-performance, embeddable key-value store created at Facebook to meet
the demanding requirements of large-scale applications that need low latency and high
throughput. Its lineage can be traced back to Google’s LevelDB, which introduced the
concept of a Log-Structured Merge-Tree (LSM tree) for efficiently handling write-heavy
workloads. RocksDB extends LevelDB’s core ideas with advanced features such as
configurable compaction strategies, multi-threaded background tasks, column families,
snapshots, transactions, and pluggable components for caching, compression, and
environment abstractions. These enhancements make RocksDB suitable for a broad
spectrum of use cases—from caching layers and message queues to embedded
databases within data processing systems. The design emphasizes write performance and
data integrity: changes are first appended to a write-ahead log (WAL), then written into an
in-memory structure called a memtable. When the memtable reaches its capacity, it is
flushed to disk as a sorted, immutable file known as a Sorted String Table (SST). To keep
on-disk data well-organized, background compaction processes merge SST files into lower
levels, discarding obsolete entries and rewriting keys in order. The interplay between the
write path, storage levels, compaction engine, and read path forms the backbone of
RocksDB’s architecture (artem.krylysov.com).

From an architectural standpoint, RocksDB follows a modular approach. Each major
subsystem (e.g., memtables, WAL, compaction, caching, environment abstraction) is
pluggable, enabling users to tailor the database to their hardware and application
requirements. For instance, one can choose between different memtable implementations
(skip list vs. hash table), table formats (block-based vs. plain table), compression
algorithms (Snappy, LZ4, ZSTD, etc.), and environment backends (POSIX file system vs.
Hadoop HDFS vs. Amazon S3). This flexibility is complemented by extensive configuration
parameters and runtime option tuning, which can be modified on the fly for fine-grained
control over performance. Moreover, RocksDB introduces column families to isolate
groups of keys with distinct options and manage them independently (github.com). This
feature is important for multi-tenant systems or for separating different data types within
the same application. RocksDB also supports ACID transactions (both optimistic and
pessimistic), snapshots, and merge operators, enabling advanced concurrency and
atomicity semantics.
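
As a concrete illustration of this modularity, the following sketch opens a database with two
column families, each carrying its own options. The path and the "metadata" family name are
placeholders, not something RocksDB prescribes.

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <string>
    #include <vector>

    using namespace rocksdb;

    int main() {
      Options options;
      options.create_if_missing = true;
      options.create_missing_column_families = true;

      // Per-family tuning: "metadata" is an illustrative family name.
      ColumnFamilyOptions metadata_opts;
      metadata_opts.write_buffer_size = 32 << 20;  // 32 MB memtable for this family

      std::vector<ColumnFamilyDescriptor> families = {
          {kDefaultColumnFamilyName, ColumnFamilyOptions()},
          {"metadata", metadata_opts}};
      std::vector<ColumnFamilyHandle*> handles;
      DB* db = nullptr;
      Status s = DB::Open(options, "/tmp/rocksdb_example", families, &handles, &db);
      if (!s.ok()) return 1;

      // ... use db, and handles[1] for the "metadata" family ...

      for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
      delete db;
      return 0;
    }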

Importance of LSM tree architecture


The core of RocksDB’s performance lies in the LSM tree structure. Traditional B-tree
databases update data in place, resulting in random I/O patterns that degrade performance
on spinning disks and limit throughput even on SSDs. In contrast, an LSM tree appends
writes sequentially: new writes are first staged in a memtable and logged in the WAL; when
the memtable is flushed, data is written in large contiguous blocks to disk. This design
yields high write throughput by minimizing random I/O. However, appending data results in
multiple versions of keys scattered across different SST files. To maintain space efficiency
and fast reads, RocksDB compacts SST files in the background—merging, sorting, and
removing obsolete entries. The compaction process also ensures that data in lower levels
of the LSM tree does not contain overlapping key ranges (for leveled compaction). Thus, the
LSM tree organizes data across multiple levels, with each level containing a set of SST files.
Level 0 (L0) files can overlap arbitrarily, while levels ≥1 have increasingly larger capacities
and maintain non-overlapping key ranges. RocksDB implements various compaction
strategies (leveled, universal, FIFO) to balance write amplification, read amplification, and
space amplification depending on workload requirements (artem.krylysov.com).

Overview of key subsystems

Understanding RocksDB’s architecture requires examining its critical subsystems:

1. Write Path: When a client issues a write (Put, Delete, Merge), the change is first
recorded in the WAL to guarantee durability. The memtable holds these changes in
memory, ensuring they are sorted by key. Once the memtable is full, it is flushed to
disk as an SST file. Optional features like write batching, two-phase commit, and
concurrency controls modify the way writes interact with the WAL and memtable.

2. Read Path: Reads consult multiple structures: the active memtable, immutable
memtables waiting to be flushed, the cache for data and metadata blocks, and SST
files in various levels. RocksDB provides iterators that unify scanning across these
sources. Bloom filters help quickly determine if a key is absent from an SST file,
avoiding unnecessary disk I/O (artem.krylysov.com).

3. Compaction Engine: To manage space and maintain read efficiency, RocksDB
triggers compactions. The compaction engine selects a set of SST files from one
level and merges them with files from the next lower level. During merging, obsolete
versions of keys (e.g., overwritten or deleted records) are dropped. Different
compaction modes (e.g., leveled and universal) have trade-offs in write
amplification and read amplification. A variety of compaction filters and rate limiters
ensure that compactions do not overwhelm underlying storage (artem.krylysov.com).
4. Caching Subsystem: RocksDB offers a unified block cache that stores
uncompressed data blocks, an optional compressed block cache that stores
compressed blocks, and a metadata cache that stores index and filter blocks.
These caches reduce read latency by serving popular data directly from memory.
The caches are sharded to reduce lock contention, and priority pools reserve a
portion for critical metadata (github.com).

5. Column Families and Version Management: Each column family has independent
configuration (e.g., memtable size, compaction strategy, compression). The version
set is a data structure that records the state of SST files across all column families.
A MANIFEST file logs version edits that describe adding or removing files. Opening
the database involves reading the MANIFEST to reconstruct the latest consistent
super-version (github.com). Snapshots rely on sequence numbers to provide
point-in-time views of the database (github.com).

6. Transactions and Concurrency Control: RocksDB supports both pessimistic and
optimistic transactions. Pessimistic transactions acquire locks on keys to avoid
conflicts, while optimistic transactions detect conflicts at commit time. Snapshot
isolation is provided via sequence numbers, and write batching reduces the
overhead of multiple small writes (github.com).

7. Environment Abstraction (Env): RocksDB interacts with files and I/O via a
pluggable Env layer that abstracts the underlying file system (POSIX, HDFS, S3). The
Env also manages thread pools, flushes, compactions, and rate limiting. Advanced
features like direct I/O (bypassing the OS page cache) and I/O throttling are
configured through Env options (github.com).

8. Pluggable Tables and Formats: RocksDB supports different table formats,
including the default block-based table (SST) and the plain table for in-memory or
large sequential datasets. It also allows custom table factories to implement
specialized formats or compaction strategies (github.com).

9. Statistics and Instrumentation: Detailed statistics (tickers and histograms)
monitor internal events such as block cache hits, compaction time, and write stall
occurrences. Perf contexts and I/O statistics contexts track per-operation
metrics. Verbose logging and event tracing help diagnose performance
issues (github.com).

10. Utility and Testing Tools: The ldb CLI and sst_dump provide administrative
operations for inspecting, dumping, and repairing databases (github.com).
Benchmark tools like db_bench allow developers to measure RocksDB performance
under various workloads (github.com). Built-in stress tests and corruption tests
ensure reliability across releases.

These subsystems interact to deliver a flexible, high-performance storage engine. In the
following chapters, we explore each area in detail, examining the internal data structures,
algorithms, and design choices that make RocksDB a powerful component of modern data
infrastructure.

Write Path Core (MemTable + WAL)

Overview of the Write Path

RocksDB’s write path ensures that modifications to the database are durable, consistent,
and efficiently persisted to disk. When an application calls DB::Put(), DB::Delete(), or
DB::Merge(), the operation is encoded into a write batch and appended to the write-ahead
log (WAL). Appending to the WAL is the first step because it provides durability: if the
process crashes, the database can replay the WAL entries to recover the latest state. After
logging, the update is applied to the active memtable—an in-memory data structure that
stores keys in sorted order. The memtable is typically implemented as a skip list (the
default) or a prefix hash table; alternative implementations (such as vector memtables)
can be configured via the memtable_factory option. For high concurrency, memtable write
operations are protected by a mutex, but RocksDB can group multiple writes into a batch to
reduce locking overhead. Write batches preserve the order of operations and allow atomic
updates across column families (artem.krylysov.com).
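
At the API level the write path is simple to drive; a minimal sketch (key names are
placeholders) might look like this:

    #include <rocksdb/db.h>

    using namespace rocksdb;

    void BasicWrites(DB* db) {
      WriteOptions wo;  // defaults: WAL enabled, no fsync on every write
      // Each call below is logged to the WAL and then applied to the memtable.
      db->Put(wo, "user:42:name", "alice");
      db->Delete(wo, "user:41:name");
      // Merge() records an operand that is combined with the base value lazily;
      // it requires a merge operator to be configured on the column family.
      db->Merge(wo, "user:42:login_count", "1");
    }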

Each write batch is assigned a monotonically increasing sequence number. Sequence
numbers play a crucial role in maintaining ordering across writes, snapshots, and
compaction. When a client sets a snapshot, RocksDB records the current sequence
number; subsequent writes carry larger sequence numbers and are considered newer than
the snapshot. During reads, RocksDB hides entries with sequence numbers greater than
the snapshot’s sequence number to provide snapshot isolation. The sequence number is
also encoded in the internal key structure, which appends to the user key an 8-byte trailer
packing the sequence number and a value type (e.g., kTypeValue, kTypeDeletion, kTypeMerge).
Internal keys are ordered so that newer versions of a key (those with larger sequence numbers)
appear earlier in sorted order, enabling efficient merges and compaction.
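
The snapshot behavior described above is exposed directly through the API; a brief sketch:

    #include <rocksdb/db.h>
    #include <string>

    using namespace rocksdb;

    void SnapshotRead(DB* db) {
      const Snapshot* snap = db->GetSnapshot();   // pins the current sequence number
      db->Put(WriteOptions(), "k", "v2");         // this write carries a larger sequence number

      ReadOptions ro;
      ro.snapshot = snap;
      std::string value;
      db->Get(ro, "k", &value);                   // sees the pre-snapshot version, not "v2"

      db->ReleaseSnapshot(snap);                  // lets compaction reclaim old versions
    }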

Write-Ahead Log (WAL) Format and Durability

The WAL is an append-only file that records every modification before it is applied to the
memtable. Each record in the WAL includes a header with a CRC32 checksum, length
information, and a payload containing serialized write batch operations. The records are
grouped into log blocks (32 KB by default), and each block begins with a block trailer that
includes a flag (indicating if the block is full), the CRC of the block, and padding. A crash
can occur at any point, so the WAL is designed for safe recovery: RocksDB scans the WAL
for valid checksums and replays complete records until a checksum mismatch or an
incomplete record is encountered. Partial or corrupted records at the end are ignored. After
a memtable is flushed to disk, RocksDB discards the corresponding WAL if there are no
other memtables referencing it; this helps reclaim disk space. The WAL can be configured
to sync on every write (WriteOptions::sync, using fsync() when Options::use_fsync is set),
to rely on group commit (batching small writes), or to flush asynchronously for improved
throughput. For transactional workloads, RocksDB also supports a two-phase commit
protocol in which a prepare record is appended to the WAL before the final commit marker.

Memtable Structure and Implementation Variants

The memtable is a central component of the write path. It holds recently written key-value
pairs until it is full, at which point it is converted into an immutable memtable and queued
for flushing. The default memtable implementation uses a skip list—a probabilistic data
structure that provides fast search, insertion, and deletion with logarithmic complexity.
Skip lists maintain multiple levels of pointers, enabling quick traversal across the key
space. Each entry in the memtable stores an internal key (user key + sequence number +
value type) and a pointer to the value. Because the memtable holds entries in sorted order,
range scans and prefix queries are efficient. RocksDB allows users to choose other
memtable types:

 Hash skip list: This memtable uses a hash table to divide the key space into
buckets, with each bucket storing a skip list. It reduces the overhead of skip list level
pointers and improves CPU cache locality. It is beneficial for workloads with a large
key space and random writes.

 Vector memtable (VectorMemtable): This implementation uses a dynamic array
(std::vector) to store keys and values sequentially. It is optimized for write-once
workloads where keys are inserted in non-decreasing order. Vector memtables
enable simpler and faster writes because there is no random insertion cost.
However, random insertions degrade performance and cause memory copying.

 Hash linked list: This memtable uses a hash table that maps a key’s prefix to a
linked list of entries. It is used when a custom prefix extractor is specified; keys are
hashed based on a common prefix, providing constant-time insert and search. This
memtable is efficient for workloads with many updates to the same prefix and
eliminates the overhead of skip lists.
Users select a memtable implementation through Options::memtable_factory. The choice
influences memory usage, CPU cache efficiency, and write latency. Regardless of the type,
memtables support concurrent writes by multiple threads. A mutex protects the
memtable’s internal data structures, but RocksDB groups writes into batches to reduce the
number of mutex acquisitions. When multiple threads call Write(), the first thread becomes
the leader and collects writes into a group (forming a WriteBatch). The leader writes the
batch to the WAL and applies it to the memtable, while other threads wait. This reduces
context switches and system call overhead.
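
The memtable implementation is chosen per column family through the factory option; a
hedged configuration sketch (the bucket count and prefix length are arbitrary examples):

    #include <rocksdb/options.h>
    #include <rocksdb/memtablerep.h>
    #include <rocksdb/slice_transform.h>

    using namespace rocksdb;

    Options options;

    // Default: skip-list memtable.
    options.memtable_factory.reset(new SkipListFactory());

    // Alternative: hash skip list, which requires a prefix extractor so that
    // keys can be bucketed by their prefix.
    options.prefix_extractor.reset(NewFixedPrefixTransform(8));
    options.memtable_factory.reset(
        NewHashSkipListRepFactory(1000000 /* bucket_count */));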

Immutable Memtables and Flush Process

When the active memtable’s size exceeds the configured threshold (controlled by
write_buffer_size), RocksDB rotates it: the current memtable becomes immutable
(read-only), and a new memtable is allocated for subsequent writes. The immutable
memtable is inserted into a queue of flush candidates. A background thread (managed by
the Env) picks up flush tasks and writes the contents of the immutable memtable to disk as
an SST file in Level 0. The flush process sorts the entries (they are already sorted by key
order), writes them into compressed data blocks, generates index blocks and filter blocks,
and writes the file’s metadata. After flushing, RocksDB records the new file in the
MANIFEST by generating a version edit that adds the new file metadata. The associated
WAL is removed if no other memtables depend on it. The flush operation is performed
concurrently with other DB operations; to avoid write stalls, max_background_flushes or
max_background_jobs can be increased to allocate more threads for flushing (github.com).
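
A small tuning sketch along these lines (the values are illustrative, not recommendations):

    using namespace rocksdb;

    Options options;
    options.write_buffer_size = 64 << 20;   // rotate the memtable at roughly 64 MB
    options.max_write_buffer_number = 4;    // allow several immutable memtables to queue
    options.max_background_jobs = 4;        // threads shared by flushes and compactions

    // A flush can also be requested explicitly, e.g., before taking a backup:
    // db->Flush(FlushOptions());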

An important optimization is write throttling: if the number of Level 0 files grows too large
(beyond level0_stop_writes_trigger), RocksDB may stall writes to prevent the compaction
backlog from growing unbounded. Similarly, if there are too many immutable memtables
waiting to flush, writes may be stalled until flushers catch up. The rate limiter (discussed
later) can also be applied to flush operations to smooth I/O bursts and avoid saturating
disks. Flushing may produce multiple SST files if the memtable is large; the generated files
are named sequentially using the next file number from the DB metadata. Optionally, a
compaction filter can run during flush to decide whether to keep or drop each key (useful
for TTL or application-specific purging). A separate flush scheduler prioritizes flush
operations over compaction jobs to ensure that fresh data is quickly made durable.

Write Batching and Group Commit

RocksDB uses write batching to improve write throughput and reduce disk I/O. When
multiple client threads call DB::Write(), they can be grouped into a single batch by the
thread performing the WAL write. The leader thread collects operations from other waiting
writers, aggregates them into a single WriteBatch, and writes the combined batch to the
WAL. A group commit reduces system call overhead and amortizes the cost of syncing the
log to disk. After the WAL write completes, the leader thread applies the batch to the
memtable. Each writer still receives its own status (success or failure), and sequence
numbers ensure global ordering. Batching can be controlled via
WriteOptions::disableWAL, WriteOptions::sync, and WriteOptions::timeout_hint_us. For
example, setting sync=true forces the WAL to be synced to disk before returning; this is
slower but ensures durability. Meanwhile, disableWAL can be used for pure caching
scenarios where durability is not needed.

An additional optimization is atomic writes across column families. RocksDB allows a
WriteBatch to contain updates to multiple column families. The batch is written to the WAL
and applied to all the referenced memtables atomically. This capability is critical because
column families share the same WAL: a flush of one column family triggers a new log file
that must also be referenced by other families until they flush (github.com). Atomic
multi-family writes ensure consistency across families and simplify application logic.
Internally, column family identifiers are encoded in the WriteBatch records, and memtable
iterators can identify which family an entry belongs to.
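
A sketch of such an atomic multi-family update (the handles come from DB::Open as shown
earlier; the family roles and key names are placeholders):

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    using namespace rocksdb;

    void AtomicMultiFamilyWrite(DB* db, ColumnFamilyHandle* data_cf,
                                ColumnFamilyHandle* index_cf) {
      WriteBatch batch;
      batch.Put(data_cf, "order:1001", "{...}");     // payload elided
      batch.Put(index_cf, "by_user:42:1001", "");    // secondary-index entry
      batch.Delete(data_cf, "order:0999");
      // One WAL append, one contiguous sequence-number range, applied to both
      // memtables atomically.
      db->Write(WriteOptions(), &batch);
    }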

Sequence Numbers and Internal Key Format

Every record inserted into RocksDB is assigned a unique, monotonically increasing
sequence number. Sequence numbers enforce ordering between writes and are used
extensively during reads, compaction, and snapshots. Internally, RocksDB combines the
user key, sequence number, and value type into an internal key. The layout is
[user_key][8-byte trailer], where the trailer packs a 56-bit sequence number and a 1-byte type.
The type indicates whether the record is a value, a deletion, a merge operand, or a single
deletion. Because sequence numbers
decrease when sorting internal keys (high sequence numbers come first), the most recent
value for a key appears earlier in a sorted scan. This property simplifies merging multiple
sources during reads and compactions: the first instance of a user key encountered is the
newest version, and older versions can be dropped if not needed by snapshots. Sequence
numbers also help maintain snapshot isolation: snapshot reads filter out entries whose
sequence number is greater than the snapshot’s sequence number (github.com).
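
The encoding can be illustrated with a small sketch; this is an illustration of the layout, not
the library's internal helper:

    #include <cstdint>
    #include <string>

    enum IllustrativeValueType : uint8_t { kDel = 0x0, kVal = 0x1, kMergeOp = 0x2 };

    // Append an 8-byte trailer packing a 56-bit sequence number and a 1-byte type,
    // encoded as a little-endian fixed64, after the user key.
    std::string MakeInternalKey(const std::string& user_key, uint64_t sequence,
                                IllustrativeValueType type) {
      uint64_t packed = (sequence << 8) | type;
      std::string ikey = user_key;
      for (int i = 0; i < 8; ++i) {
        ikey.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
      }
      return ikey;
    }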

The sequence number is stored in the VersionEdit records in the MANIFEST when new files
are added. When the database opens, RocksDB reads the manifest to find the latest
last_sequence number and sets the internal counter accordingly. The last_sequence is
updated when a write batch is applied; after writing to the WAL and memtable, the new
last_sequence value (last applied sequence) is updated in memory and persisted to the
manifest when a new flush or compaction occurs.

WAL Recycling and Preallocation

To mitigate the cost of frequent log file creation, RocksDB can recycle old WAL files. When
a WAL is no longer needed (all memtables referencing it have been flushed), RocksDB
deletes or recycles it for future use. Recycling avoids repeated file creation and deletion on
storage devices that penalize such operations. Recycling is controlled via
Options::recycle_log_file_num (the number of old log files to keep around for reuse), while
Options::WAL_size_limit_MB bounds how much archived WAL data is retained. Additionally,
RocksDB preallocates space in the WAL file to reduce fragmentation and minimize file system
overhead. Preallocation reserves space on disk, but actual log writes still
append at the current offset. When a log file is recycled, RocksDB writes a sequence
number header to indicate the log’s starting sequence number; during recovery, the system
uses this header to skip ahead to the correct starting point. WAL preallocation and
recycling reduce disk fragmentation and improve throughput on file systems with high cost
for small writes or file creation.

Write Stalls and Backpressure Mechanisms

RocksDB uses write stalls to prevent unbounded growth of memtables and Level 0 files. If
the number of memtables waiting to flush reaches the limit set by max_write_buffer_number,
new writes are stalled until at least one flush completes. Similarly, if the
number of Level 0 SST files exceeds level0_stop_writes_trigger, writes are blocked to allow
compactions to catch up; while the count remains above
level0_slowdown_writes_trigger, writes are allowed but deliberately delayed (throttled by the
write controller). The soft_pending_compaction_bytes_limit and
hard_pending_compaction_bytes_limit options also trigger write throttling if the total bytes
pending compaction exceed the limits. These mechanisms ensure that the database does
not use unbounded disk space or degrade read performance due to excessive compaction
backlog. Users can monitor stall statistics via DB properties and adjust compaction
options (e.g., target_file_size_base, level_compaction_dynamic_level_bytes,
compaction_style) to mitigate stalls.
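
The stall-related knobs mentioned above are ordinary options; a hedged example (the numbers
are arbitrary, not recommendations):

    using namespace rocksdb;

    Options options;
    options.max_write_buffer_number = 4;
    options.level0_slowdown_writes_trigger = 20;   // start delaying writes
    options.level0_stop_writes_trigger = 36;       // stop writes entirely
    options.soft_pending_compaction_bytes_limit = 64ull << 30;
    options.hard_pending_compaction_bytes_limit = 256ull << 30;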

Bypassing the WAL and Non-Durable Writes

RocksDB allows disabling the write-ahead log for scenarios where durability is not critical.
By setting WriteOptions::disableWAL = true, updates are applied directly to the memtable
without logging to the WAL. This mode reduces latency because there is no log I/O, but it
exposes the application to potential data loss if the process crashes before the memtable
flushes to disk. Non-durable writes are sometimes used for caching or when the
application uses another durability mechanism (e.g., replicating writes to a remote
system). Similarly, WriteOptions::sync = false allows the WAL write to return before the
data is flushed to disk; the OS may buffer the log, and a power failure could cause data
loss. Users should carefully evaluate their durability requirements before disabling the WAL
or synchronous syncing. RocksDB also o ers a manual WAL flush API (DB::FlushWAL) to
force the WAL to be persisted at a specific point, for example, after a batch of
non-transactional writes that need to be durable.
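
A short sketch of these durability trade-offs:

    using namespace rocksdb;

    void DurabilityModes(DB* db) {
      WriteOptions no_wal;
      no_wal.disableWAL = true;            // fastest, but lost on a crash before flush
      db->Put(no_wal, "cache:token", "abc");

      WriteOptions buffered;
      buffered.sync = false;               // default: the WAL write may sit in OS buffers
      db->Put(buffered, "k1", "v1");
      db->FlushWAL(true /* sync */);       // now force the buffered WAL to stable storage

      WriteOptions durable;
      durable.sync = true;                 // sync the WAL before Put() returns
      db->Put(durable, "k2", "v2");
    }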

LSM-Tree Storage Levels (SSTables)

Overview of SST Files and Levels

After the memtable flush, data is persisted in the form of Sorted String Table (SST) files.
SST files are immutable, sorted, compressed collections of key-value pairs. Each file stores
a small key range and is assigned to a level in the LSM tree. Level 0 files can overlap
arbitrarily (since they are produced by flush operations at different times). Starting from
Level 1 and below, RocksDB ensures that files have non-overlapping key ranges within the
same level. Each level has a target size, and when the total size of SST files in a level
exceeds the threshold, a compaction is triggered to merge files into the next level. The
last level (e.g., Level 6 in the default configuration) contains the largest amount of data and
is typically not compacted further; data at this level is considered cold. This hierarchical
arrangement helps manage data freshness and frequency of access: newer data resides in
the upper levels (L0/L1) and is accessed more frequently, while older data migrates to the
deeper levels.

An SST file has several components:

1. File metadata: Each file is identified by a unique number and stores metadata such
as smallest and largest keys (user keys), the file’s level, size, and sequence
numbers. This metadata is recorded in the MANIFEST during flush and compaction.

2. Data blocks: The core of the file; each block contains a sequence of key-value
pairs. Blocks are typically 4 KB or 16 KB (configurable via block_size); keys are
delta-encoded relative to the previous key to save space. Data blocks are
compressed depending on the configured compression algorithm (Snappy, LZ4,
ZSTD, etc.).

3. Index block: A small block that maps the last key of each data block to the block’s
offset in the file. The index block helps quickly locate a block containing a target key
by performing binary search on the last keys.
4. Filter block: Optionally, a Bloom filter or other filter structure for each data block.
Filters provide a probabilistic guarantee that a key is not present in the block,
reducing unnecessary disk reads. The filter block can be partitioned to reduce
overhead and can be pinned into the cache for faster lookups (artem.krylysov.com).

5. Metaindex block: A map from block type names to their locations in the file. For
example, the filter block is located via the metaindex. This block simplifies the
addition of new block types without changing the file format.

6. Footer: Contains fixed-size information at the end of the file, including the locations
of the metaindex block and index block, a magic number to identify the file, and the
file version. The footer is read first when opening a file to locate other structures.

The default block-based table format is efficient for disk storage and caching. A second
table format, plain table, optimizes for in-memory or memory-mapped use cases by
storing keys and values sequentially and using a hash index for direct lookups (github.com).
Plain tables forego compression and delta encoding, resulting in faster reads at the cost of
higher space usage; they are recommended for applications where all data resides in
memory.
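
Choosing between the two formats is done through the table factory; a hedged configuration
sketch (block size, key length, and prefix length are examples):

    #include <rocksdb/options.h>
    #include <rocksdb/table.h>
    #include <rocksdb/slice_transform.h>

    using namespace rocksdb;

    Options options;

    // Default: block-based table with 16 KB data blocks.
    BlockBasedTableOptions block_opts;
    block_opts.block_size = 16 * 1024;
    options.table_factory.reset(NewBlockBasedTableFactory(block_opts));

    // Alternative: plain table for memory-resident data; requires a prefix extractor
    // and is typically used together with mmap reads.
    PlainTableOptions plain_opts;
    plain_opts.user_key_len = 16;        // fixed key length; 0 means variable length
    plain_opts.hash_table_ratio = 0.75;
    plain_opts.index_sparseness = 16;
    options.prefix_extractor.reset(NewFixedPrefixTransform(8));
    options.allow_mmap_reads = true;
    options.table_factory.reset(NewPlainTableFactory(plain_opts));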

Manifest and VersionSet

The MANIFEST is a crucial file that records changes to the LSM tree structure. RocksDB
cannot update SST files in place because they are immutable; instead, it writes small logs
called VersionEdit entries to the MANIFEST whenever a new SST file is created or deleted.
Each VersionEdit describes modifications such as adding a file to a level, deleting a file,
setting the next file number, updating the last sequence number, or changing database
options. VersionEdit entries also include optional fields for column family IDs and
comparator name to maintain compatibility across database versions (github.com). The
manifest is essentially a write-ahead log for the metadata of the database: it ensures that
operations altering the set of files are durably recorded before they take effect.

During startup, RocksDB reads the MANIFEST log sequentially and reconstructs the latest
VersionSet—an in-memory representation of the LSM tree. The VersionSet contains a list
of Versions, each representing a consistent snapshot of the database state at a particular
point in time. Each Version points to a set of files at each level along with their key ranges.
The current version (also called super-version) is used for reads. When compaction
produces new files and deletes old ones, a new version is created by applying the
VersionEdits. Versions form a linked list so that old versions can be accessed by iterators
until all outstanding reads referencing them finish. To avoid unbounded growth of the
MANIFEST, RocksDB periodically writes a manifest snapshot that includes the full
database metadata in a new MANIFEST file, truncating old logs. This process is triggered
when the current MANIFEST exceeds a configurable size threshold
(Options::max_manifest_file_size). The CURRENT file, a small
file containing the name of the latest manifest, is atomically updated when a new manifest
is created (github.com).

Levels and Target Sizes

RocksDB organizes SST files into levels with exponentially increasing target sizes. The base
level (often Level 1) has a size limit set by Options::max_bytes_for_level_base, and
subsequent levels’ size limits are multiplied by the level multiplier (default 10). For
example, if Level 1 has a target size of 100 MB, Level 2 has 1 GB, Level 3 has 10 GB, and so
on. Level 0 has a separate limit on the number of files
(Options::level0_file_num_compaction_trigger); when the number of L0 files exceeds this
limit, a compaction merges a subset of L0 files into Level 1. Because Level 0 files overlap
arbitrarily, reading a key may require checking every L0 file, resulting in high read
amplification. Therefore, controlling the number of L0 files is important for read
performance. Level sizes and compaction triggers are tunable parameters that balance
write amplification (more compactions) against read amplification (smaller number of files
to search).

An optional configuration is dynamic level bytes
(Options::level_compaction_dynamic_level_bytes), which allows RocksDB to adjust level
target sizes based on the total data size dynamically. When dynamic level bytes are
enabled, RocksDB determines the base level such that the last level’s size is roughly equal
to the total database size. This dynamic strategy better utilizes disk space and reduces the
number of levels for large datasets.
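
Level sizing is driven by a handful of options; an illustrative sketch matching the example
above:

    using namespace rocksdb;

    Options options;
    options.max_bytes_for_level_base = 100ull << 20;      // Level 1 target: ~100 MB
    options.max_bytes_for_level_multiplier = 10;          // L2 ~ 1 GB, L3 ~ 10 GB, ...
    options.level0_file_num_compaction_trigger = 4;       // compact L0 after 4 files
    options.level_compaction_dynamic_level_bytes = true;  // size levels from the bottom up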

File Naming and Lifecycles

SST files are assigned monotonically increasing file numbers. The naming convention uses
numeric file names followed by file extensions: .log for WAL files and .sst for data files
(.ldb is an older SST extension inherited from LevelDB), while manifest files are named
MANIFEST-<number>. Files are created in the database
directory or in multiple data directories specified by Options::db_paths and
Options::db_log_dir. Data directories can be used to store older levels on slower disks; for
example, L0 and L1 can reside on SSDs while L4 and L5 reside on HDDs. RocksDB uses the
next available file number (stored in the manifest) when allocating new SST files. When a
file is obsolete—meaning it is not referenced by any version or not visible to snapshots—
RocksDB schedules it for deletion. File deletion occurs asynchronously in a background
thread to avoid blocking the foreground operations.
New files generated by compactions are also placed in the appropriate data directories
based on their level and the configured db_paths sizes. Each db_paths entry has a
target_size parameter; RocksDB distributes SST files across these directories to balance
space usage. For example, if db_paths is set to [(path1, 1TB), (path2, 500GB)], RocksDB will
store more files in path1 until its allocated space is used up.
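
Tiered placement of this kind is configured through db_paths; a sketch with placeholder
paths and sizes:

    using namespace rocksdb;

    Options options;
    // Fill the fast device first (up to its target size), then spill to the larger one.
    options.db_paths.emplace_back("/mnt/ssd/rocksdb", 1ull << 40);    // ~1 TB
    options.db_paths.emplace_back("/mnt/hdd/rocksdb", 500ull << 30);  // ~500 GB
    options.wal_dir = "/mnt/ssd/rocksdb_wal";  // keep the WAL on the fast device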

Table Properties and Statistics

Each SST file stores table properties and metadata that RocksDB uses to optimize
compaction and read operations. Table properties include statistics such as number of
entries, number of data blocks, raw key bytes, raw value bytes, user collected properties,
file creation time, compression name, and filter policy. These properties are stored in the
file and accessible via TablePropertiesCollector. For example, the compact range API uses
table properties to estimate how many entries and bytes will be written during compaction.
Table properties can also record custom information (user collected properties) provided
by a user-implemented collector; for instance, an application can record the number of
unique user IDs per file.

RocksDB also provides a table cache and table reader that handle open file handles and
reference counted file objects. Opening an SST file can be expensive, so RocksDB caches
table readers in the table cache (backed by LRU or Clock cache). The table cache ensures
that frequently accessed files remain open and accessible. The table cache index uses
the file number as a key and stores pointers to table readers. When reading from a file,
RocksDB obtains the table reader from the cache; if not present, it loads the file from disk,
caches it, and returns the reader. Closing a table reader returns it to the cache; if the file is
no longer referenced by any iterators or by the table cache, it is closed and resources are
freed.

Prefix Extractors and Partitioned Filters

RocksDB supports prefix extractors to optimize reads for workloads where keys share
common prefixes. A prefix extractor is a function that maps a user key to a short prefix (e.g.,
extracting the first 8 bytes of a 16-byte key). When a prefix extractor is specified, RocksDB
builds a prefix bloom filter that checks whether any keys with the prefix exist in a file. The
filter is stored in the filter block and can be partitioned to reduce memory usage.
Partitioned filters divide the filter block into partitions corresponding to groups of data
blocks; this design reduces the cost of loading entire filter blocks into cache and allows
pinning only the top-level filter partitions. Prefix filtering helps skip files quickly when
scanning or seeking by a prefix.
When using the plain table format, a prefix extractor is mandatory because the table’s
indexing relies on hashing prefixes. The PlainTableFactory accepts options like
user_key_len, hash_table_ratio, and index_sparseness to configure how the table
indexes keys. Because plain tables store keys sequentially, prefix extractors drastically
improve search performance.
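
Enabling prefix-based optimizations amounts to installing a prefix extractor and, optionally,
a memtable prefix bloom; a hedged sketch (the 8-byte prefix and the ratio are examples):

    #include <rocksdb/options.h>
    #include <rocksdb/slice_transform.h>

    using namespace rocksdb;

    Options options;
    options.prefix_extractor.reset(NewFixedPrefixTransform(8));  // first 8 bytes of each key
    options.memtable_prefix_bloom_size_ratio = 0.1;              // optional memtable prefix bloom

    // At read time, a prefix scan can be constrained to the extracted prefix:
    // ReadOptions ro;
    // ro.prefix_same_as_start = true;
    // Iterator* it = db->NewIterator(ro);
    // for (it->Seek("userid42"); it->Valid(); it->Next()) { /* same-prefix keys */ }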

Level 0: Special Considerations

Level 0 plays a unique role in the LSM hierarchy because it contains newly flushed files.
Level 0 files can overlap arbitrarily, so reads must check each file to find a key. This can be
expensive if there are many L0 files. To mitigate the impact, RocksDB uses L0 compaction.
When the number of L0 files exceeds level0_file_num_compaction_trigger, RocksDB picks
a subset of them (based on size and key ranges) and merges them with overlapping files in
Level 1. The merging is performed by a compaction worker thread. The result is new L1 files
with non-overlapping key ranges. L0 compactions may produce multiple files if the key
space is large. Because L0 files contain recent data, compaction filters should be careful
not to drop keys that may still be visible to ongoing snapshots or transactions.

An optional optimization is L0 subcompaction, which divides an L0 to L1 compaction into
multiple subjobs based on key ranges. This is controlled by
Options::max_subcompactions. Each subjob processes a partition of the key space, and
the results are later concatenated. Subcompaction improves multi-core utilization during
compaction but increases CPU and memory consumption. It also results in more SST files,
which may slightly increase read amplification.

Importance of Bloom Filters and Meta Block Caches

Bloom filters dramatically reduce disk I/O for negative lookups. Each SST file can include a
Bloom filter that allows RocksDB to quickly determine whether a key is absent from that
file. The filter uses a series of hash functions to set bits in a bit array; at query time, if any bit
is not set, the key is definitively not present. Bloom filters can be configured via
Options::filter_policy, and custom filter policies can be implemented. Filter blocks can be
stored in the block cache for fast access. To reduce the overhead of storing large filters,
RocksDB supports full filters (one filter for the entire file) and partitioned filters (filters
partitioned by block). Partitioned filters reduce memory usage and can be pinned
selectively. High-priority caching ensures that index and filter blocks remain in the cache
even under heavy memory pressure (github.com).

Bloom filter false positive rates are tuned via the bits-per-key setting of the filter policy (for
the block-based table) or via bloom_bits_per_key and hash_table_ratio (for the plain table). A
higher number of bits per key reduces false
positives but increases filter size. The false positive rate typically ranges from 0.2% to 1%,
depending on configuration.
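
A typical filter configuration for the block-based table, shown as a hedged sketch (10 bits
per key corresponds to roughly a 1% false positive rate):

    #include <rocksdb/filter_policy.h>
    #include <rocksdb/table.h>

    using namespace rocksdb;

    Options options;
    BlockBasedTableOptions table_opts;
    table_opts.filter_policy.reset(NewBloomFilterPolicy(10));   // ~1% false positives
    table_opts.partition_filters = true;                        // partitioned filters
    table_opts.index_type = BlockBasedTableOptions::kTwoLevelIndexSearch;
    table_opts.cache_index_and_filter_blocks = true;
    table_opts.pin_top_level_index_and_filter = true;
    options.table_factory.reset(NewBlockBasedTableFactory(table_opts));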

Leveraged I/O Patterns

Reading from an SST file generally follows this pattern: first, the index block is read from the
cache or disk; binary search identifies the data block that may contain the key. Then, the
data block is read. If a Bloom filter is present and indicates the key is absent, RocksDB
skips reading the data block. To reduce the number of random I/O operations, RocksDB
performs readahead (controlled by Options::advise_random_on_open and
Options::compaction_readahead_size), which prefetches adjacent blocks. Readahead is
particularly useful during compaction, where sequential reading of multiple blocks occurs.
RocksDB also optimizes sequential scans by using iterators that maintain the current
position in each file. The internal iterator merges results from memtables and SST files,
skipping ahead when keys are not found due to Bloom filter checks. Additional
optimization, key range pruning, uses smallest and largest keys recorded in file metadata
to skip entire files that do not overlap the query range.

Flush & Compaction Engine

Purpose of Flush and Compaction

RocksDB’s flush and compaction engine ensures durability, space efficiency, and read
performance. Flushing transforms an in-memory structure (immutable memtable) into a
persistent SST file. Compaction merges and rewrites SST files to reduce the number of
files, discard obsolete versions, and maintain sorted order. Without compaction, the LSM
tree would accumulate stale data and many overlapping files, leading to high read
amplification and wasted storage. Compaction controls the trade-off between write
amplification (more rewriting) and read amplification (more files to check). RocksDB
implements several compaction styles to adapt to different workloads. Each style defines
when and how files are merged. Compaction is executed by background threads
(compaction workers) managed by the Env’s low priority thread pool. High priority threads
are reserved for memtable flushes and critical tasks (github.com).

Compaction Styles: Leveled, Universal, and FIFO

Leveled Compaction (default)

In leveled compaction (inspired by LevelDB), data is organized into multiple levels with
exponentially increasing target sizes. All files in Level 0 can overlap; files in Level 1 and
below have non-overlapping key ranges. The compaction strategy attempts to maintain a
limited number of files per level. When the total size of files in a level exceeds its target,
RocksDB selects a candidate file and identifies overlapping files in the next level to merge.
For example, if Level 1 exceeds its size limit, RocksDB picks the largest file from L1 and
merges it with overlapping files in L2. The resulting data is written into new files in L2. Old
files are deleted after the compaction finishes. Leveled compaction provides good read
performance because it bounds the number of files to search and maintains sorted order.
However, it may incur higher write amplification due to frequent rewriting of data across
levels. Parameters like level0_file_num_compaction_trigger, target_file_size_base,
target_file_size_multiplier, and max_compaction_bytes tune the compaction behavior.
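
A hedged example of leveled-compaction tuning (the values are illustrative, not
recommendations):

    using namespace rocksdb;

    Options options;
    options.compaction_style = kCompactionStyleLevel;        // the default
    options.level0_file_num_compaction_trigger = 4;          // start L0 -> L1 after 4 files
    options.target_file_size_base = 64ull << 20;             // ~64 MB output files at L1
    options.target_file_size_multiplier = 1;                 // same file size at deeper levels
    options.max_compaction_bytes = 25 * (64ull << 20);       // cap the size of one compaction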

Universal Compaction

Universal compaction groups SST files without dividing them into fixed levels. Instead, files
are merged based on size, time, or number of files. Universal compaction is beneficial for
workloads with heavy writes and deletes because it minimizes write amplification.
RocksDB selects several candidate files of similar size and merges them into a single larger
file. The merging is usually less frequent than leveled compaction but produces larger files
that may overlap widely. Universal compaction bounds the number of sorted runs rather
than the number of levels: a compaction is triggered when the count of sorted runs exceeds
level0_file_num_compaction_trigger, and run selection is governed by
options.compaction_options_universal (e.g., size_ratio, min_merge_width, and
max_size_amplification_percent). It
is particularly effective for time-series data, log processing, and streaming workloads
where data is appended in order and reads are mostly sequential. Universal compaction is
enabled by setting options.compaction_style = kCompactionStyleUniversal.
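
A hedged configuration sketch for universal compaction:

    using namespace rocksdb;

    Options options;
    options.compaction_style = kCompactionStyleUniversal;
    options.compaction_options_universal.size_ratio = 1;        // merge runs of similar size
    options.compaction_options_universal.min_merge_width = 2;
    options.compaction_options_universal.max_size_amplification_percent = 200;  // bound space amp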

FIFO Compaction

FIFO compaction is designed for data with a time-to-live (TTL) or when older data can be
discarded entirely. In FIFO mode, SST files are placed in a single level (like Level 0), and
compaction simply deletes files in first-in first-out order based on a TTL or file count limit.
This mode sacrifices read efficiency for simplified management; there is no sorting or
merging of files, and writes are never rewritten. FIFO compaction reduces write
amplification to one (no rewriting) but results in high read amplification because many files
may need to be scanned. It is suitable for logs, caches, or streaming data where only the
latest data is relevant. Users set options.compaction_style = kCompactionStyleFIFO and
specify the TTL or file count limit accordingly.
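
A minimal FIFO setup might look like the following sketch; note that the TTL option has
moved between releases, and options.ttl is assumed here:

    using namespace rocksdb;

    Options options;
    options.compaction_style = kCompactionStyleFIFO;
    options.compaction_options_fifo.max_table_files_size = 10ull << 30;  // keep at most ~10 GB
    options.ttl = 7 * 24 * 60 * 60;  // drop SST files older than ~7 days (newer releases)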

Subcompaction and Parallelization

When a compaction job is triggered, RocksDB divides the work into subcompactions to
utilize multiple cores. A subcompaction is a partition of the key space within a compaction
job. RocksDB determines the number of subcompactions using
Options::max_subcompactions. Each subcompaction is assigned its own thread and
processes a contiguous range of keys. The data to be compacted is partitioned based on
the smallest and largest keys in the overlapping files; then, data blocks from these files are
assigned to subjobs. Each subjob merges the input blocks into new SST files. After all
subjobs complete, the resulting SST files are combined to form the final output.
Subcompaction reduces compaction latency and increases throughput on multi-core
systems. However, it increases memory usage (each subcompaction holds buffers and
sorted output) and may increase the number of output files, slightly increasing read
amplification.

Compaction Filters and Garbage Collection

Compaction filters provide a mechanism to drop or modify key-value pairs during
compaction. Users implement the CompactionFilter interface, which defines a Filter()
method receiving a key, value, and an indicator of whether the key is currently overwritten
or deleted. The filter returns whether the entry should be kept, modified, or dropped. A
typical use case is to implement TTL: keys older than a certain timestamp can be removed
during compaction. Another use case is to remove tombstone entries (deletions) that are
no longer needed. Compaction filters run at compaction time and do not slow down writes
or reads, but they can increase compaction CPU usage. RocksDB provides
CompactionFilterFactory to create compaction filters per compaction job. It is essential to
ensure that compaction filters do not drop keys that are still visible to active snapshots or
transactions.
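
A TTL-style filter could be sketched as follows; the 8-byte expiration prefix in the value is
an application-level convention assumed for the example, not something RocksDB defines:

    #include <rocksdb/compaction_filter.h>
    #include <cstdint>
    #include <cstring>
    #include <ctime>
    #include <string>

    class ExpiryFilter : public rocksdb::CompactionFilter {
     public:
      bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
                  const rocksdb::Slice& value, std::string* /*new_value*/,
                  bool* /*value_changed*/) const override {
        if (value.size() < 8) return false;              // keep malformed entries
        uint64_t expires_at = 0;
        std::memcpy(&expires_at, value.data(), sizeof(expires_at));
        // Returning true drops the entry during compaction.
        return expires_at < static_cast<uint64_t>(std::time(nullptr));
      }
      const char* Name() const override { return "ExpiryFilter"; }
    };

    // Installation (the filter object must outlive the DB):
    // options.compaction_filter = new ExpiryFilter();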

Compaction Pipelining and Rate Limiting

Compaction throughput can saturate disk bandwidth, starving other operations. RocksDB
uses a rate limiter to throttle compaction and flush writes, ensuring that they do not
exceed a configured throughput (github.com). The rate limiter uses a token bucket algorithm
and supports multiple priority queues. Each compaction worker requests tokens before
writing to disk; if tokens are unavailable, the thread sleeps until tokens are refilled. The
refill_period_us parameter controls how frequently tokens are added; shorter periods
reduce bursts but increase CPU overhead. The fairness parameter determines how the
limiter serves high and low priority requests. Dynamic adjustment
(RateLimiter::SetBytesPerSecond()) allows tuning the rate on the fly based on workload.
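
A rate limiter is attached through the options; a hedged sketch (the 64 MB/s budget is
arbitrary):

    #include <rocksdb/options.h>
    #include <rocksdb/rate_limiter.h>

    using namespace rocksdb;

    Options options;
    options.rate_limiter.reset(NewGenericRateLimiter(
        64ull << 20,   // rate_bytes_per_sec: 64 MB/s shared by flush and compaction writes
        100 * 1000,    // refill_period_us: add tokens every 100 ms
        10));          // fairness between high- and low-priority requests

    // The budget can be changed at runtime:
    // options.rate_limiter->SetBytesPerSecond(128ull << 20);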

Atomicity and Failure Recovery of Compactions

Compactions involve rewriting data from multiple SST files into new files; failures during
compaction could leave the database in an inconsistent state. RocksDB ensures atomicity
and durability of compactions through the following process: (1) New output files are
written to temporary file numbers and recorded in the pending outputs list. (2) After writing
completes, the output files are marked as live and a VersionEdit is generated to record their
existence and to delete input files. (3) The VersionEdit is appended to the MANIFEST and
synced. (4) The new files are installed into the current version, and the old files are deleted.
If a crash occurs before installation, pending outputs may remain on disk; on recovery,
RocksDB scans the file list and deletes files that are not referenced by any version or
manifest entry. This ensures that incomplete compactions do not corrupt the database.
Additionally, compactions are run in separate threads with robust error handling; any I/O
error during compaction propagates to the database state, causing new writes to stall until
the error is addressed.

Optimizing Compaction for Specific Workloads

Tuning compaction options is critical for workload performance. For write-heavy workloads
where latency is less important, universal compaction may yield lower write amplification.
For read-heavy workloads requiring low latency, leveled compaction may be preferred.
Adjusting Options::max_background_jobs, Options::max_compaction_bytes, and
Options::target_file_size_base influences how aggressively RocksDB compacts data.
Increasing target_file_size_base results in fewer, larger SST files that reduce read
amplification but increase compaction write costs. Setting
Options::max_bytes_for_level_base controls when Level 1 compactions occur; larger
values delay compactions but reduce the number of L0 files, improving write throughput.
For time-series or TTL data, FIFO compaction eliminates rewriting, but read cost increases.

Handling Compaction and Flush Errors

Errors during flush or compaction are captured via the Env and passed to the DB layer.
When a flush or compaction fails (e.g., due to disk failure or full disk), RocksDB updates a
global error state. Subsequent writes fail with the same error until the error is cleared or
resolved. Users can register an EventListener to receive background error notifications and
can programmatically decide whether to resume (DB::Resume()) or close the database. Some errors, like Stall
errors due to disk overload, may be transient; others, like No Space or IO Error, may
require manual intervention.
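
A hedged sketch of such a listener; the logging is illustrative, and whether clearing an error
is safe depends entirely on the application:

    #include <rocksdb/listener.h>
    #include <cstdio>

    class BackgroundErrorLogger : public rocksdb::EventListener {
     public:
      void OnBackgroundError(rocksdb::BackgroundErrorReason /*reason*/,
                             rocksdb::Status* bg_error) override {
        std::fprintf(stderr, "rocksdb background error: %s\n",
                     bg_error->ToString().c_str());
        // Leaving *bg_error untouched keeps the DB in its error state; the
        // application can later call db->Resume() once the cause is fixed.
      }
    };

    // options.listeners.push_back(std::make_shared<BackgroundErrorLogger>());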

Read Path & Query Processing

High-Level Overview

RocksDB’s read path must efficiently locate keys across multiple in-memory and on-disk
structures: the current memtable, immutable memtables, Level 0 files, and files at deeper
levels. Queries may be point lookups, range scans, or prefix queries. Because the LSM tree
accumulates multiple versions of keys at different sequence numbers, the read path must
filter out obsolete entries and present the correct value for the requested key. RocksDB
achieves this through a combination of iterators, bloom filters, index blocks, and merge
operators. The read path is optimized for low latency by using caching (block cache),
memory mapping, and background compactions to reduce the number of files to search.

Lookups in Memtable and Immutable Memtables

When a key is requested (via DB::Get()), RocksDB first searches the active memtable.
Because the memtable stores internal keys (user key + sequence + type) in sorted order,
the search finds the most recent version of the key quickly. If the key is found with a value
type of value, the value is returned. If the key is found with a deletion or single deletion,
the result is “not found.” If the key is not present in the active memtable, RocksDB searches
immutable memtables in the order of recency. Searching memtables is typically faster than
reading from disk because data resides in memory and skip lists enable efficient search.

Merging iterators across files and levels

If the key is not found in any memtable, RocksDB searches the L0 files. Because L0 files
can overlap, every L0 file must be checked. For each L0 file, RocksDB uses the index block
to identify the relevant block; if a Bloom filter is present, it checks the filter first to avoid
reading the block if the key is absent. After scanning L0 files, RocksDB moves to the next
levels. For Level 1 and below, files have non-overlapping key ranges; RocksDB can perform
a binary search on the list of files to find the file whose range includes the key. Only one file
per level needs to be searched.

The core of RocksDB’s read path is the iterator abstraction. An iterator provides a cursor
over the keys in the database; multiple iterators (one for each level and memtable) are
combined into a single MergingIterator. The MergingIterator maintains a min-heap (a
priority queue) of iterators, sorted by the smallest current key. The heap ensures that the
next key returned by the merging iterator is the smallest among all inputs. When advancing
the iterator (e.g., iterator->Next()), RocksDB pops the smallest key from the heap, returns it,
and then advances the underlying iterator. If multiple iterators contain the same user key
(different sequence numbers), the merging iterator returns the record with the highest
sequence number (the most recent value) and skips older versions. This skip logic ensures
that obsolete entries and tombstones are not exposed. The merging iterator respects
snapshots by comparing sequence numbers: it hides entries whose sequence numbers
are greater than the snapshot’s sequence number. If a Merge operator is defined for the
column family, the merging iterator merges successive merge operands to compute the
final value.

Bloom Filters and Cache Lookups


Bloom filters play a critical role in the read path. When performing a point lookup, RocksDB
consults the filter block before reading the data block from disk. If the filter indicates that
the key is not present, the search on that file can be skipped entirely. This saves disk I/O
and reduces latency. Bloom filters are stored in the block cache; they may be pinned to
remain in memory. Partitioned filters allow RocksDB to avoid loading the entire filter when
performing lookups—only the relevant partition is accessed. Bloom filters use multiple
hash functions (2 or more) to map each key to bits in the bit array; if any bit is zero, the key
is not present. For plain table, a hashing index supports direct lookups.

The block cache stores uncompressed data blocks and index/filter blocks. When reading a
block, RocksDB first checks if it is present in the cache. If found, the block is used directly;
otherwise, it is read from disk and inserted into the cache. The block cache is sharded to
reduce lock contention and uses LRU or Clock eviction. Each block in the cache has an
associated priority: normal data blocks, metadata blocks (index and filter), and pinned
blocks. The high_pri_pool_ratio option reserves a fraction of the cache for high-priority
blocks to avoid eviction under heavy load. Pinning L0 filter and index blocks
(pin_l0_filter_and_index_blocks_in_cache) ensures that the most frequently accessed files’
metadata remain in the cache (github.com).

RocksDB also provides a second cache for compressed blocks. When a compressed
block is requested, RocksDB checks the compressed cache before reading from disk. If the
compressed block is found, it is decompressed and inserted into the uncompressed
cache. The compressed cache is beneficial when using heavy compression algorithms
(e.g., ZSTD or LZ4) because decompressing from memory is faster than reading from disk.
The user can configure block_cache and block_cache_compressed separately and assign
sizes based on workload.

Tailing Iterators and Real-Time Reads

A tailing iterator allows reading appended data in real time without closing and reopening
the iterator. Tailing iterators are used for log streams, messaging systems, or real-time
analytics where data is appended continuously. A tailing iterator starts at a specified key
and continues to return new entries as they are committed to the database (after flush and
compaction). Internally, a tailing iterator tracks new memtables and new files in the version
set; when new data arrives, the iterator updates its internal list of iterators to include the
new files. The tailing iterator provides consistent ordering across memtables and SST files,
preserving sequence numbers. It also supports Next() and Seek() operations. Because
tailing iterators must update their state when new files are flushed or compacted, they use
a pinned super-version to ensure that the underlying version is not freed while the iterator
is in use.
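
Tailing iteration is enabled through ReadOptions; a brief sketch:

    #include <rocksdb/db.h>
    #include <memory>

    using namespace rocksdb;

    void TailStream(DB* db) {
      ReadOptions ro;
      ro.tailing = true;   // the iterator also observes data written after it was created
      std::unique_ptr<Iterator> it(db->NewIterator(ro));
      for (it->SeekToFirst(); it->Valid(); it->Next()) {
        // consume it->key() / it->value()
      }
      // Later, Seek()/Next() on the same iterator picks up newly flushed or
      // compacted data without reopening it.
    }
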
Key Range Queries and Prefix Scans

Range queries in RocksDB require scanning across multiple files and levels. When iterating
over a range of keys, the merging iterator must handle overlapping files and ensure that
keys are returned in sorted order. To optimize range queries, RocksDB uses the following
techniques:

1. Range pruning: When iterating over a range, RocksDB uses the metadata in each
file (smallest and largest key) to skip files outside the range. It performs binary
search among files in each level to find the starting file and stops scanning when the
file’s largest key is greater than the range end (see the iterator sketch after this list).

2. Prefix extractors: For prefix scans, RocksDB uses prefix filters to skip files that do
not contain the prefix. The filter blocks can quickly determine if a file has keys with
the prefix. This reduces the number of files to search.

3. Compaction filter for range reads: For large range reads, RocksDB may trigger a
compaction to collocate data within a smaller key range into fewer files. This is more
of a manual tuning and not automatic.

4. Iterator pinning and small buffer reading: RocksDB supports pinned iterators,
which keep blocks in the cache pinned for the duration of iteration, avoiding
repeated lookups in the cache.
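
A bounded range scan, shown as a hedged sketch (the key names are placeholders):

    #include <rocksdb/db.h>
    #include <memory>

    using namespace rocksdb;

    void ScanRange(DB* db) {
      Slice upper("user:1000");          // exclusive upper bound; must outlive the iterator
      ReadOptions ro;
      ro.iterate_upper_bound = &upper;   // lets RocksDB prune files and stop early

      std::unique_ptr<Iterator> it(db->NewIterator(ro));
      for (it->Seek("user:0000"); it->Valid(); it->Next()) {
        // keys arrive in sorted order, merged across memtables and all SST levels
      }
    }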

Merge Operator Handling on Reads

RocksDB’s merge operator allows storing merge operands instead of applying the merge
operation immediately. When reading a key with pending merge operands, RocksDB must
apply the merge operator on the fly to compute the final value. The merge operator’s
implementation defines how to combine the existing value and operands. Merge operands
can accumulate across multiple levels and memtables. The read path performs a lazy
merge: when a key is read, RocksDB iterates through all merge operands (stored as merge
records) in order of sequence number, applies the user-defined merge function, and
returns the final value. If a base value is encountered (a real value record), the merge is
applied to that base; if no base value exists, the merge is applied to an empty initial value.
This allows deferring the cost of merging until needed. Additionally, during compaction,
RocksDB may apply the merge operator proactively to reduce the number of merge
operands in lower levels (artem.krylysov.com).
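
As an illustration, a counter-style merge operator could be sketched as below; the 8-byte
little-endian value encoding is an assumption of the example, not a RocksDB requirement:

    #include <rocksdb/merge_operator.h>
    #include <cstdint>
    #include <cstring>
    #include <string>

    class UInt64AddOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        uint64_t base = 0, operand = 0;
        if (existing_value != nullptr && existing_value->size() == sizeof(base)) {
          std::memcpy(&base, existing_value->data(), sizeof(base));
        }
        if (value.size() == sizeof(operand)) {
          std::memcpy(&operand, value.data(), sizeof(operand));
        }
        uint64_t sum = base + operand;
        new_value->assign(reinterpret_cast<const char*>(&sum), sizeof(sum));
        return true;  // the merge itself always succeeds in this sketch
      }
      const char* Name() const override { return "UInt64AddOperator"; }
    };

    // options.merge_operator = std::make_shared<UInt64AddOperator>();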

Single Delete and Delete Semantics

In addition to full deletes, RocksDB supports single delete, which indicates that a key will
be deleted exactly once. Single delete improves performance by allowing compaction to
discard the deleted key without searching for older versions. However, single delete must
not be used if there may be multiple deletes for the same key; otherwise, the data may be
lost. The read path treats single delete similarly to normal delete: if it encounters a single
delete record, it returns “not found.” Compaction may optimize single deletes by dropping
them when encountering a base value or a merge operand.

Read Amplification and Optimization Strategies

Read amplification refers to the number of SST files or data blocks that must be read to
satisfy a query. In LSM trees, read amplification can be high because data is scattered
across multiple levels. RocksDB mitigates read amplification through several strategies:
leveled compaction, bloom filters, compaction filters to remove tombstones, caching,
prefix extractors, and tailored table formats. Through these techniques, RocksDB provides
low latency and high throughput for a wide variety of workloads (artem.krylysov.com,
github.com).

Block / Table / Row Caching

Unified Block Cache Architecture

RocksDB uses a unified block cache to store uncompressed data blocks and metadata
blocks. The block cache is shared among all column families and across different DB
instances (if configured). It is typically implemented as an LRU cache or a Clock cache.
Each cache entry corresponds to a block from an SST file: data blocks, index blocks, filter
blocks, or other user-defined blocks. The block cache is divided into shards (e.g., 32 shards
by default) to reduce lock contention; each shard maintains its own LRU list and hash
table. The total cache size is configured via BlockBasedTableOptions::block_cache (a
shared Cache instance). Sharding ensures that concurrent cache accesses scale across
CPU cores and that eviction decisions are localized to each shard, improving
concurrency (github.com).

Within the block cache, each block is assigned a priority: data blocks (normal priority),
metadata blocks (index and filter), and pinned blocks. High priority blocks are less likely to
be evicted. Users can reserve a portion of the cache for high priority blocks via
BlockBasedTableOptions::high_pri_pool_ratio. Without such reservation, frequent reads of
data blocks could evict metadata blocks, leading to expensive index reads from disk.
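
A configuration sketch in C++ (the 512 MB capacity, shard count, and 20% high-priority ratio are illustrative assumptions) showing a shared LRU block cache with a reserved high-priority pool and pinned metadata:

    #include "rocksdb/cache.h"
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    int main() {
      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_cache = rocksdb::NewLRUCache(
          512 * 1024 * 1024 /* capacity */, 6 /* num_shard_bits (64 shards) */,
          false /* strict_capacity_limit */, 0.2 /* high_pri_pool_ratio */);
      // Charge index and filter blocks to the cache and give them high priority
      // so frequent data-block reads cannot evict them wholesale.
      table_options.cache_index_and_filter_blocks = true;
      table_options.cache_index_and_filter_blocks_with_high_priority = true;
      table_options.pin_l0_filter_and_index_blocks_in_cache = true;

      rocksdb::Options options;
      options.create_if_missing = true;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/cache_demo_db", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }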

LRUCache vs. ClockCache

RocksDB provides two primary implementations of the cache interface: LRUCache and
ClockCache. LRUCache tracks the order of access precisely and evicts the least recently
used block when the cache is full. ClockCache is an approximation of LRU using a circular
buffer (the clock) in which a reference bit tracks usage. ClockCache has lower
overhead for insertion and eviction because it does not maintain a doubly linked list.
However, it can be less accurate in identifying the least recently used block. Both caches
support sharding to reduce lock contention. The default block cache is an LRU cache of
32 MB when not explicitly set (github.com).

Uncompressed and Compressed Block Caches

In addition to the uncompressed block cache, RocksDB can maintain a compressed block
cache. When a data block is requested, RocksDB first checks the uncompressed cache. If
the block is not present, it checks the compressed cache; if found, RocksDB
decompresses the block and inserts it into the uncompressed cache for future use. The
compressed cache is beneficial when using heavy compression algorithms (e.g., ZSTD or
LZ4) because decompressing from memory is faster than reading from disk. The
compressed cache is configured via BlockBasedTableOptions::block_cache_compressed.
Users can set separate sizes for the uncompressed and compressed caches.

Pinning Index and Filter Blocks

RocksDB allows pinning index and filter blocks to the block cache to prevent them from
being evicted. Because index and filter blocks are small but critical for lookups, pinning
them improves read performance under memory pressure. Options such as
BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache ensure that the index
and filter blocks of Level 0 files are pinned. This is especially important because Level 0
files contain fresh data and are accessed frequently. The option
pin_top_level_index_and_filter pins the top-level index of partitioned indexes and filters,
keeping the partition lookup structure resident while individual partitions remain evictable.

High Priority Pool and Eviction Policy

The block cache implements a high priority pool to protect certain blocks from eviction.
The ratio is set by BlockBasedTableOptions::high_pri_pool_ratio. When the cache is full and
eviction is necessary, the system first tries to evict normal priority blocks; if none are
available, it evicts high priority blocks, but only up to their pool limit. This design prevents
an entire cache from being filled by normal data blocks while starving metadata blocks.

Persistent Cache

RocksDB optionally supports a persistent cache stored on fast storage, such as NVMe or a
dedicated SSD, to augment the block cache. A persistent cache stores compressed blocks
or uncompressed blocks across DB restarts. When the block cache evicts a block, it can
write it to the persistent cache. On a cache miss, RocksDB checks the persistent cache
before reading from disk. This mechanism reduces warm-up time after a restart and allows
caching large databases. The persistent cache uses a log-structured design (similar to the
WAL) to append new blocks and periodically cleans up old data. The
PersistentCacheOptions allow specifying the path, size, and block persistence policy.

Read Request Lifecycle with Caching

During a point lookup, RocksDB’s read path interacts with caches as follows:

1. Check the row cache; if the key is found, return the value.

2. Search the memtable and immutable memtables; if found, return.

3. Check the bloom filter of each relevant SST file; if the filter says the key is absent, skip the file.

4. For each file that may contain the key, check the block cache for the index block; if it is not present, read it from disk.

5. Use the index block to locate the data block; check the block cache for the data block; if it is not present, read it from disk.

6. Decompress the block if needed and insert the decompressed block into the block cache.

7. Search within the block for the key; if found, return the value; otherwise continue with the next candidate file.

8. Insert the (key, value) pair into the row cache for future lookups.

This layered approach ensures that frequently accessed blocks are served from memory, while
the cost of decompressing blocks is amortized across multiple reads.

Cache Statistics and Instrumentation

The block cache and row cache expose detailed statistics accessible via the Statistics
object. Counters track hits, misses, prefetching hits, and eviction events. Histograms
record block sizes, time spent on cache lookups, and hit ratios. Users can query these
statistics at runtime or dump them to logs. Monitoring cache hit ratios is essential for
tuning cache sizes; a low hit ratio may indicate insufficient cache capacity, while a high hit
ratio suggests that the working set fits in cache.

Column-Family & Version Manager

Column Families: Isolation and Configuration

Column families in RocksDB allow isolating different key spaces within the same database
and applying independent options to each. Each key-value pair belongs to one column
family. Column families share the same write-ahead log (WAL) and manifest but have
separate memtables, flush queues, and SST files (github.com). This architecture enables
different tuning for each family: one can have a large memtable for heavy writes and
another with a small memtable for heavy reads. Column families can be created or
dropped dynamically, and write batches can modify multiple families atomically. Because
column families share the WAL, a flush in one family switches to a new WAL file, but an old
WAL can only be deleted once every other family whose data it contains has also flushed.
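
A C++ sketch of opening a database with two additional families tuned differently (the family names, path, and buffer sizes are illustrative assumptions):

    #include <vector>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options db_options;
      db_options.create_if_missing = true;
      db_options.create_missing_column_families = true;

      rocksdb::ColumnFamilyOptions write_heavy;
      write_heavy.write_buffer_size = 256 * 1024 * 1024;  // large memtable for heavy writes

      rocksdb::ColumnFamilyOptions read_heavy;
      read_heavy.write_buffer_size = 16 * 1024 * 1024;    // small memtable for a read-mostly family

      std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
          {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
          {"events", write_heavy},
          {"profiles", read_heavy}};
      std::vector<rocksdb::ColumnFamilyHandle*> handles;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s =
          rocksdb::DB::Open(db_options, "/tmp/cf_demo_db", cf_descs, &handles, &db);
      if (!s.ok()) return 1;

      // Each family has its own memtables and SST files, but all share one WAL.
      db->Put(rocksdb::WriteOptions(), handles[1], "evt:1", "login");
      db->Put(rocksdb::WriteOptions(), handles[2], "user:1", "alice");

      for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
      delete db;
      return 0;
    }
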
Options and Inheritance

Each column family is associated with a set of options (ColumnFamilyOptions) that can
override global DBOptions. Options include compaction style, compression, table factory,
write buffer size, memtable factory, prefix extractor, filter policy, and merge operator. When
a column family is created, its options are saved in the manifest. Some options cannot be
changed dynamically (e.g., comparator, merge operator) because they define the on-disk
format. Others can be updated live via SetOptions() and SetDBOptions().

Version Set and Super Versions

A Version represents a snapshot of the database state: the set of SST files in each level for
each column family. When a flush or compaction finishes, RocksDB creates a new version
by applying VersionEdits to the previous version. The new version becomes the
super-version, which includes pointers to the current version of SST files, memtables, and
immutable memtables. Versions are reference counted; iterators hold references to the
version they use. When a new super version is installed, old versions remain accessible
until no iterators reference them. The super version also contains mutable options (e.g.,
write buffer size) that can change dynamically (github.com).

Snapshots and Timestamped Snapshots

Snapshots capture a consistent view of the database at a particular sequence number.
When a snapshot is taken (GetSnapshot()), RocksDB records the current sequence number
and adds it to a snapshot list. Reads associated with the snapshot filter out entries with
sequence numbers greater than the snapshot’s sequence number (github.com). Snapshots
can be timestamped; timestamped snapshots map timestamps to sequence numbers so
that multiple DB instances can take snapshots at the same wall clock time. When the
oldest snapshot is released (ReleaseSnapshot()), RocksDB may drop obsolete data during
compaction.
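
A minimal C++ sketch of the snapshot API (the path and key are illustrative assumptions):

    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::DB* db = nullptr;
      if (!rocksdb::DB::Open(options, "/tmp/snapshot_demo_db", &db).ok()) return 1;

      db->Put(rocksdb::WriteOptions(), "counter", "1");
      const rocksdb::Snapshot* snap = db->GetSnapshot();  // pins the current sequence number
      db->Put(rocksdb::WriteOptions(), "counter", "2");   // gets a higher sequence number

      rocksdb::ReadOptions ro;
      ro.snapshot = snap;
      std::string old_value, new_value;
      db->Get(ro, "counter", &old_value);                      // sees "1"
      db->Get(rocksdb::ReadOptions(), "counter", &new_value);  // sees "2"

      db->ReleaseSnapshot(snap);  // lets compaction drop versions only the snapshot kept alive
      delete db;
      return 0;
    }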

Column Family Drop and Metadata Cleanup

Dropping a column family removes it from the database. RocksDB marks the family as
dropped in the manifest and schedules its files for deletion when they are no longer
referenced by snapshots or iterators. Because families share the WAL, logs can only be
deleted after all families referencing them have flushed. Dropping a family does not block
writes to other families.

Compaction and Column Family Interaction

Each column family has independent compaction triggers and options. A compaction job
may involve files from a single family. The manifest records which files are added or deleted
per family. Scheduling fairness across families is not guaranteed: families with more data
may receive more compaction resources. Advanced setups can use separate thread pools
per family via Env::SetBackgroundThreads() (github.com).

Dynamic Option Tuning and Live Reload

Certain options can be changed at runtime using SetOptions() and SetDBOptions().
Changing these options triggers the creation of a new super version. Options that can be
modified dynamically include write buffer size, target file size base, compression per level,
and dynamic level bytes. Hard options affecting the data format (e.g., comparator, table
factory) cannot be changed live.
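
A sketch of live tuning (the option values are illustrative assumptions; db and handle are an open rocksdb::DB* and a rocksdb::ColumnFamilyHandle*):

    #include "rocksdb/db.h"

    void TuneLive(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* handle) {
      // Column-family-scoped mutable options; applying them installs a new super version.
      db->SetOptions(handle, {{"write_buffer_size", "134217728"},       // 128 MB memtable
                              {"target_file_size_base", "67108864"}});  // 64 MB SST target

      // DB-wide mutable options.
      db->SetDBOptions({{"max_background_jobs", "8"}});
    }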

Version Edit Encoding and Compatibility

VersionEdit uses a tag-length-value encoding to record metadata changes. Each field has a
tag (integer) followed by its length and value. Fields include comparator name, log number,
next file number, last sequence, added files, deleted files, and column family ID. Optional
fields enable forward compatibility; unknown tags are skipped when reading. Backward
compatibility is maintained by including only fields known to the current version.
Snapshots include a full list of column families and files to reconstruct the state when the
manifest grows too large (github.com).

Super Version and Thread Safety

The super version contains pointers to the current version of the SST files and other state; it
is protected by a read-write lock. Readers acquire a read lock to access the super version;
writers (flush or compaction threads) acquire the write lock to install a new version. When
a column family is dropped or new options are applied, the super version is updated
atomically. Versions are reference counted; when the reference count drops to zero, the
version is deleted and its files may be removed.

Transaction & Concurrency Control

Introduction to Transactions in RocksDB

RocksDB supports both pessimistic and optimistic transactions to provide ACID
semantics and isolation across concurrent operations. Transactions enable multiple
operations to be executed atomically, with isolation from other threads, and the ability to
roll back in case of an error (github.com).

Pessimistic Transactions

In pessimistic transactions, RocksDB acquires locks on keys before they are modified.
Locks prevent other transactions from modifying the same keys until the current
transaction commits. The lock table is partitioned into stripes to reduce contention; each
stripe is a hash table keyed by user keys. Write operations under a pessimistic transaction
are applied to the memtable, and the WAL is updated. Reads can be performed within the
transaction and see uncommitted writes. Snapshots can be set to provide consistent
reads. Locks are held until commit; if a lock cannot be acquired within a timeout, the
operation fails.
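
A C++ sketch of a pessimistic transaction (the path, keys, and values are illustrative assumptions):

    #include <string>
    #include "rocksdb/options.h"
    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* txn_db = nullptr;
      if (!rocksdb::TransactionDB::Open(options, txn_db_options,
                                        "/tmp/txn_demo_db", &txn_db).ok()) return 1;

      rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
      txn->Put("account:alice", "90");   // acquires a lock on the key
      txn->Put("account:bob", "110");    // second lock, held until commit or rollback

      std::string value;
      txn->Get(rocksdb::ReadOptions(), "account:alice", &value);  // sees its own uncommitted write
      // txn->GetForUpdate(...) would additionally lock a key that is only read.

      rocksdb::Status s = txn->Commit();  // releases the locks; txn->Rollback() on failure
      delete txn;
      delete txn_db;
      return s.ok() ? 0 : 1;
    }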

Optimistic Transactions

Optimistic transactions assume that conflicts are rare and do not acquire locks during
writes. Instead, they detect conflicts at commit time. Writes are buffered in a private write
batch; at commit, the transaction checks if any keys have been modified since the
transaction began. If a conflict is detected, the commit fails and the transaction is
retried (github.com). Optimistic transactions are suitable for high-concurrency workloads
where conflicts are uncommon.

Two-Phase Locking and Isolation Levels

Pessimistic transactions implement two-phase locking: locks are acquired during the
growing phase and released during the shrinking phase. RocksDB also supports snapshot
isolation by recording a snapshot in the transaction. Isolation level and commit semantics
can be tuned via TransactionDBOptions::write_policy (WRITE_COMMITTED,
WRITE_PREPARED, WRITE_UNPREPARED). Write prepared transactions split commit into
prepare and commit phases, supporting distributed two-phase commit.

Merge Operator in Transactions

Merge operations accumulate merge operands in the memtable without reading the
existing value. During a transaction, merge operations respect locks for pessimistic
transactions and record operands for optimistic transactions. At commit or read time, the
merge operator is applied to compute the final value. The merge function must be
deterministic and associative, so that combining operands in different groupings yields
consistent results (github.com).

Snapshot Isolation and Write Conflicts

Transactions use sequence numbers and snapshots to provide isolation. A snapshot
captures the sequence number at the time of the snapshot; keys with larger sequence
numbers are invisible to the transaction. At commit, the DB checks whether any keys
modified by the transaction have been written by other transactions with higher sequence
numbers. If so, a conflict occurs and the transaction aborts.

Save Points and Rollback


Transactions support save points, allowing partial rollbacks. Setting a save point records
the current write batch size; rolling back discards writes made after the save point and
releases corresponding locks. Save points facilitate complex transaction logic where some
operations may fail and others may succeed.
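
A sketch of the save point calls (keys are illustrative; txn is assumed to be an open rocksdb::Transaction*):

    #include "rocksdb/utilities/transaction.h"

    void PartialRollback(rocksdb::Transaction* txn) {
      txn->Put("order:100", "pending");
      txn->SetSavePoint();                 // remembers the current batch size and locks
      txn->Put("order:100:audit", "tmp");
      txn->RollbackToSavePoint();          // discards the audit write, keeps the first Put
      txn->Commit();
    }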

Transaction Metadata and Logging

Transaction metadata (begin marker, record markers, commit marker) is recorded in the
WAL for recovery. In case of a crash, RocksDB replays the WAL and applies only committed
transactions; uncommitted transactions are rolled back. Write prepared transactions
record a prepare marker and a commit marker to support distributed commit.

Deadlock Detection and Timeouts

RocksDB implements deadlock detection by tracking lock acquisition order and using
timeouts. If a transaction cannot acquire a lock within the timeout, it fails. Applications can
abort and retry the transaction to break deadlocks. Optimistic transactions avoid
deadlocks because they do not acquire locks.

Transaction Write Bu er and Concurrency Control

Write operations in a transaction are buffered in a WriteBatch. For pessimistic
transactions, the batch is applied to the memtable under the protection of locks. For
optimistic transactions, the batch is private until commit. The size of the batch can be
limited to prevent excessive memory usage.

Integration with Column Families and Snapshots

Transactions can span multiple column families. Operations on different families are
ordered by the transaction’s sequence number. Locks and conflict detection apply per key
per family. Snapshots used in transactions apply across families; reading via the snapshot
returns the state of all families at the snapshot’s sequence number. If a family is dropped
during a transaction, the transaction is aborted.

Background Thread Pools & Rate Limiter

Thread Pools in RocksDB

RocksDB uses thread pools to execute background tasks such as flushes, compactions,
file deletions, and error handling. The Env manages multiple thread pools with different
priorities: a high priority pool for flushes and low priority pools for compactions and
deletions. The number of threads in these pools is configured via max_background_jobs
and SetBackgroundThreads(). Separating flush and compaction threads prevents
compaction jobs from blocking flushes (github.com).
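
A configuration sketch (the thread counts are illustrative assumptions):

    #include "rocksdb/env.h"
    #include "rocksdb/options.h"

    void ConfigureBackgroundThreads(rocksdb::Options& options) {
      options.max_background_jobs = 8;  // overall budget shared by flushes and compactions

      // The Env owns the thread pools: HIGH serves flushes, LOW serves compactions.
      rocksdb::Env* env = rocksdb::Env::Default();
      env->SetBackgroundThreads(2, rocksdb::Env::Priority::HIGH);
      env->SetBackgroundThreads(6, rocksdb::Env::Priority::LOW);
      options.env = env;
    }
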
Scheduling and Job Types

Tasks scheduled on background threads include memtable flush (FlushJob), compaction
(CompactionJob), deletion of obsolete files, and error handling. Flush jobs are high priority
because they free memtable memory; compaction jobs are low priority because they can
run in the background. The scheduler controls the number of concurrent flushes and
compactions based on available threads.

Rate Limiter and Write Throttling

The rate limiter controls the aggregate write throughput of flushes and compactions. It uses
a token bucket algorithm with refill periods and fairness to schedule writes across multiple
priority classes (github.com). If compactions saturate disk bandwidth, the rate limiter
reduces their throughput to leave room for user reads and flushes. Users can adjust the
rate dynamically to adapt to workload changes.
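
A sketch of attaching a rate limiter (the 100 MB/s budget and other parameters are illustrative assumptions):

    #include "rocksdb/options.h"
    #include "rocksdb/rate_limiter.h"

    void ConfigureRateLimiter(rocksdb::Options& options) {
      options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
          100 * 1024 * 1024 /* bytes per second */,
          100 * 1000        /* refill period in microseconds */,
          10                /* fairness between low- and high-priority requests */));
      // Later, at runtime: options.rate_limiter->SetBytesPerSecond(50 * 1024 * 1024);
    }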

I/O Types and Priorities

The Env defines different I/O classes (IO_USER, IO_HIGH, IO_MID, IO_LOW). User I/O
(reads and writes) usually bypasses the rate limiter or uses a high priority. Flushes use high
priority; compactions use mid or low priority. The rate limiter serves requests based on
priority and fairness. High priority requests get tokens first; low priority requests may wait
longer.

Dynamic Adjustment and Self-Adapting Throttle

RocksDB allows adjusting the rate limiter at runtime via SetBytesPerSecond(). The write
controller monitors compaction backlog (compaction debt) and introduces delays on
writes if the backlog grows. The combination of rate limiter and write controller ensures
smooth throughput and prevents stalls.

Background Error Handling

Errors in background jobs are propagated to the DB’s error state. When an error occurs,
subsequent writes return an error until the issue is resolved. The error handler categorizes
errors as transient or permanent and can retry operations or close the DB.

Environment Abstraction and Thread Pools

Thread pool behavior depends on the Env implementation. The default POSIX Env uses
POSIX threads; other Envs may map threads to system thread pools or asynchronous
frameworks. Schedule() and Run() interface methods abstract the scheduling of tasks on
threads. Custom Envs may integrate with event loops or specialized schedulers.
Environment & File-System Abstraction (Env)

Overview of the Env Interface

The Env class provides an abstraction layer between RocksDB and the underlying operating
system or file system. It defines methods to open and read files, write data, create
directories, and manage threads. By using Env, RocksDB decouples its core logic from
specific I/O implementations. The default Env uses POSIX system calls; other Envs
integrate with HDFS, S3, or memory file systems. The Env manages thread pools,
scheduling flush and compaction tasks, and interacting with the rate limiter. Advanced
features like direct I/O and checksums are configured via Env options (github.com).

POSIX Env and Direct I/O

In the default POSIX Env, file operations map to standard UNIX calls. RocksDB uses pread
and pwrite for random I/O, fsync to persist data, and ftruncate to set file sizes. Direct I/O is
enabled via flags like O_DIRECT on Linux, allowing RocksDB to bypass the OS page cache.
Direct I/O reduces double buffering and can improve latency by letting RocksDB manage
caching explicitly. However, it requires aligned reads and writes and cannot be used with
memory mapping. Direct I/O is typically enabled for SST files but not for the WAL or
manifest because the OS cache is needed for ordering and durability.

Checksum Verification and Data Integrity

RocksDB stores checksums for each data and index block. When reading a block, it verifies
the checksum to detect corruption. The default checksum is CRC32C; xxHash is also
supported. Checksums detect bit flips, partial writes, and other forms of corruption. WAL
and manifest entries are also protected by CRC32 checksums. If a corruption is detected,
RocksDB returns an error and avoids using the corrupted data. Custom checksum
generators can be implemented via the FileChecksumGenFactory interface.

Env Options: Rate Limiting and Thread Scheduling

The Env exposes parameters for controlling I/O rate and thread scheduling. The rate limiter
(described previously) throttles writes for flush and compaction jobs. The Env also allows
scheduling tasks with different priorities using Schedule(). Custom Envs may map these
priorities to different thread pools or OS I/O priorities. Functions like NowMicros(),
SleepForMicroseconds(), and GetRandomSeed() provide time, sleep, and randomness
abstractions.

HDFS and S3 Envs


RocksDB provides Env implementations for Hadoop Distributed File System (HDFS) and
Amazon S3. The HDFS Env wraps libhdfs to open and write files on HDFS, adapting
RocksDB’s random and sequential read interfaces to HDFS’s block I/O model. The S3 Env
uses AWS SDK to read and write objects in S3 buckets. Because S3 is eventually
consistent, the Env may implement local caching and metadata. These Envs adapt to
higher latency and limited concurrency, tuning the rate limiter and thread pools
accordingly.

Pluggable Table & Block Formats

Block-Based Table Format

The block-based table is the default on-disk format for RocksDB. It stores compressed data
blocks, index blocks, filter blocks, and metaindex blocks. The format is highly configurable:
block size, restart interval, compression type, filter policy, and index type (single level, two
level) can be tuned. Partitioned filters and indexes allow scaling to large files by dividing the
index and filter into partitions. Pinned blocks ensure that Level 0’s metadata stays in the
cache. These options balance memory usage, read latency, and file size (github.com).
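
A tuning sketch (the block size and bits-per-key values are illustrative assumptions) combining bloom filters with partitioned index and filter blocks:

    #include "rocksdb/filter_policy.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    void ConfigureBlockBasedTable(rocksdb::Options& options) {
      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_size = 16 * 1024;  // larger blocks shrink the index
      table_options.filter_policy.reset(
          rocksdb::NewBloomFilterPolicy(10 /* bits per key */));
      table_options.index_type =
          rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;  // partitioned index
      table_options.partition_filters = true;                     // partition filters to match
      table_options.pin_top_level_index_and_filter = true;

      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
    }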

Plain Table Format

The plain table format is optimized for in-memory workloads. It stores keys and values
sequentially with no compression or delta encoding. A hash index maps prefixes to offsets
for direct lookups. Plain tables require a prefix extractor because the hash index groups
keys by prefix. The file size is limited (2^31−1 bytes) because offsets are stored as 32-bit
integers. Plain tables support forward iteration but not non-prefix seeks. They are
recommended when the dataset fits entirely in memory or when a memory-mapped file is
used for low latency (github.com).

Custom Table Factories and Block Formats

RocksDB exposes a TableFactory interface for custom table formats. Developers can
implement custom storage layouts by providing a builder (for writing) and reader (for
reading). The name of the factory is stored in SST metadata; RocksDB verifies that the same
factory is used when reopening the file. Examples include the FIFO table for TTL data and
specialized columnar formats for analytics. Block formats (within block-based tables) can
also be customized by implementing new compression codecs, filter policies, or index
types. These plugins extend RocksDB’s flexibility and allow adapting to new hardware or
data models.

Compression & Checksum Codecs

Compression Algorithms and Per-Level Configuration


RocksDB supports multiple compression algorithms, configurable globally or per level.
Snappy, LZ4, LZ4HC, Zlib, ZSTD, and Bzip2 are built-in; Brotli and custom codecs can be
added via plugins. LZ4 is often used for upper levels due to its low CPU overhead; ZSTD or
Zlib are used for bottom levels to maximize space savings. Compression can be set per
level via compression_per_level or for the bottommost level via bottommost_compression.
The choice of compression balances CPU usage and disk space (github.com).
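
A per-level configuration sketch (the seven-level layout and algorithm choices are illustrative assumptions):

    #include "rocksdb/options.h"

    void ConfigureCompression(rocksdb::Options& options) {
      options.num_levels = 7;
      options.compression_per_level = {
          rocksdb::kNoCompression,   // L0: flushed often, keep CPU cost low
          rocksdb::kNoCompression,   // L1
          rocksdb::kLZ4Compression,  // L2
          rocksdb::kLZ4Compression,  // L3
          rocksdb::kLZ4Compression,  // L4
          rocksdb::kLZ4Compression,  // L5
          rocksdb::kZSTD};           // L6: bottommost, most data, best ratio
      // Alternatively, override only the last level:
      options.bottommost_compression = rocksdb::kZSTD;
    }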

Compression Dictionaries and Training

Compression dictionaries improve compression ratio by capturing patterns across blocks.
A dictionary is trained on sample data and passed to the compressor. RocksDB supports
global dictionaries via CompressionOptions::dict and per-level dictionaries via advanced
options. Dictionaries are especially effective for structured data with repeated patterns.

Checksum Algorithms and Options

Each block includes a checksum trailer; the algorithm can be CRC32C or xxHash. CRC32C
is the default and provides good error detection; xxHash is faster but has slightly weaker
detection. BlockBasedTableOptions::checksum sets the checksum type. Checksums can
be disabled (not recommended) to avoid overhead. The manifest and WAL always use
CRC32.

Comparator & Merge-Operator Framework

Custom Comparators

The comparator defines the order of user keys. The default comparator compares keys
lexicographically as byte arrays. Custom comparators can order keys by composite fields
(e.g., primary key and timestamp). Comparators must be consistent with equality and
provide methods to shorten separators for index compression. The comparator name is
stored in metadata; RocksDB fails to open a DB if the comparator name differs from the
one used when writing the data (github.com).
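
A sketch of a custom comparator (the class name and ordering are illustrative; RocksDB also ships a built-in reverse bytewise comparator):

    #include <string>
    #include "rocksdb/comparator.h"
    #include "rocksdb/options.h"
    #include "rocksdb/slice.h"

    // Orders keys in reverse lexicographic byte order. The Name() string is
    // persisted in metadata, so it must stay stable across releases.
    class ReverseBytewiseExample : public rocksdb::Comparator {
     public:
      const char* Name() const override { return "example.ReverseBytewise/1"; }

      int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
        return -a.compare(b);  // invert the default byte-wise order
      }

      // These hooks let RocksDB shorten index keys; a no-op is always correct.
      void FindShortestSeparator(std::string*, const rocksdb::Slice&) const override {}
      void FindShortSuccessor(std::string*) const override {}
    };

    void UseComparator(rocksdb::Options& options) {
      static ReverseBytewiseExample cmp;  // must outlive the DB
      options.comparator = &cmp;
    }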

Merge Operators

Merge operators allow incremental updates to a value without reading it first. The merge
operator defines how to combine an existing value with a new operand. Associative merge
operators assume the merge function is associative; general merge operators support full
merge and partial merge. Merge operands can be accumulated in the memtable and across
SST levels. A full merge occurs during a read or compaction; RocksDB passes the base
value and a list of operands to the merge operator, which computes the final value. Partial
merge combines operands without a base value; this requires the merge function to be
associative, so that operands combined in different groupings yield consistent results (github.com).
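
A sketch of an associative merge operator implementing a 64-bit add (the encoding and class name are illustrative assumptions; RocksDB also provides ready-made operators):

    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <string>
    #include "rocksdb/merge_operator.h"
    #include "rocksdb/options.h"
    #include "rocksdb/slice.h"

    // Treats values and operands as little-endian uint64 counters;
    // Merge(key, encode(5)) adds 5 without reading the current value first.
    class UInt64AddExample : public rocksdb::AssociativeMergeOperator {
     public:
      const char* Name() const override { return "example.UInt64Add/1"; }

      bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        uint64_t base = 0, operand = 0;
        if (existing_value != nullptr && existing_value->size() == sizeof(base)) {
          std::memcpy(&base, existing_value->data(), sizeof(base));
        }
        if (value.size() == sizeof(operand)) {
          std::memcpy(&operand, value.data(), sizeof(operand));
        }
        const uint64_t sum = base + operand;
        new_value->assign(reinterpret_cast<const char*>(&sum), sizeof(sum));
        return true;  // returning false reports the operands as corrupted
      }
    };

    void UseMergeOperator(rocksdb::Options& options) {
      options.merge_operator = std::make_shared<UInt64AddExample>();
    }
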
Statistics & Instrumentation

Types of Statistics

RocksDB collects counters (tickers), histograms, properties, perf contexts, and I/O stats
contexts. These metrics track events like cache hits, compaction drops, flush times, and
operation latencies. Histograms provide percentiles. Properties report current values such
as number of SST files and memtable size. Perf context and I/O stats context record
per-operation metrics, including time spent in different phases and bytes read or
written (github.com).

Configuring and Accessing Statistics

To enable statistics, the application must create a Statistics object via CreateDBStatistics()
and assign it to DBOptions::statistics. The stats level determines the granularity; higher
levels include histograms and timers but increase overhead. Counters can be read via
getTickerCount(), histograms via histogramData(), and properties via GetProperty(). Perf
context and I/O stats context are enabled per thread and cleared at the start of each
operation.
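
A sketch of enabling and reading statistics (the path, ticker choices, and property name are illustrative but standard):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"
    #include "rocksdb/statistics.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.statistics = rocksdb::CreateDBStatistics();

      rocksdb::DB* db = nullptr;
      if (!rocksdb::DB::Open(options, "/tmp/stats_demo_db", &db).ok()) return 1;

      std::string value;
      db->Get(rocksdb::ReadOptions(), "missing-key", &value);  // generates some cache activity

      uint64_t hits = options.statistics->getTickerCount(rocksdb::BLOCK_CACHE_HIT);
      uint64_t misses = options.statistics->getTickerCount(rocksdb::BLOCK_CACHE_MISS);
      std::cout << "block cache hits=" << hits << " misses=" << misses << "\n";

      std::string l0_files;
      db->GetProperty("rocksdb.num-files-at-level0", &l0_files);
      std::cout << "L0 files: " << l0_files << "\n";

      delete db;
      return 0;
    }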

Event Tracing and Logging

RocksDB can produce detailed trace files capturing events like flushes, compactions, and
writes. Tracing is enabled via StartTrace() and outputs events in JSON format for analysis.
The info log records high-level events and is rotated based on size or time. External
monitoring can be integrated via StatsPublisher, which exports metrics to systems like
Prometheus.

Configuration & Live-Options Tuning

DBOptions and ColumnFamilyOptions

RocksDB offers a myriad of options to tune performance. Global options (DBOptions)
include max_background_jobs, wal_dir, max_total_wal_size, and the rate limiter. Column family
options include memtable size, compaction style, compression, table factory, and prefix
extractor. Options can be set when opening the DB or changed dynamically if allowed.
Hard options affecting the on-disk format (comparator, merge operator, table factory) must
remain constant.

Dynamic Tuning and Auto-Adaptation

Certain options can be tuned at runtime using SetOptions() and SetDBOptions(). Dynamic
tuning allows adjusting memtable sizes, compaction parameters, and compression
settings based on workload. Auto-tuning mechanisms adjust the write rate when
compaction debt grows or reduce parallelism when the system is saturated.

Tuning for Specific Workloads

Write-heavy workloads benefit from larger memtables, fewer compactions, and lighter
compression. Read-heavy workloads benefit from leveled compaction, Bloom filters, and
large block caches. Mixed workloads require balancing these settings. Compression per
level and compaction style can be tuned to fit the workload.

Monitoring and Verification

Monitoring metrics (e.g., stall count, compaction throughput, cache hit ratio) guides tuning
decisions. RocksDB verifies options on open to ensure compatibility with existing data.
Changing incompatible options (like comparator) results in an error.

Utility & Maintenance Tools

ldb Command-Line Tool

ldb is a versatile CLI utility for interacting with RocksDB databases and SST files. It supports
operations such as get, put, batchput, scan, ingest, repair, destroy, and compact. The tool
allows specifying column families, key/value encoding, and snapshots. It can open a DB in
secondary mode for read-only operations without interfering with the primary instance. ldb
can also dump keys and values from SST files directly, verify checksums, recompress files,
and generate repair manifests (github.com).

sst_dump Tool

The sst_dump tool inspects SST files. It can print key/value pairs, verify checksums,
recompress files, and merge multiple SSTs. It shows table properties and internal keys,
assisting debugging and migration tasks.

Checkpoints and Online Backup

Checkpoints create a consistent snapshot of the DB by flushing memtables and linking SST
files into a checkpoint directory. The Checkpoint API and BackupEngine library provide
mechanisms for offline and online backups. BackupEngine supports incremental backups
and restoration. Files are hard linked to avoid copying data, and only new files are copied
for subsequent backups.
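
A sketch of the Checkpoint API (the target directory is an illustrative assumption and must not already exist):

    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/utilities/checkpoint.h"

    rocksdb::Status TakeCheckpoint(rocksdb::DB* db, const std::string& dir) {
      rocksdb::Checkpoint* checkpoint = nullptr;
      rocksdb::Status s = rocksdb::Checkpoint::Create(db, &checkpoint);
      if (!s.ok()) return s;
      // SST files are hard-linked into the directory where possible; the WAL and
      // manifest are copied, yielding a consistent snapshot that can be opened as a DB.
      s = checkpoint->CreateCheckpoint(dir);
      delete checkpoint;
      return s;
    }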

Repair, Compact Range, and SST Ingestion


The repair API rebuilds the manifest by scanning existing SST files, used as a last resort
when the manifest is corrupted. The CompactRange() API triggers manual compaction for a
key range or the entire DB. SstFileWriter allows generating SST files externally for bulk
ingestion via IngestExternalFile(), bypassing the memtable and improving ingestion speed.
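
A bulk-ingestion sketch (paths and keys are illustrative assumptions; keys must be added to the writer in sorted order):

    #include <string>
    #include <vector>
    #include "rocksdb/db.h"
    #include "rocksdb/env.h"
    #include "rocksdb/options.h"
    #include "rocksdb/sst_file_writer.h"

    rocksdb::Status BulkLoad(rocksdb::DB* db, const rocksdb::Options& options) {
      rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
      rocksdb::Status s = writer.Open("/tmp/bulk_load.sst");
      if (!s.ok()) return s;
      writer.Put("k0001", "v1");
      writer.Put("k0002", "v2");
      s = writer.Finish();
      if (!s.ok()) return s;

      rocksdb::IngestExternalFileOptions ingest_options;
      ingest_options.move_files = true;  // link the file instead of copying when possible
      return db->IngestExternalFile({"/tmp/bulk_load.sst"}, ingest_options);
    }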

Live Backup, Dump, and Load

Live backup uses BackupableDB to copy files incrementally while the DB is running. Dump
and load utilities convert RocksDB data to and from portable formats for migration. Stress
tests and corruption tests validate reliability under various scenarios.

Testing & Benchmark Harnesses

db_bench Benchmark Tool

db_bench is RocksDB’s primary benchmark suite. It supports workloads like fillseq,
fillrandom, readrandom, readseq, readreverse, readwhilewriting, deleterandom, and
merge. Users specify benchmarks and options such as key size, value size, compression,
number of threads, and rate limiter settings. db_bench prints operations per second,
microseconds per operation, and level stats. It is widely used to evaluate RocksDB
performance under different configurations (github.com).

Persistent Cache Benchmark

persistent_cache_bench measures the benefit of a persistent cache, comparing block
cache hits and misses and evaluating latency improvement. Users can vary cache sizes,
compression types, and access patterns.

Stress Tests and Corruption Tests

The RocksDB repository includes stress tests simulating random operations and high
concurrency, and corruption tests intentionally damaging files to verify error detection.
Developers run these tests via make check or specific test binaries. Passing the tests
ensures durability and correctness under edge cases.

Microbenchmarks and Profiling

Microbenchmarks measure specific functions such as checksum computation, merge
operator performance, and skip list speed. Profiling tools help identify bottlenecks for
optimization.

Conclusion

This report has provided an extensive exploration of RocksDB’s internal architecture.
Starting from the write path, we examined how writes are logged, stored in memtables, and
flushed to immutable SST files. We then looked at the LSM tree organization, version
management, compaction strategies, and read path. Further sections covered caching
subsystems, column families, transactions, concurrency control, background thread
pools, environment abstraction, table formats, compression, comparators, statistics,
configuration tuning, utilities, and testing harnesses. Throughout, citations from official
documentation and code analysis grounded the descriptions. Understanding these
internals enables developers and administrators to tune RocksDB for specific workloads,
implement custom components, and ensure robust operation across diverse deployment
environments.
