- Global large-chunk registry: `std::mutex`
- Per-large-chunk regular-chunk map: `std::mutex`
- Per-regular-chunk payload: `std::shared_mutex`
  - shared for `GET`/`EXISTS`/`CHUNKEXISTS`/`CHUNK`/`CHUNKBIN`
  - unique for `SET`/`UNSET`/`CHUNKSET`
Effects:
- concurrent reads on same chunk: allowed
- write vs read on same chunk: serialized
- operations on different chunks: can run concurrently
To avoid deadlocks, locks are always acquired in a fixed order:
1. global large-chunk mutex
2. large-chunk mutex
3. regular-chunk payload mutex
The engine never acquires two regular-chunk payload locks in one operation.
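The lock hierarchy and the shared/unique split can be sketched as follows. This is a minimal illustration, not the engine's code: `Registry`, `LargeChunk`, `RegularChunk`, `find_or_create`, `set_value`, and `get_value` are hypothetical names, and the sketch assumes each map lock is released before the next level is taken, which the text above does not state.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

// Hypothetical types; the engine's real classes are not shown in this document.
struct RegularChunk {
    std::shared_mutex payload_mutex;  // level 3: shared for reads, unique for writes
    std::map<std::string, std::string> payload;
};

struct LargeChunk {
    std::mutex map_mutex;  // level 2: guards the regular-chunk map
    std::map<std::int64_t, std::shared_ptr<RegularChunk>> regular;
};

class Registry {
public:
    // Walks the hierarchy top-down (level 1, then level 2) and never in reverse.
    // Assumption: the registry lock is dropped before the large-chunk lock is taken.
    std::shared_ptr<RegularChunk> find_or_create(std::int64_t large_id,
                                                 std::int64_t regular_id) {
        std::shared_ptr<LargeChunk> large;
        {
            std::lock_guard<std::mutex> guard(registry_mutex_);  // level 1
            auto& slot = large_[large_id];
            if (!slot) slot = std::make_shared<LargeChunk>();
            large = slot;
        }
        std::lock_guard<std::mutex> guard(large->map_mutex);  // level 2
        auto& slot = large->regular[regular_id];
        if (!slot) slot = std::make_shared<RegularChunk>();
        return slot;
    }

private:
    std::mutex registry_mutex_;
    std::map<std::int64_t, std::shared_ptr<LargeChunk>> large_;
};

// SET-style write: unique lock, serialized against readers and other writers.
void set_value(RegularChunk& chunk, const std::string& key, const std::string& value) {
    std::unique_lock<std::shared_mutex> lock(chunk.payload_mutex);
    chunk.payload[key] = value;
}

// GET-style read: shared lock, so concurrent reads on the same chunk are allowed.
std::optional<std::string> get_value(RegularChunk& chunk, const std::string& key) {
    std::shared_lock<std::shared_mutex> lock(chunk.payload_mutex);
    auto it = chunk.payload.find(key);
    if (it == chunk.payload.end()) return std::nullopt;
    return it->second;
}
```

Because each operation in this sketch touches exactly one payload lock, two operations on different chunks never wait on each other at level 3.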
- Accept loop enqueues accepted sockets.
- Fixed worker pool processes connections.
- Connection parsing is buffered (not byte-by-byte recv loops).
This replaces detached thread-per-connection behavior and provides bounded thread growth.
Default model: Single-Writer / Multi-Reader per `data_dir`.
- Writer ownership is coordinated under `data_dir/.chunkdb.lock/`:
  - `writer.lock`: OS file lock for active writer exclusivity.
  - `writer.meta`: metadata heartbeat (`session_id`, `pid`, `heartbeat_ms`, mode).
- A second writer fails fast while `writer.lock` is held.
- Read-only stores (`access_mode=kReadOnly`) do not take writer ownership and can run concurrently with the writer.
- On writer restart/takeover, stale metadata is detected and moved to `writer.meta.stale.<timestamp>` before a new session is published.
- Writer metadata heartbeat is periodically refreshed while the writer process is alive.
Crash behavior:
- `kill -9`/crash releases the OS lock when the process exits.
- Next writer instance can take ownership and publish a new session id.
- Clean shutdown removes active `writer.meta`.
Override (allow_multiple_processes) bypasses this safety model and is unsafe unless external coordination is guaranteed.
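The fail-fast writer exclusivity described above can be illustrated with a non-blocking advisory lock. The document only says "OS file lock"; using `flock` here is an assumption, the sketch is POSIX-only, and `try_acquire_writer_lock` / `release_writer_lock` are hypothetical helper names.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <string>

// Hypothetical helper: returns an open descriptor that holds the exclusive lock,
// or -1 if the file cannot be opened or another holder already owns the lock.
int try_acquire_writer_lock(const std::string& lock_path) {
    int fd = ::open(lock_path.c_str(), O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    // LOCK_NB makes the call return immediately instead of waiting: fail fast.
    if (::flock(fd, LOCK_EX | LOCK_NB) != 0) {
        ::close(fd);
        return -1;
    }
    return fd;  // the lock lasts until close(fd) or until the process exits
}

// Closing the descriptor drops the lock; the kernel does the same on process exit,
// which is why a crashed writer does not leave the lock held.
void release_writer_lock(int fd) {
    if (fd >= 0) ::close(fd);
}
```

A second call against the same path fails while the first descriptor is open and succeeds again once it is closed. Publishing and refreshing the metadata heartbeat would be separate steps that this sketch leaves out.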
- `max_loaded_chunks` limits in-memory chunk cache size.
- LRU-style eviction removes least-recently-used chunks that are not currently referenced.
- Before evicting a chunk, pending WAL batch bytes are flushed to disk.
This prevents unbounded growth in long-running sparse-world workloads while preserving chunk correctness across load/unload cycles.
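The eviction rules above can be sketched as a small LRU cache. `ChunkCache`, `Chunk`, and the `flush_wal` callback are hypothetical names. The sketch is single-threaded and treats a chunk as "referenced" when its `shared_ptr` has owners outside the cache, which is one possible reading of that rule.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct Chunk { std::int64_t id; };  // hypothetical stand-in for a loaded chunk

class ChunkCache {
public:
    ChunkCache(std::size_t max_loaded_chunks, std::function<void(std::int64_t)> flush_wal)
        : max_(max_loaded_chunks), flush_wal_(std::move(flush_wal)) {}

    // Returns the cached chunk, "loading" (here: creating) it on a miss.
    std::shared_ptr<Chunk> get(std::int64_t id) {
        auto found = index_.find(id);
        if (found != index_.end()) {
            lru_.splice(lru_.begin(), lru_, found->second);  // mark most recently used
            return found->second->second;
        }
        auto chunk = std::make_shared<Chunk>(Chunk{id});
        lru_.emplace_front(id, chunk);
        index_[id] = lru_.begin();
        evict_if_needed();  // the new chunk is still referenced by `chunk`, so it is kept
        return chunk;
    }

    std::size_t size() const { return lru_.size(); }
    bool contains(std::int64_t id) const { return index_.count(id) != 0; }

private:
    void evict_if_needed() {
        // Walk from the least recently used end towards the front.
        auto it = lru_.end();
        while (lru_.size() > max_ && it != lru_.begin()) {
            --it;
            if (it->second.use_count() > 1) continue;  // referenced outside the cache
            flush_wal_(it->first);                     // flush pending WAL bytes first
            index_.erase(it->first);
            it = lru_.erase(it);
        }
    }

    using Entry = std::pair<std::int64_t, std::shared_ptr<Chunk>>;
    std::size_t max_;
    std::function<void(std::int64_t)> flush_wal_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::int64_t, std::list<Entry>::iterator> index_;
};
```

If every cached chunk is referenced, this sketch lets the cache exceed the limit temporarily instead of evicting a chunk that is in use.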
- WAL writes do not use `fsync`.
- WAL flush can be batched by `wal_group_commit_updates`.
- Lowest latency, weakest crash/power-loss guarantees.
- Checkpoint image replace is atomic in namespace, but neither temp-file data sync nor directory sync is required.
- WAL is appended and `fsync`ed per acknowledged write.
- On first WAL file creation in this mode, parent directory metadata is also synced.
- Acknowledged writes are durable in WAL after successful `fsync`.
- Checkpoint image replace remains atomic in namespace, but checkpoint file/directory sync is not required by this mode.
- `fsync-wal` semantics plus `fsync` for checkpointed `.chk` + directory updates.
- Strongest current mode.
- Checkpoint sequence:
  1. write temp image in same directory
  2. flush temp file data (`fdatasync`/`fsync`, and `F_FULLFSYNC` attempt on macOS)
  3. close temp file with error check
  4. atomic replace
  5. sync parent directory metadata (best-effort fallback on Windows if directory-handle flush is unsupported by the runtime/filesystem)
- Normal restart recovery:
- WAL replay restores committed on-disk deltas.
- Atomic replace is about namespace visibility (old-or-new target path state), not equivalent to guaranteed post-power-loss durability.
- `relaxed` mode may lose more recent acknowledged writes due to absent `fsync` and optional group commit batching.
- Clean shutdown flushes pending WAL batches before process exit.
- Power-loss semantics still depend on mode and filesystem/device behavior.
- Engine does not provide full ACID transactional semantics across multiple chunks.
Covered crash points in current validation:
- crash/fault after temp-file flush and before replace: old target remains readable; stale temp artifact is cleaned on later load.
- torn/truncated WAL tails: replay stops safely at the invalid tail and preserves earlier valid deltas.
- interrupted writer process (`kill -9`) in durability kill-recovery tests: restart remains writable and recovers valid state.
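The torn-tail behavior listed above can be illustrated with a CRC-checked record format. The layout used here (4-byte length, 4-byte CRC-32 of the payload, then the payload, all in host byte order) is an assumption made for this sketch, not the engine's on-disk format, and `wal_crc32`, `append_record`, and `replay` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), kept small instead of fast.
std::uint32_t wal_crc32(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Appends one record: [length][crc][payload].
void append_record(std::string& wal, const std::string& payload) {
    const std::uint32_t header[2] = {static_cast<std::uint32_t>(payload.size()),
                                     wal_crc32(payload)};
    wal.append(reinterpret_cast<const char*>(header), sizeof(header));
    wal.append(payload);
}

// Replays records in order and stops at the first truncated or corrupt one,
// keeping every valid delta that came before it.
std::vector<std::string> replay(const std::string& wal) {
    constexpr std::size_t kHeaderSize = 2 * sizeof(std::uint32_t);
    std::vector<std::string> deltas;
    std::size_t pos = 0;
    while (wal.size() - pos >= kHeaderSize) {
        std::uint32_t header[2];
        std::memcpy(header, wal.data() + pos, kHeaderSize);
        const std::size_t remaining = wal.size() - pos - kHeaderSize;
        if (header[0] > remaining) break;  // payload cut short: torn tail
        std::string payload = wal.substr(pos + kHeaderSize, header[0]);
        if (wal_crc32(payload) != header[1]) break;  // checksum mismatch: invalid tail
        deltas.push_back(std::move(payload));
        pos += kHeaderSize + header[0];
    }
    return deltas;
}
```

Truncating or flipping a bit in the last record makes `replay` return only the earlier records. A production format would also need to protect the header itself, which this sketch does not do.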
Not yet fully proven:
- arbitrary kernel/storage reorder faults beyond the tested crash points
- silent hardware corruption outside CRC-covered payload/record checks
- exhaustive fault matrices across all filesystems/devices and mount options
- No cross-chunk atomic transactions.
- No replication.
- No consensus or distributed durability.
- No claim of full ACID database guarantees.