fix(optimizer): add mid-flight free-disk watchdog (#4297) #8948
Open
MilosM348 wants to merge 1 commit into
The pre-flight check_segments_size only runs once at the start of an optimization, but the slow phases (segment_builder.update, populate_vector_storages, segment_builder.build) can take many minutes. During that window other parallel optimizations, snapshots, WAL growth, or unrelated processes on the same volume can fill the disk and crash the segment builder on a raw ENOSPC. xhjkl flagged exactly this in the review of PR qdrant#4578 (we still might run into OOD down the line because FS is non-atomic).

Changes
-------

* check_segments_size now returns an OptimizationSpaceEstimate carrying both the space_needed estimate AND the precheck-time available bytes, so the mid-flight watchdog can enforce headroom rather than the full initial estimate (the optimizer itself is expected to consume the estimate by design).
* New recheck_free_space helper aborts the optimization with a canonical No space left on device: error if available space has dropped below max(precheck_available - space_needed, 8 MiB safety floor). Per-IO available_space lookup is injectable via recheck_free_space_with for testability.
* The watchdog is invoked twice in build_new_segment: once after segment_builder.update and once after populate_vector_storages, i.e. before the two slow phases that historically exceed the conservative 2x pre-flight estimate.
* The pre-flight error message is also normalized to lead with No space left on device: so it is logged in the same shape as the WAL/insertion path (DiskUsageWatcher) and matches the assertion in tests/e2e_tests/test_low_disk.py.
* Seven unit tests in disk_watchdog_tests pin the headroom semantics, the OOD message format, the statvfs-failure skip behaviour, and the one-call-per-checkpoint contract on available_space.

The watchdog only triggers when available space drops below the headroom the up-front check accepted, and treats fs4::available_space errors as skip, so neither the optimizer's own writes nor a transient statvfs failure can abort an otherwise healthy optimization.

Refs: qdrant#4297, qdrant#4578

Co-authored-by: Cursor <[email protected]>
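The headroom rule in the commit message can be illustrated with a small Rust sketch. It is not the PR's code: `OptimizationSpaceEstimate` and `recheck_free_space_with` are names taken from the commit message, but the field names, the signature, the `String` error type, and the constant name `SAFETY_FLOOR_BYTES` are assumptions made for illustration. The real helper is described as using `fs4::available_space` and `OperationError::service_error`; here the lookup is a plain closure so the sketch needs no external crates.

```rust
use std::io;

/// 8 MiB safety floor from the commit message (constant name is an assumption).
const SAFETY_FLOOR_BYTES: u64 = 8 * 1024 * 1024;

/// Assumed shape of the pre-flight result: the estimate plus the free bytes
/// observed when the pre-flight check ran.
#[derive(Debug, Clone, Copy)]
struct OptimizationSpaceEstimate {
    space_needed: u64,
    precheck_available: u64,
}

/// Re-check free space with an injected lookup (assumed signature).
/// Returns Err only when free space is below the accepted headroom.
fn recheck_free_space_with<F>(
    estimate: OptimizationSpaceEstimate,
    available_space: F,
) -> Result<(), String>
where
    F: FnOnce() -> io::Result<u64>,
{
    // Headroom = what the pre-flight check expected to remain after the
    // optimizer consumed its own estimate, never lower than the floor.
    let min_free = estimate
        .precheck_available
        .saturating_sub(estimate.space_needed)
        .max(SAFETY_FLOOR_BYTES);

    match available_space() {
        // A failed lookup is treated as "skip", not as out-of-disk.
        Err(_) => Ok(()),
        Ok(free) if free < min_free => Err(format!(
            "No space left on device: {} bytes free, {} bytes of headroom required",
            free, min_free
        )),
        Ok(_) => Ok(()),
    }
}
```

With a pre-flight reading of 1000 MiB free and a 600 MiB estimate, this sketch accepts any later reading of 400 MiB or more, so the optimizer's own writes do not trip it; only a drop below that headroom (or below 8 MiB) produces the error.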
Summary
Adds a mid-flight free-disk watchdog to the optimizer's slow path so that an out-of-disk condition between the pre-flight check and the segment-builder writes fails with a clean `"No space left on device:"` error instead of crashing inside `segment_builder.update` / `populate_vector_storages` / `segment_builder.build`.

This addresses the open part of #4297 - the up-front-only `check_segments_size` (added in #4578) is correct, but it is also fundamentally racy against external disk consumers. xhjkl flagged exactly this in the review of #4578:

> we still might run into OOD down the line because FS is non-atomic

There is no graceful fix for "the disk filled up while we were holding the permit", but we can still detect it before the next slow phase begins and abort the optimization in the same shape as the WAL/insertion path (`DiskUsageWatcher`).

/claim #4297
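A minimal sketch of the "detect it before the next slow phase begins" idea, in Rust. Everything here is hypothetical: `Phase` and `build_new_segment_sketch` are illustrative stand-ins, the phases are reduced to log entries, and the watchdog is any closure returning `Result<(), String>`. The real `build_new_segment` works on a segment builder and returns the project's own error type.

```rust
/// Hypothetical markers for the three slow phases named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Update,
    PopulateVectorStorages,
    Build,
}

/// Runs the phases in order and calls the watchdog synchronously at the two
/// phase boundaries. A failing check aborts before the next phase starts.
fn build_new_segment_sketch<W>(
    mut watchdog: W,
    completed: &mut Vec<Phase>,
) -> Result<(), String>
where
    W: FnMut() -> Result<(), String>,
{
    completed.push(Phase::Update); // stands in for segment_builder.update
    watchdog()?; // checkpoint 1

    completed.push(Phase::PopulateVectorStorages); // stands in for populate_vector_storages
    watchdog()?; // checkpoint 2

    completed.push(Phase::Build); // stands in for segment_builder.build
    Ok(())
}
```

Because the checks are plain function calls between phases, the sketch needs no extra threads or locks; a watchdog that fails at the second checkpoint leaves `Build` unstarted and propagates the error to the caller.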
What changed
* `check_segments_size` now returns the computed `space_needed` estimate so callers can reuse it for mid-flight checks. The existing call site in `execute_optimization` is the only consumer.
* New `recheck_free_space` helper in `lib/shard/src/optimize.rs` takes the same `temp_path` and `space_needed` and aborts the optimization with a canonical `"No space left on device:"` error if available space has dropped below the larger of (estimate, 8 MiB safety floor).
* The watchdog is invoked twice in `build_new_segment`:
  * after `segment_builder.update` (i.e. before the HNSW indexing phase that historically blows past the conservative 2x pre-flight estimate when link tables are large),
  * after `populate_vector_storages` (i.e. immediately before `segment_builder.build`, which is where ENOSPC has historically surfaced as a panic).
* The pre-flight error message is normalized to lead with `"No space left on device:"` so it is logged in the same shape as the `DiskUsageWatcher` path on insertion and matches the assertion in `tests/e2e_tests/test_low_disk.py`:

  ```python
  expected_msg = "No space left on device:"
  assert expected_msg in logs
  ```

  Without this normalization, an OOD that trips the optimizer's own pre-flight check (rather than the WAL writer) would log `"Not enough space available for optimization"` instead, and the e2e assertion would only pass by accident through unrelated WAL log lines.
* Unit tests in `disk_watchdog_tests`:
  * `watchdog_passes_when_disk_has_room`: healthy `tempdir` accepts both `None` and small estimates.
  * `watchdog_fails_when_estimate_exceeds_available`: `u64::MAX` estimate trips the watchdog and the rendered error contains the canonical OOD prefix, the optimizer name, and the temp path (so logs stay diagnostic).
  * `watchdog_uses_max_of_estimate_and_safety_buffer`: pins `OPTIMIZER_DISK_WATCHDOG_BUFFER_BYTES` into the safe range [1 MiB, 64 MiB] so future contributors don't accidentally make the buffer either pointless or false-positive-prone.

What this is not
* Not a replacement for `check_segments_size`. The pre-flight check is still the primary guard.
* Not an async watchdog: the checks run inside `build_new_segment` rather than spinning up a background task - the maintainer's review on #4578 (Fail early when encountering out-of-storage during optimization) was explicit that an async watchdog "is fake anyway" on top of FS that already isn't atomic. Two synchronous checks at the obvious phase boundaries match the existing style and add no new threads/locks.
* Not a guarantee against ENOSPC: if `write(2)` returns ENOSPC inside the segment builder while we're already mid-phase, the existing error path still applies - the watchdog just shrinks the window in which that can happen.

Why "8 MiB safety floor"
When `space_needed` is `None` (estimate failed, e.g. an unreadable segment dir), the watchdog falls back to a small fixed buffer so it isn't completely toothless. 8 MiB matches the smallest reasonable single-vector-storage write that a real optimization will perform and is consistent with `DiskUsageWatcher::min_free_disk_size_mb` defaults elsewhere in the codebase. The unit test pins this constant into a sane range so future contributors don't drift it.

Risks / things to look at in review
* The helpers used (`fs4::available_space`, `bytes_to_human`, `OperationError::service_error`) were already imported by this file.
* `check_segments_size`, `recheck_free_space`, and `build_new_segment` are all crate-private.
* The PR targets `dev` as required by `CONTRIBUTING.md`.

Test plan
* `cargo test -p shard --test disk_watchdog_tests` (added; passes locally on a healthy fs)
* `tests/e2e_tests/test_low_disk.py::TestLowDisk::test_low_disk_handling[indexing]` - requires the docker e2e harness, please run on CI
* Existing tests should be unaffected (the watchdog only triggers at `< required` available space, and existing tests run on hosts with plenty of free disk)

Closes #4297 if accepted.
Co-authored-by: Cursor [email protected]