A bug-hunt of the WAL shipping / shadow / LTX / sync / restore / DST surface. Each finding lists severity, location, the bug, the fix, and a Status: Fixed (implemented + build green + tested), Partial (safest correct fix landed; remainder deferred with a reason), or Documented (verified real; fix specified). Line numbers are approximate against the reviewed revision; re-locate before editing.
Status: F1–F15 are Fixed. The DST harness (F14) now builds and runs its property/chaos/invariant tests against the current crate API, exercising real storage faults.

`src/ltx.rs:90-100`, `crates/walrust-core/src/ltx.rs:78-99`
- `decode_to_db` indexed `db_data[start..start+page_size]` using a per-page `page_num` read from the (untrusted) LTX, with no `1 <= page_num <= commit` check, and sized the buffer with an unchecked `num_pages * page_size`. A corrupt or crafted LTX panicked (out-of-bounds slice) in the binary path and silently dropped the out-of-range page in the `walrust-core` path (producing a wrong byte image that still "verified").
- Fix: validate `page_size != 0`, use `checked_mul` for the image buffer, and reject any page number outside the valid range with a typed error instead of panicking or dropping.
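
The validation order described in this fix can be sketched as follows. This is a minimal illustration, not the crate's `decode_to_db`: the `DecodeError` variants, the `build_image` name, and the `(page_num, bytes)` input shape are all assumptions made for the example.

```rust
/// Hypothetical typed error; the real crate's error type may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    ZeroPageSize,
    ImageTooLarge,
    PageOutOfRange { page_num: u32, commit: u32 },
    WrongPageLength { page_num: u32 },
}

/// Build a database image of `commit` pages from untrusted (page_num, bytes) pairs.
pub fn build_image(
    page_size: usize,
    commit: u32,
    pages: &[(u32, Vec<u8>)],
) -> Result<Vec<u8>, DecodeError> {
    if page_size == 0 {
        return Err(DecodeError::ZeroPageSize);
    }
    // Size the buffer with overflow detection instead of a bare multiply.
    let len = (commit as usize)
        .checked_mul(page_size)
        .ok_or(DecodeError::ImageTooLarge)?;
    let mut image = vec![0u8; len];
    for (page_num, data) in pages {
        // Page numbers are 1-based and must not exceed the commit size.
        if *page_num == 0 || *page_num > commit {
            return Err(DecodeError::PageOutOfRange { page_num: *page_num, commit });
        }
        if data.len() != page_size {
            return Err(DecodeError::WrongPageLength { page_num: *page_num });
        }
        // Cannot overflow: page_num <= commit and commit * page_size was checked.
        let start = (*page_num as usize - 1) * page_size;
        image[start..start + page_size].copy_from_slice(data);
    }
    Ok(image)
}
```

Returning an error for an out-of-range page, rather than skipping it, is what keeps a crafted file from producing an image that is wrong but still passes later verification.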

`src/sync/restore.rs:181-188`
- The apply loop set `final_txid` and printed "Restored …" then returned `Ok(())` with no check that `final_txid == target_txid`. A missing incremental or an end-of-chain gap produced a restore short of the requested point, reported as success → silent data loss.
- Fix: `ensure!(final_txid == target_txid, …)` after the loop. (Per-file pre/post checksum chaining is already verified in `apply_ltx_to_db`; this closes the "stopped early" case.)
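
A dependency-free sketch of the post-loop check follows. It uses a plain `Result` rather than the `ensure!` macro, and the `restore_to` function, the `RestoreError` type, and the `(min_txid, max_txid)` file representation are hypothetical stand-ins for the real apply loop.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RestoreError {
    ShortOfTarget { reached: u64, target: u64 },
}

/// Apply files (given as sorted, inclusive TXID ranges) up to `target_txid`.
pub fn restore_to(files: &[(u64, u64)], target_txid: u64) -> Result<u64, RestoreError> {
    let mut final_txid = 0u64;
    for &(min_txid, max_txid) in files {
        if max_txid > target_txid {
            break; // past the requested point
        }
        if min_txid != final_txid + 1 {
            break; // gap in the chain: nothing further can be applied
        }
        final_txid = max_txid; // stand-in for actually applying the file
    }
    // The check that was missing: stopping early must not report success.
    if final_txid != target_txid {
        return Err(RestoreError::ShortOfTarget { reached: final_txid, target: target_txid });
    }
    Ok(final_txid)
}
```

With this shape, both a missing middle file and a truncated chain surface as `ShortOfTarget` instead of a successful but incomplete restore.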

`src/sync/replicate.rs:183-191`
- On a gap the loop `continue`d; every later file then also failed the contiguity check, so the replica froze forever while `replicate` returned `Ok`. The in-code comment even noted "for now just warn and continue".
- Fix: a gap is now a hard error that forces a re-bootstrap from the latest snapshot rather than skipping frames.

`crates/walrust-core/src/wal.rs` (and `src/wal.rs`)
- The production frame readers parsed `page_number`/`db_size` but never verified the SQLite WAL cumulative checksum; the commit boundary was "last frame with `db_size > 0`". A torn tail frame whose 24-byte header carried a non-zero `db_size` was accepted as a commit.
- Fix: implemented the SQLite WAL checksum (`wal_checksum`: the s0/s1 Fibonacci-weighted sum, little-/big-endian per the WAL magic `0x377f0682`/`0x377f0683`), plus `validate_header_checksum` and `verify_frame_checksum`. The production reader is now `read_frames_as_page_map_checked`, which seeds the chain from the validated header checksum (or the caller's running chain mid-WAL), verifies each frame, and stops at the first mismatch, so a torn tail frame with a bogus non-zero `db_size` is no longer treated as a commit. The running chain is threaded through `SyncState`/`DbState` so incremental reads keep validating. Validation is skipped only for synthetic WALs with a zero header checksum (never a real SQLite WAL), so existing hand-built test WALs still parse. Golden-vector tests (`test_wal_checksum_golden_vector`) verify the algorithm against hand-computed values; torn-tail tests prove valid frames are accepted and corrupt ones rejected, in both crates.
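
For reference, the checksum recurrence itself is small. The sketch below follows the algorithm as documented for the SQLite WAL format (pairs of 32-bit words, `s0 += x0 + s1; s1 += x1 + s0`, with the byte order selected by the low bit of the magic). The function names and signatures are assumptions for this example and may not match the crate's `wal_checksum`.

```rust
/// Returns Some(true) for big-endian checksum words, Some(false) for
/// little-endian, None if `magic` is not a WAL magic number.
pub fn checksum_is_big_endian(magic: u32) -> Option<bool> {
    match magic {
        0x377f_0682 => Some(false),
        0x377f_0683 => Some(true),
        _ => None,
    }
}

/// Continue the running checksum `seed` over `data`.
/// Returns None if `data` is not a whole number of 8-byte pairs.
pub fn wal_checksum(big_endian: bool, data: &[u8], seed: (u32, u32)) -> Option<(u32, u32)> {
    if data.len() % 8 != 0 {
        return None;
    }
    let (mut s0, mut s1) = seed;
    for chunk in data.chunks_exact(8) {
        let a = [chunk[0], chunk[1], chunk[2], chunk[3]];
        let b = [chunk[4], chunk[5], chunk[6], chunk[7]];
        let (x0, x1) = if big_endian {
            (u32::from_be_bytes(a), u32::from_be_bytes(b))
        } else {
            (u32::from_le_bytes(a), u32::from_le_bytes(b))
        };
        // All arithmetic wraps modulo 2^32.
        s0 = s0.wrapping_add(x0).wrapping_add(s1);
        s1 = s1.wrapping_add(x1).wrapping_add(s0);
    }
    Some((s0, s1))
}
```

Because each call takes the previous `(s0, s1)` as its seed, a reader can carry the pair across frames (and across incremental reads) and stop at the first frame whose stored checksum does not match the recomputed one.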

`crates/walrust-core/src/sync.rs` (all three sync sites), `src/sync/wal_sync.rs`
- Rollover was detected only by `current_size < wal_offset`. SQLite can reset the WAL in place with a new salt at the same/larger size; that was missed, so new-generation frames were read as a continuation of the old generation and the new prefix was skipped.
- Fix: threaded the WAL header salt into `SyncState` (`wal_salt`) and `DbState`. All three core sync sites now call a shared `read_next_wal_batch` helper that triggers rollover on a size shrink OR a salt change, resets the offset/generation and re-seeds the checksum chain. The binary sync path does the same two-pronged check inline. Salt is persisted in `state.json` and tracked even on no-op syncs.
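
The two-pronged rollover test can be sketched in isolation. The `WalCursor` struct and `detect_rollover` function below are hypothetical; the field names echo the ones mentioned above, but the real `SyncState`/`DbState` layout is not reproduced here.

```rust
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct WalCursor {
    pub wal_offset: u64,
    pub wal_generation: u64,
    pub wal_salt: Option<(u32, u32)>,
    pub checksum_chain: Option<(u32, u32)>,
}

/// Returns true if the WAL was reset since the last read, and rewinds the cursor.
pub fn detect_rollover(cursor: &mut WalCursor, current_size: u64, header_salt: (u32, u32)) -> bool {
    let shrank = current_size < cursor.wal_offset;
    // A first-ever read (no recorded salt) is not a rollover.
    let salt_changed = matches!(cursor.wal_salt, Some(old) if old != header_salt);
    let rolled = shrank || salt_changed;
    if rolled {
        cursor.wal_offset = 0; // re-read the new generation from its start
        cursor.wal_generation += 1;
        cursor.checksum_chain = None; // re-seed from the new header
    }
    // Track the salt even when nothing else changed.
    cursor.wal_salt = Some(header_salt);
    rolled
}
```

The salt comparison is what catches an in-place reset at the same or larger size, which the size check alone cannot see.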
F13 — [High] restore_with_snapshot_source / pull_incremental apply with no chain verification — Fixed

`crates/walrust-core/src/sync.rs`: `restore_with_snapshot_source`, `pull_incremental`, `pull_incremental_into_sink_inner`
- All three loops applied changesets in seq order with no `verify_chain`, so a stale object from a prior lineage at an in-range seq was applied wholesale.
- Fix: thread `current_checksum: Option<u64>` through each loop. The first changeset establishes the chain (the base isn't HADBP-encoded, so its prior checksum is unknown); every subsequent changeset is checked with `hadb_changeset::physical::verify_chain(prev, &changeset)` and the loop breaks on a chain break rather than applying. The sink path verifies before routing any pages so a mis-chained changeset is rejected whole. `pull_into_sink_stops_on_broken_chain` covers it; the multi-changeset lifecycle test was updated to seed properly chained fixtures.
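
The loop shape can be illustrated with a toy changeset type. Everything below is an assumption made for the example: the `Changeset` fields and the inline comparison stand in for the real `hadb_changeset::physical::verify_chain`, whose signature and semantics are not reproduced here.

```rust
/// Hypothetical changeset header: each one records the checksum it expects
/// to follow and the checksum it produces.
pub struct Changeset {
    pub seq: u64,
    pub prev_checksum: u64,
    pub checksum: u64,
}

/// Apply changesets in order, stopping at the first chain break.
/// Returns the sequence numbers that were applied.
pub fn apply_verified(changesets: &[Changeset]) -> Vec<u64> {
    let mut current_checksum: Option<u64> = None;
    let mut applied = Vec::new();
    for cs in changesets {
        // The first changeset establishes the chain; later ones must extend it.
        if let Some(prev) = current_checksum {
            if cs.prev_checksum != prev {
                break; // verify before touching any pages
            }
        }
        applied.push(cs.seq); // stand-in for applying the pages
        current_checksum = Some(cs.checksum);
    }
    applied
}
```

Note that the loop breaks rather than skips: a changeset that happens to chain correctly after a broken one is still not applied, because the state it expects was never reached.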

`src/cache.rs`, `src/uploader.rs`
- The exposed cursor advanced on cache-write / max(txid) before the uploader confirmed the PUT; a node reseeded from remote state believed un-uploaded TXIDs were restorable.
- Fix: added `last_contiguous_uploaded_txid` to the cache manifest: the highest TXID with a confirmed durable PUT and no gap below it. It advances only inside `mark_uploaded` (after a confirmed PUT) across the gap-free prefix, never on a mere cache write. The uploader exposes it in `UploaderStats.last_contiguous_uploaded_txid`. This is the safe restore cursor; `last_uploaded_txid` (max-based) is kept only for observability.
F9 — [Med] last_uploaded_txid = max(txid) hides a permanent gap; uploader returns Ok on failed PUTs — Fixed

`src/cache.rs`, `src/uploader.rs`
- Fix: `mark_uploaded` advances the contiguous cursor only across an unbroken `1..=T` run. Added `mark_failed` + a `failed_txids` set in the manifest; the uploader records every permanently-failed PUT (auth error or retries exhausted) so the gap is durable and surfaced via `failed_uploads()`/`CacheStats.failed_count` (the upload-failed webhook still fires). The contiguous cursor never advances past a failed or missing TXID. Tests cover out-of-order uploads, a failed-then-retried gap, and restart persistence.
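
A minimal model of the two cursors shows why the max-based one is unsafe as a restore point. The `Manifest` struct below is a hypothetical simplification: it borrows the field and method names mentioned above but omits persistence and everything else the real cache manifest holds.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Default)]
pub struct Manifest {
    pub uploaded: BTreeSet<u64>,
    pub failed_txids: BTreeSet<u64>,
    pub last_uploaded_txid: u64,            // max-based, observability only
    pub last_contiguous_uploaded_txid: u64, // safe restore cursor
}

impl Manifest {
    /// Call only after the PUT for `txid` has been confirmed durable.
    pub fn mark_uploaded(&mut self, txid: u64) {
        self.failed_txids.remove(&txid); // a retried upload clears the gap
        self.uploaded.insert(txid);
        self.last_uploaded_txid = self.last_uploaded_txid.max(txid);
        // Advance only across the gap-free prefix 1..=T.
        while self.uploaded.contains(&(self.last_contiguous_uploaded_txid + 1)) {
            self.last_contiguous_uploaded_txid += 1;
        }
    }

    /// Record a permanently failed PUT so the gap is visible.
    pub fn mark_failed(&mut self, txid: u64) {
        self.failed_txids.insert(txid);
    }
}
```

After uploading TXIDs 1 and 3, `last_uploaded_txid` reads 3 while the contiguous cursor stays at 1; only once TXID 2 is confirmed does the contiguous cursor jump to 3.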

`src/cache.rs`, `src/sync/wal_sync.rs`
- Fix: added an `is_snapshot` flag to `CacheEntry` (set via the new `write_snapshot_ltx`, used for the initial base in `wal_sync`). Cleanup now computes a floor at the latest cached snapshot and never evicts it or any TXID at/after it (the restore base + its incremental chain), regardless of age or `max_cache_size`. Pending (not-yet-durable) uploads were already never evicted. Tests cover keeping a snapshot+chain under aggressive cleanup and evicting a superseded older base.
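
The floor rule can be expressed as a filter over cache entries. This sketch is illustrative only: the `CacheEntry` fields are reduced to what the rule needs, the `evictable` function is hypothetical, and the behaviour when no snapshot is cached (everything non-pending is eligible) is an assumption the source does not state.

```rust
pub struct CacheEntry {
    pub txid: u64,
    pub is_snapshot: bool,
    pub pending: bool, // not yet durably uploaded
}

/// TXIDs that cleanup may evict, ignoring age and size limits.
pub fn evictable(entries: &[CacheEntry]) -> Vec<u64> {
    // Floor = TXID of the latest cached snapshot, if any.
    let floor = entries.iter().filter(|e| e.is_snapshot).map(|e| e.txid).max();
    entries
        .iter()
        .filter(|e| !e.pending) // pending uploads are never evicted
        .filter(|e| match floor {
            Some(f) => e.txid < f, // keep the base and everything at/after it
            None => true,          // assumption: no floor without a snapshot
        })
        .map(|e| e.txid)
        .collect()
}
```

Age and `max_cache_size` would then only choose among the TXIDs this function returns, so no policy setting can remove the current restore base or its chain.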

`src/sync/compact.rs`
- Fix: before deleting, `compact` discovers the live incremental chain and pulls any reachability base out of the delete set: the highest-TXID snapshot (current restore base) and the latest snapshot at/below the earliest retained incremental's start. Rescued snapshots move to `keep` and their bytes are not counted as freed, so a retained incremental chain always has a base.
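
The rescue step amounts to computing up to two base snapshots and removing them from the delete set. The sketch below models snapshots by TXID only; the `rescue_bases` function and its parameters are hypothetical and say nothing about how the real `compact` tracks keys or byte counts.

```rust
/// Remove reachability bases from `delete` and return them.
/// `snapshots` holds the TXIDs of all known snapshots.
pub fn rescue_bases(
    snapshots: &[u64],
    delete: &mut Vec<u64>,
    earliest_incremental_start: Option<u64>,
) -> Vec<u64> {
    let mut keep = Vec::new();
    // Base 1: the highest-TXID snapshot is the current restore base.
    if let Some(&newest) = snapshots.iter().max() {
        keep.push(newest);
    }
    // Base 2: the latest snapshot at/below the earliest retained incremental.
    if let Some(start) = earliest_incremental_start {
        if let Some(&base) = snapshots.iter().filter(|&&t| t <= start).max() {
            if !keep.contains(&base) {
                keep.push(base);
            }
        }
    }
    delete.retain(|t| !keep.contains(t));
    keep
}
```

For snapshots at TXIDs 1, 10, and 20 with the earliest retained incremental starting at 12, an aggressive policy that selected all three for deletion ends up deleting only TXID 1.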

`src/sync/compact.rs`, `src/sync/replicate.rs`, `src/sync/manifest.rs`, `src/s3.rs`
- The production watch path discovers by S3 listing and never writes the `Manifest`, so `compact` was a silent no-op and `replicate` errored "No LTX files found".
- Fix: added `discover_snapshots_from_s3` and `discover_all_ltx_from_s3` to the manifest module (mirroring how `verify` lists generations). `compact` now discovers snapshots from the listing and HEADs each for size/last-modified (`s3::head_object_meta`) to build retention entries, deleting full S3 keys and no longer reading/writing a manifest. `replicate` discovers all LTX files (snapshots + incrementals) from the listing via `DiscoveredLtx`.

`src/sync/wal_sync.rs`
- Fix: after the snapshot folds all WAL frames into the base, `take_snapshot` now resets `wal_offset` to 0, bumps `wal_generation`, re-reads the WAL header salt into `wal_salt`, and clears `wal_checksum_chain` so the next incremental read re-seeds from the new header (ties into F3). The snapshot's `db_checksum` is the explicit hand-off base for the first incremental.

`src/shadow.rs`, `src/sync/shadow.rs`
- The writer used `{:08x}` (u32 width) for the generation while a test encoder used `{:016x}`; lexical order broke for generation `> 0xFFFF_FFFF`.
- Fix: one shared `format_segment_name(generation, index)` / `SEGMENT_HEX_WIDTH = 16` used by the writer and the test encoder. Parsing was already width-agnostic (`u64::from_str_radix`). A test asserts lexical == numeric order past `u32::MAX`.
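
The ordering failure is easy to reproduce with `format!` alone. In the sketch below, the `format_segment_name` and `SEGMENT_HEX_WIDTH` names come from the finding, but the `generation-index` layout with a `-` separator is an assumption; the real segment name format may differ.

```rust
pub const SEGMENT_HEX_WIDTH: usize = 16;

/// Fixed-width hex so that lexical order equals numeric order for any u64.
pub fn format_segment_name(generation: u64, index: u64) -> String {
    format!("{:0w$x}-{:0w$x}", generation, index, w = SEGMENT_HEX_WIDTH)
}

/// The old 8-digit padding: a minimum width, so values past u32::MAX grow to 9+ digits.
pub fn old_generation_prefix(generation: u64) -> String {
    format!("{:08x}", generation)
}
```

With the old prefix, generation `0xFFFF_FFFF` formats as `ffffffff` and `0x1_0000_0000` as `100000000`, so the larger generation sorts first lexically. At width 16 every `u64` has the same number of digits and the two orders agree.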

`src/sync/manifest.rs`, `verify.rs`
- Fix: added one shared `is_snapshot(generation, min_txid, max_txid)` helper (`generation > 0 || (min == 1 && max == 1)`) and routed `verify`, `discover_snapshots_from_s3`, and `discover_all_ltx_from_s3` through it.
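
Written out, the predicate is a one-liner. The parameter types are assumed to be `u64` here; the source gives only the names and the boolean expression.

```rust
/// A file is a snapshot if it belongs to a non-zero generation, or if it is
/// the initial base covering exactly TXID 1.
pub fn is_snapshot(generation: u64, min_txid: u64, max_txid: u64) -> bool {
    generation > 0 || (min_txid == 1 && max_txid == 1)
}
```

Keeping the rule in one function means `verify` and both discovery paths cannot drift apart on what counts as a snapshot.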

`walrust-dst/src/{mock_storage,chaos,invariants,properties,main,disk_queue_tests}.rs`, `walrust-dst/Cargo.toml`, `src/testable.rs`, `src/lib.rs`
- The harness now builds and `cargo test -p walrust-dst` runs green (58 tests), injecting real faults through the real codec.
- Mock rewritten onto the current `hadb_storage::StorageBackend` trait (`get`/`put`/`delete`/`list`/`exists` + `put_if_absent`/`put_if_match`). The fault model is preserved and honest:
  - `PartialWrite` persists the truncated prefix then surfaces the error, so a torn object is observable on a later `get` (was: stored nothing).
  - `EventualConsistency` is gated on a deterministic, seeded operation counter (`visible_after_ops`), not wall-clock time, so read- and list-after-write staleness is reproducible under a fixed seed (minimum lag of 2 ops guarantees the first read observes the object as not-yet-visible). `list` honours the same visibility gate; `get` returns `Ok(None)` for a not-yet-visible object, modelling a stale read.
  - `SilentCorruption` flips a real bit in the stored bytes.
  - `RandomError` classifies as transient so retry can recover.
- New `walrust::testable` module: snapshot/sync/restore wired straight onto the `StorageBackend` trait via the real `ltx` encoder/decoder + checksum chain and the litestream key layout. It is not a second watch loop: an injected fault flows through real encode → PUT → GET → decode → checksum, and a corrupt or torn object is caught by the same `apply_ltx_to_db`/`decode_to_db` verification the daemon uses. `_with_retry` variants drive the real retry policy over the fault-prone PUT.
- The harness now asserts real outcomes: corruption is detected by `verify_ltx` (100% over 20 trials); `chaos_s3_errors` recovers injected transient faults via retry; `prop_point_in_time_restore` restores to a TXID and asserts the exact row count; `prop_wal_batching_no_loss` replays a snapshot+incremental chain and asserts no frames are lost; `prop_recovery_under_failure` snapshots under a 10% error rate and asserts no data loss when restore succeeds.
- Build-config clash resolved: `walrust-dst` rusqlite pinned to 0.35 (matching `walrust`) and the git `hadb-*` crates patched to the local checkout, so one `libsqlite3-sys` (`links = "sqlite3"`) provider and one `hadb-storage` trait version exist in the graph.
- The pre-F9 `disk_queue_tests` expectations were updated to the current cache semantics (a permanently-failed upload moves from `pending` into `failed`, surfacing the durable gap) and a fixed-sleep multi-DB count assertion was made to poll-until-drained so it is not timing-flaky.
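
The operation-counter visibility gate is the piece that makes staleness reproducible, and it can be modelled in a few lines. This sketch is not the harness's mock: `MockStore` is a hypothetical synchronous stand-in that does not implement `hadb_storage::StorageBackend`, and it uses a fixed lag where the real mock derives it from a seed.

```rust
use std::collections::HashMap;

pub struct MockStore {
    ops: u64,     // deterministic logical clock: one tick per operation
    lag_ops: u64, // how many ops an object stays invisible after its put
    objects: HashMap<String, (Vec<u8>, u64)>, // key -> (bytes, visible_after_ops)
}

impl MockStore {
    pub fn new(lag_ops: u64) -> Self {
        // Minimum lag of 2 so the first read after a put is always stale.
        Self { ops: 0, lag_ops: lag_ops.max(2), objects: HashMap::new() }
    }

    pub fn put(&mut self, key: &str, data: &[u8]) {
        self.ops += 1;
        let visible_after_ops = self.ops + self.lag_ops;
        self.objects.insert(key.to_string(), (data.to_vec(), visible_after_ops));
    }

    /// Returns None for a missing or not-yet-visible object (a stale read).
    pub fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        self.ops += 1;
        match self.objects.get(key) {
            Some((data, visible)) if self.ops >= *visible => Some(data.clone()),
            _ => None,
        }
    }

    /// Lists only keys that have passed the same visibility gate.
    pub fn list(&mut self) -> Vec<String> {
        self.ops += 1;
        let now = self.ops;
        let mut keys: Vec<String> = self
            .objects
            .iter()
            .filter(|(_, (_, visible))| now >= *visible)
            .map(|(k, _)| k.clone())
            .collect();
        keys.sort();
        keys
    }
}
```

Because visibility depends only on the count of operations, the same sequence of calls always observes the same stale reads, which wall-clock gating cannot guarantee.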

- `cargo build` (the `walrust` bin + lib) is green for all Fixed findings.
- `walrust` unit tests pass: WAL checksum golden vectors + torn-tail, the restore/pull chain-break test, the cache durable-cursor / failed-gap / floor tests, the shadow segment-width test, plus the pre-existing suites.
- `crates/walrust-core` lib tests pass (run in that crate's directory).
- Live-network integration tests (S3-backed) are gated and not exercised here; `compact`/`replicate` discovery (F6) is verified by construction against the litestream layout, not against a live bucket. `tests/test_verify.rs` requires S3 credentials and is expected to fail without them.
- `walrust-dst` builds and `cargo test -p walrust-dst` runs green (58 tests), injecting real storage faults through the real `ltx` codec via the new `walrust::testable` driver (see F14).