Atomicity and consistency in the resume procedure of the downloader.#5254
eriknordmark
left a comment
In terms of "Test resumability during a restart" I think there are two cases:
- The device reboots/power cycles
- The network goes out for a few minutes so TCP and the download times out.
We should test both variants - the description currently only has the first one.
(And in addition it is important to test the corruption case, which is already listed.)
I thought it was because it wasn't implemented in the zedUpload library... I don't see any code to resume download in the HTTP lib there...
In eve-libs, the downloader periodically records progress (the doneParts) and reports it to EVE for all datastore types, not just AWS and Azure. EVE then persists this progress. On restart or crash recovery, the downloader reads the saved doneParts and resumes from the last completed ranges. Therefore, unless there is a bug in how doneParts is produced, resumable downloads should work for all datastore types.
It would make sense to update the release notes part of the description with that context (that it makes it more robust for S3, etc). In terms of https: note that if a file changes or is deleted on the server (whether it is S3, Azure, or https), the interrupted and resumed download might fail, or it might complete and then fail the sha verification. I think we have added code to automatically retry even when the sha verification fails, but it would make sense to check that as well.
I don't see the HTTP library handling any doneParts or similar entities. So, my understanding is that HTTP datastores do not support it in our case until we implement it in zedUpload.
You are right @OhmSpectator, I got confused by the statsUpdater function. It updates the doneParts of the drona request with the doneParts of the stats variable. However, in the case of the HTTP datastore we do not update the doneParts of the stats variable; we only update its Asize. We need to make two changes: first, update the doneParts of the stats; second, resume from the doneParts saved in the local file and add a Range header to the request.
@jsfakian @OhmSpectator I took the liberty of adding a preamble sentence to the Changelog notes
LGTM
I would be more specific
@jsfakian The eden eve update (zfs) is failing and the log shows ".part", which sounds like it might be related to this PR. Can you take a look?
This is strange: I ran the EDEN test for the EVE upgrade (zfs) locally, and it works. One line that might be suspicious is the following:
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5254      +/-   ##
==========================================
+ Coverage   18.52%   19.50%   +0.98%
==========================================
  Files          19       19
  Lines        2705     3025     +320
==========================================
+ Hits          501      590      +89
- Misses       2120     2314     +194
- Partials       84      121      +37
I have verified which code this error comes from, but I cannot think why the tmpLocFilename might not be present.
…ader.

This PR hardens the downloader’s persistence, consistency, and resumability without changing the transport/eve-libs logic:
- Persist DownloadedParts to <target>.part.progress.json using write → fsync → rename → fsync(dir).
- Add a self-check hash (sha256) as a validator.
- Auto-recover from a valid leftover *.tmp on reboot.
- Stream bytes into <target>.part (same directory).
- On success, atomically rename to the final target.
- Prevents half-written finals from leaking to other agents.

With the previous approach:
- Progress JSON could be truncated on crash; loader trusted whatever it could decode.
- Bytes were written directly to the final path; a crash could leave a half-written final.
- Remote object change (same name, different size) could continue appending mismatched data.

With this PR:
- Progress JSON is atomic & fsync’d; includes a hash; loader rejects corrupted files and can promote a valid .tmp.
- Bytes go to <target>.part; only after successful completion do we rename to <target> atomically.

Signed-off-by: Ioannis Sfakianakis <[email protected]>
I removed the part where we use the |
Description
This PR hardens the downloader’s persistence and finalization without changing the transport/eve-libs logic:
No changes to eve-libs interfaces. Existing WithDoneParts(downloadedParts) resume logic remains; it is now backed by crash-safe, trustworthy state.
How to test and validate this PR
Test resumability during a restart
Test resumability during restart when the chunks are corrupted
Test resumability in a poor networking environment
Changelog notes
EVE supports incremental and resumable downloads from some datastore types (S3, Azure, ...) by checkpointing state to disk. This PR provides a few improvements in that area:
PR Backports
For all current LTS branches, please state explicitly if this PR should be
backported or not. This section is used by our scripts to track the backports,
so, please, do not omit it.
Here is the list of current LTS branches (it should be always up to date):