Atomicity and consistency in the resume procedure of the downloader. #5254

Merged
rene merged 1 commit into lf-edge:master from jsfakian:WIP-resume-support-EVE-downloader
Oct 8, 2025

Conversation

@jsfakian

@jsfakian jsfakian commented Sep 25, 2025

Contributor

Description

This PR hardens the downloader’s persistence and finalization without changing the transport/eve-libs logic:

  1. Atomic, durable progress file:
    • Persist DownloadedParts to .part.progress.json using write → fsync → rename → fsync(dir).
    • Add a self-check hash (sha256) and store ContentLength as a validator.
    • Auto-recover from a valid leftover *.tmp on reboot.
  2. Download into a temp payload
    • Stream bytes into .part (same directory).
    • On success, atomically rename to the final target.
    • Prevents half-written finals from leaking to other agents.

No changes to eve-libs interfaces. Existing WithDoneParts(downloadedParts) resume logic remains; it is now backed by crash-safe, trustworthy state.

How to test and validate this PR

Test resumability during a restart

  1. Onboard an edge device running EVE-OS on a controller (e.g., Zedcloud)
  2. Deploy an edge app on that device with a large image (e.g., more than 5GB).
  3. During the download, restart the device mid-transfer.
    • Expect: .part present; .part.progress.json valid; no final file.
    • Restart: download resumes (eve-libs skips done parts), completes, and renames atomically; progress file removed.

Test resumability during restart when the chunks are corrupted

  1. Onboard an edge device running EVE-OS on a controller (e.g., Zedcloud)
  2. Deploy an edge app on that device with a large image (e.g., more than 5GB).
  3. During the download, corrupt the progress file (edit, truncate), and restart the device mid-transfer.
    • Expect: loader rejects; resume from 0; new progress file created atomically.

Test resumability in a poor networking environment

  1. Onboard an edge device running EVE-OS on a controller (e.g., Zedcloud)
  2. Deploy an edge app on that device with a large image (e.g., more than 5GB).
  3. During the download, the network goes out for a few minutes, so TCP and the download time out.
    • Expect: .part present; .part.progress.json valid; no final file.
    • Restart: download resumes (eve-libs skips done parts), completes, and renames atomically; progress file removed.

Changelog notes

EVE supports incremental and resumable downloads from some datastore types (S3, Azure, ...) by checkpointing state to disk. This PR provides a few improvements in that area:

  • Avoid truncated/corrupted final files after crashes or power loss.
  • Ensure resumability works reliably across reboots by making progress durable and self-validated.

PR Backports

For all current LTS branches, please state explicitly if this PR should be
backported or not. This section is used by our scripts to track the backports,
so, please, do not omit it.

Here is the list of current LTS branches (it should be always up to date):

- 14.5-stable: To be backported.
- 13.4-stable: To be backported.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR
  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

@jsfakian jsfakian added the bug (Something isn't working) and stable (Should be backported to stable release(s)) labels Sep 25, 2025
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch from aaff2b0 to 0a9e47a Compare September 25, 2025 15:32
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch from 0a9e47a to 66b6810 Compare September 25, 2025 15:32
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch 3 times, most recently from 40fea4c to 359b4a2 Compare September 25, 2025 15:43

@eriknordmark eriknordmark left a comment

Contributor


In terms of "Test resumability during a restart" I think there are two cases:

  1. The device reboots/power cycles
  2. The network goes out for a few minutes so TCP and the download time out.

We should test both variants - the description currently only has 1.
(And in addition it is important to test the corruption case which is already listed.)

Comment thread pkg/pillar/cmd/downloader/dirs.go Outdated
Comment thread pkg/pillar/cmd/downloader/syncop.go Outdated
Comment thread pkg/pillar/cmd/downloader/syncop.go Outdated
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch from 359b4a2 to 3dc7ecc Compare September 29, 2025 10:37
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch 3 times, most recently from ef50759 to 04e26e5 Compare September 29, 2025 10:52
@OhmSpectator

Member

In the case of AWS S3 and Azure, we set the cleanOnError to false, and for all other cases, we set the cleanOnError to true.

I thought it was because it wasn't implemented in the zedUpload library... I don't see any code to resume download in the HTTP lib there...

@jsfakian

jsfakian commented Oct 1, 2025

Contributor Author

In the case of AWS S3 and Azure, we set the cleanOnError to false, and for all other cases, we set the cleanOnError to true.

I thought it was because it wasn't implemented in the zedUpload library... I don't see any code to resume download in the HTTP lib there...

In eve-libs, the downloader periodically records progress (the doneParts) and reports it to EVE for all datastore types, not just AWS and Azure. EVE then persists this progress. On restart or crash recovery, the downloader reads the saved doneParts and resumes from the last completed ranges. Therefore, unless there is a bug in how doneParts is produced, resumable downloads should work for all datastore types.

@eriknordmark

Contributor

You are right that this PR does not add new functionality for download resumability (it was already present). But we will support resuming downloads from HTTP/HTTPS datastores. We had a variable called cleanOnError which prevented the resumability if it was set to true. In the case of AWS S3 and Azure, we set the cleanOnError to false, and for all other cases, we set the cleanOnError to true. That is why the downloader resumed only when the datastore was AWS S3 or Azure.

It would make sense to update the release notes part of the description with that context (that it makes it more robust for S3, etc).

In terms of https:
Does the http/https downloader have the ability to download in chunks from a specified offset?
And do all of the https servers support this?
It might make sense to explore this a bit outside of this PR to see whether we can enable the resume for it.

Note that if a file changes or is deleted on the server (whether it is S3, Azure, or https) the result might be a failure of the interrupted and resumed download, or it will complete and the sha verification will fail. I think we have added code to automatically retry even when the sha verification fails, but it would make sense to check that as well.

@OhmSpectator

Member

On restart or crash recovery, the downloader reads the saved doneParts and resumes from the last completed ranges

I don't see the HTTP library handling any doneParts or similar entities.
https://github.com/lf-edge/eve-libs/blob/main/zedUpload/datastore_http.go

So, my understanding is that HTTP datastores do not support it in our case until we implement it in zedUpload.

@jsfakian

jsfakian commented Oct 1, 2025

Contributor Author

On restart or crash recovery, the downloader reads the saved doneParts and resumes from the last completed ranges

I don't see the HTTP library handling any doneParts or similar entities. https://github.com/lf-edge/eve-libs/blob/main/zedUpload/datastore_http.go

So, my understanding is that HTTP datastores do not support it in our case until we implement it in zedUpload.

You are right @OhmSpectator, I got confused due to the function statsUpdater in the following code:

func (ep *HttpTransportMethod) processHttpDownload(req *DronaRequest) (error, int) {
	file := req.name
	if ep.hurl != "" {
		file = ep.hurl + "/" + ep.path + "/" + req.name
	}
	prgChan := make(types.StatsNotifChan)
	defer close(prgChan)
	if req.ackback {
		go statsUpdater(req, ep.ctx, prgChan)
	}
	hClient, err := ep.hClientWrap.unwrap()
	if err != nil {
		return err, 0
	}
	stats, resp := zedHttp.ExecCmd(req.cancelContext, "get", file, "",
		req.objloc, req.sizelimit, prgChan, hClient, ep.inactivityTimeout)
	return stats.Error, resp.BodyLength
}

func reqPostSize(req *DronaRequest, dronaCtx *DronaCtx, stats types.UpdateStats) {
	req.doneParts = stats.DoneParts
	dronaCtx.postSize(req, stats.Size, stats.Asize)
}

func statsUpdater(req *DronaRequest, dronaCtx *DronaCtx, prgNotif types.StatsNotifChan) {
	ticker := time.NewTicker(StatsUpdateTicker)
	defer ticker.Stop()
	var newStats, stats types.UpdateStats
	var ok bool
	for {
		select {
		case newStats, ok = <-prgNotif:
			if !ok {
				reqPostSize(req, dronaCtx, stats)
				return
			}
			stats = newStats
		case <-ticker.C:
			reqPostSize(req, dronaCtx, stats)
		}
	}
}

The function updates the doneParts of the drona request with the doneParts of the stats variable. However, in the case of the HTTP datastore, we do not update the doneParts of the stats variable:

for {
	var copyErr error

	written, copyErr = io.CopyN(local, inactivityReader, chunkSize)
	copiedSize += written
	stats.Asize = copiedSize

	// possible situations:
	// err != nil && err == io.EOF - end of file, wrap up and return
	// err != nil && err == inactivityTimeout - begin a retry
	// err != nil - wrap up and return
	// err == nil - update stats and keep reading
	switch {
	case copyErr != nil && errors.Is(copyErr, io.EOF) && copiedSize != objSize && objSize != 0:
		appendToErrorList("premature EOF after %d out of %d bytes: %+v", copiedSize, objSize, copyErr)
		return stats, rsp
	case copyErr != nil && errors.Is(copyErr, io.EOF):
		// empty out the error list
		errorList = nil
		return stats, rsp
	case copyErr != nil && errors.Is(copyErr, &ErrTimeout{}):
		// the error comes from timeout
		appendToErrorList("inactivity for %s", inactivityTimeout)
	case copyErr != nil:
		appendToErrorList("error from CopyN after %d out of %d bytes: %v", copiedSize, objSize, copyErr)
		return stats, rsp
	default:
		// no error, so just continue
		types.SendStats(prgNotify, stats)
		continue
	}
	// every other case either returns or continues; if we made it here,
	// break io.CopyN loop, forcing a retry of the outer loop
	break
}

We only update the Asize of the stats, not the doneParts.

We need to make two changes: first update the doneParts of the stats, e.g.:

stats.Asize = copiedSize
if written > 0 {
    // adapt to your actual Part struct (Start/End or Offset/Size)
    stats.DoneParts = types.DownloadedParts{
        PartSize: copiedSize, // or your configured chunk size
        Parts:    []*types.PartDefinition{{Ind: i, Size: written}},
    }
}

and second start from the saved doneParts in the local file and add a range in the header of the request.

@eriknordmark

Contributor

@jsfakian @OhmSpectator I took the liberty to add a pre-amble sentence to the Changelog notes
section. Please take a look and tweak it as needed.

@jsfakian

jsfakian commented Oct 2, 2025

Contributor Author

@jsfakian @OhmSpectator I took the liberty to add a pre-amble sentence to the Changelog notes section. Please take a look and tweak it as needed.

LGTM

@eriknordmark eriknordmark requested a review from europaul October 2, 2025 11:54
@OhmSpectator

Member

@jsfakian @OhmSpectator I took the liberty to add a pre-amble sentence to the Changelog notes section. Please take a look and tweak it as needed.

LGTM

I would be more specific

some datastores types (S3, Azure, ...)

@europaul europaul left a comment

Contributor


LGTM

@eriknordmark

Contributor

@jsfakian The eden eve update (zfs) is failing and the log shows
"{"file":"/pillar/cmd/volumemgr/updatestatus.go:216","func":"github.com/lf-edge/eve/pkg/pillar/cmd/volumemgr.doUpdateContentTree","level":"error","msg":"doUpdateContentTree: BlobStatus(e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5) has error: open /persist/vault/downloader/pending/d78fe4a159e014bdce546470c5d557406e38ac3b08947d746265009317b87356.e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5.part: no such file or directory\n","pid":2520,"source":"volumemgr","time":"2025-10-02T15:29:05.292749332Z"}",

".part" sounds like it might be related to this PR. Can you take a look?

@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch from 244f652 to 54c31db Compare October 3, 2025 14:44
@jsfakian

jsfakian commented Oct 3, 2025

Contributor Author

@jsfakian The eden eve update (zfs) is failing and the log shows "{"file":"/pillar/cmd/volumemgr/updatestatus.go:216","func":"github.com/lf-edge/eve/pkg/pillar/cmd/volumemgr.doUpdateContentTree","level":"error","msg":"doUpdateContentTree: BlobStatus(e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5) has error: open /persist/vault/downloader/pending/d78fe4a159e014bdce546470c5d557406e38ac3b08947d746265009317b87356.e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5.part: no such file or directory\n","pid":2520,"source":"volumemgr","time":"2025-10-02T15:29:05.292749332Z"}",

".part" sounds like it might be related to this PR. Can you take a look?

This is strange: I ran the EDEN test for the EVE upgrade on ZFS locally, and it works. One line that might be suspicious is the following:

f, ferr := os.OpenFile(tempLocFilename, os.O_RDWR, 0)

@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch 6 times, most recently from 816bc64 to fda7cb3 Compare October 5, 2025 13:07
@codecov

codecov Bot commented Oct 5, 2025


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 19.50%. Comparing base (d1b24f1) to head (fda7cb3).
⚠️ Report is 41 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5254      +/-   ##
==========================================
+ Coverage   18.52%   19.50%   +0.98%     
==========================================
  Files          19       19              
  Lines        2705     3025     +320     
==========================================
+ Hits          501      590      +89     
- Misses       2120     2314     +194     
- Partials       84      121      +37     


@jsfakian

jsfakian commented Oct 5, 2025

Contributor Author

@jsfakian The eden eve update (zfs) is failing and the log shows "{"file":"/pillar/cmd/volumemgr/updatestatus.go:216","func":"github.com/lf-edge/eve/pkg/pillar/cmd/volumemgr.doUpdateContentTree","level":"error","msg":"doUpdateContentTree: BlobStatus(e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5) has error: open /persist/vault/downloader/pending/d78fe4a159e014bdce546470c5d557406e38ac3b08947d746265009317b87356.e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5.part: no such file or directory\n","pid":2520,"source":"volumemgr","time":"2025-10-02T15:29:05.292749332Z"}",

".part" sounds like it might be related to this PR. Can you take a look?

I have verified that this error comes from the following code:

// Finalize: fsync and atomically move tempLocFilename -> locFilename.
		f, ferr := os.OpenFile(tmpLocFilename, os.O_RDWR, 0644)
		if ferr != nil {
			log.Errorf("Failed to open file %s: %v", tmpLocFilename, ferr)
			return handleSyncOpResponse(ctx, config, status, locFilename,

But I cannot think of why the tmpLocFilename might not be present.

@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch 4 times, most recently from 3ce7638 to 0ce79cf Compare October 6, 2025 04:44
…ader.

This PR hardens the downloader’s persistence, consistency, and resumability without changing the transport/eve-libs logic:
- Persist DownloadedParts to <target>.part.progress.json using write → fsync → rename → fsync(dir).
- Add a self-check hash (sha256) as a validator.
- Auto-recover from a valid leftover *.tmp on reboot.

- Stream bytes into <target>.part (same directory).
- On success, atomically rename to the final target.
- Prevents half-written finals from leaking to other agents.

With the previous approach:
- Progress JSON could be truncated on crash; loader trusted whatever it could decode.
- Bytes were written directly to the final path; a crash could leave a half-written final.
- Remote object change (same name, different size) could continue appending mismatched data.

With this PR:
- Progress JSON is atomic & fsync’d; includes a hash; loader rejects corrupted files and can promote a valid .tmp.
- Bytes go to <target>.part; only after successful completion do we rename to <target> atomically.

Signed-off-by: Ioannis Sfakianakis <[email protected]>
@jsfakian jsfakian force-pushed the WIP-resume-support-EVE-downloader branch from 0ce79cf to f9daca1 Compare October 6, 2025 11:02
@jsfakian

jsfakian commented Oct 6, 2025

Contributor Author

@jsfakian The eden eve update (zfs) is failing and the log shows "{"file":"/pillar/cmd/volumemgr/updatestatus.go:216","func":"github.com/lf-edge/eve/pkg/pillar/cmd/volumemgr.doUpdateContentTree","level":"error","msg":"doUpdateContentTree: BlobStatus(e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5) has error: open /persist/vault/downloader/pending/d78fe4a159e014bdce546470c5d557406e38ac3b08947d746265009317b87356.e7bf444fec8a84cbbb56f63c6199c4ea4c62a308394587b2a54de3e237532ba5.part: no such file or directory\n","pid":2520,"source":"volumemgr","time":"2025-10-02T15:29:05.292749332Z"}",

".part" sounds like it might be related to this PR. Can you take a look?

I removed the part where we use the .tmp filename, and now the tests for the upgrade of ZFS and EXT4 have passed. Are we safe to merge?

@rene rene merged commit 6debedd into lf-edge:master Oct 8, 2025
43 of 52 checks passed

Labels

bug (Something isn't working), stable (Should be backported to stable release(s))

7 participants