fix(build-index): Add retry logic for manifest download during dependency resolution #502
Conversation
Pull request overview
This PR adds retry logic with exponential backoff for manifest downloads during dependency resolution to handle transient failures in distributed systems. The changes also improve error messages throughout the affected code by using error wrapping with %w instead of %s.
Key changes:
- Implemented retry mechanism in dockerResolver.downloadManifest() with configurable exponential backoff (5 retries, up to 30s max interval)
- Added error wrapping for better error context propagation across multiple layers
- Improved code organization with const grouping
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| utils/dockerutil/dockerutil.go | Groups manifest type constants and improves error wrapping in GetManifestReferences |
| build-index/tagtype/map.go | Configures exponential backoff settings (500ms initial, 2x multiplier, 5 retries) for docker resolver initialization |
| build-index/tagtype/docker_resolver.go | Implements core retry logic with backoff for manifest downloads, handling retryable errors, network errors, and ErrBlobNotFound; adds retry attempt logging |
| build-index/tagserver/server.go | Enhances error messages in putTagHandler with consistent "put tag handler" prefix for better error traceability |
build-index/tagserver/server.go (outdated diff)
```diff
 	replicate, err := strconv.ParseBool(httputil.GetQueryArg(r, "replicate", "false"))
 	if err != nil {
-		return handler.Errorf("parse query arg `replicate`: %s", err)
+		return fmt.Errorf("put tag handler: parse query arg `replicate`: %w", err)
```
Why are we making this change?
The code has handler.Errorf, but the same thing already happens in handler.Wrap - all the handlers are wrapped with it.
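For context, a minimal standalone illustration of why %w matters (not Kraken code; errBlobNotFound here is just a stand-in sentinel): wrapping with %w keeps the original error in the chain, so callers can still match it with errors.Is, whereas %s flattens it to a string.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlobNotFound is a stand-in sentinel error for this illustration.
var errBlobNotFound = errors.New("blob not found")

func main() {
	wrapped := fmt.Errorf("put tag handler: %w", errBlobNotFound)
	flattened := fmt.Errorf("put tag handler: %s", errBlobNotFound)

	fmt.Println(errors.Is(wrapped, errBlobNotFound))   // true: the cause is preserved
	fmt.Println(errors.Is(flattened, errBlobNotFound)) // false: only the message survives
}
```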
Is a backoff the correct solution for this problem?
If we zoom out, Kraken uses 2 levels of write-back caches for image uploads:
- The user uploads (the manifest/blobs) to proxy, after which proxy returns 200. The proxy is then responsible for writing-back to the origin cluster.
- The origin cluster is responsible for writing-back to GCS, after receiving the blobs from proxy.
From what I understand, the error occurs when the blobs are still in proxy (and not in the origins), but build-index tries to query for them in the origin cluster, which fails, as they are neither there nor in GCS.
Ideally, Kraken would check its own write-back cache (proxy) before checking the DB (origins/GCS), but perhaps that adds more complexity than necessary (BI doesn't call proxy atm). A backoff is simpler.
We can't remove the check either, as it's part of the docker registry v2 API spec. Have we considered making the first cache layer (proxy) write-through, instead of write-back? I.e. proxy wouldn't return 200 until the blobs are written to the origin cluster. What's the downside of that approach?
Seems that it is more complex, right?
Maybe we can just fix this problem with retries?
The tradeoff with the write-through approach is that you are bottlenecked by how fast proxy can write blobs to origin.
I would like to see how retries impact the existing failures. I have filed a follow-up issue, #505, to investigate the write-through behaviour.
```diff
-	for i, dep := range deps {
+	for _, dep := range deps {
 		if _, err := s.localOriginClient.Stat(tag, dep); err == blobclient.ErrBlobNotFound {
-			log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
```
Why are we removing these logs? Are they noisy?
I removed it because the log is already written here:

```go
if err := s.putTag(tag, d, deps); err != nil {
	log.With("tag", tag, "digest", d.String(), "error", err).Error("Failed to put tag")
	return err
}
```

So it would be duplicated.
gkeesh7 left a comment:
Approved with some comments. I want to assess how much improvement we get with retries. Please add some observability around the number of manifest downloads that succeed after additional retries, as a follow-up.
```diff
-			log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
-			return handler.Errorf("cannot upload tag, missing dependency %s", dep)
+			return fmt.Errorf("cannot upload tag, missing dependency %s", dep)
 		} else if err != nil {
```
Why are we removing these logs? Are they noisy?
Same as here #502 (comment)
build-index/tagtype/map.go (outdated diff)
```diff
-		sr = &subResolver{re, &dockerResolver{originClient}}
+		backoffConfig := httputil.ExponentialBackOffConfig{
+			Enabled:         true,
+			InitialInterval: 500 * time.Millisecond,
```
Can we export these magic numbers to constants instead?
It would be better to supply them from the configuration file as well. Filing an issue for this too: #506.
In general yes, but it seems to be a very rare case; most of the pushes succeed without retries. I think we need a general policy for retrying calls to dependencies, an analysis of all the places where it is worth doing this, and then a config that is reused across all those calls.
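For reference, a rough sketch of how the values in this PR (500ms initial interval, 2x multiplier, 5 retries, 30s max interval) map onto the underlying cenkalti/backoff library. The real change goes through Kraken's httputil.ExponentialBackOffConfig wrapper, so the function name and field names below follow the upstream library, not necessarily the wrapper.

```go
package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

// newManifestDownloadBackOff mirrors the settings described in this PR
// (illustrative only; the name is not from the actual change).
func newManifestDownloadBackOff() backoff.BackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 500 * time.Millisecond
	b.Multiplier = 2
	b.MaxInterval = 30 * time.Second
	b.RandomizationFactor = 0            // disable jitter so the printed schedule is deterministic
	return backoff.WithMaxRetries(b, 5)  // at most 5 retries after the initial attempt
}

func main() {
	b := newManifestDownloadBackOff()
	for d := b.NextBackOff(); d != backoff.Stop; d = b.NextBackOff() {
		fmt.Println(d) // ~500ms, ~1s, ~2s, ~4s, ~8s
	}
}
```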
```go
		return err
	}

	if err := backoff.Retry(retryFunc, r.backoffConfig.Build()); err != nil {
```
Will the caller time out/hang up by the time we are doing retries? Can you comment on what timeouts we have from the caller side while downloading the manifests?
Indeed, build-index could time out before the retries complete; I increased the timeout for build-index.
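To make the retry shape concrete, here is a self-contained sketch of the pattern under review; downloadManifest, isRetryable, and errBlobNotFound are simplified stand-ins for the real helpers and for blobclient.ErrBlobNotFound. The key point is that only transient failures (network errors, or blob-not-found while the upload may still be in flight) are retried, and anything else is wrapped in backoff.Permanent so the retry loop stops immediately.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"

	"github.com/cenkalti/backoff"
)

// errBlobNotFound stands in for blobclient.ErrBlobNotFound.
var errBlobNotFound = errors.New("blob not found")

// downloadManifest stands in for the real origin-cluster download.
func downloadManifest() ([]byte, error) {
	return nil, errBlobNotFound
}

// isRetryable treats network errors and missing blobs (the upload may still
// be in flight) as transient; everything else is permanent.
func isRetryable(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) || errors.Is(err, errBlobNotFound)
}

func downloadManifestWithRetry(b backoff.BackOff) ([]byte, error) {
	var manifest []byte
	attempt := 0
	op := func() error {
		attempt++
		m, err := downloadManifest()
		if err != nil {
			if isRetryable(err) {
				fmt.Printf("attempt %d failed, will retry: %v\n", attempt, err)
				return err // backoff.Retry schedules another attempt
			}
			return backoff.Permanent(err) // stop retrying immediately
		}
		manifest = m
		return nil
	}
	if err := backoff.Retry(op, b); err != nil {
		return nil, fmt.Errorf("download manifest: %w", err)
	}
	return manifest, nil
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 100 * time.Millisecond
	_, err := downloadManifestWithRetry(backoff.WithMaxRetries(b, 2))
	fmt.Println("final error:", err)
}
```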
```go
	require := require.New(t)
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	resolver, mockOrigin := newTestDockerResolver(ctrl)

	tag := "repo/image:v1.0"
	layers := core.DigestListFixture(3)
	manifest, manifestBytes := dockerutil.ManifestFixture(layers[0], layers[1], layers[2])
```
These lines are repeated over and over; try extracting them into a helper/setup function.
Done
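As a generic illustration of the suggested pattern (types and names here are hypothetical, not the actual helper added in this PR), a setup function can own the gomock controller and the shared fixtures so each test body only contains what is specific to that test:

```go
package example

import (
	"testing"

	"github.com/golang/mock/gomock"
)

// fixtures bundles setup shared by the resolver tests. In the real code this
// would also hold the resolver, mock origin client, and manifest fixtures.
type fixtures struct {
	ctrl *gomock.Controller
	tag  string
}

// setupTest creates the shared fixtures and registers cleanup on t, replacing
// the repeated NewController/Finish/fixture lines in each test.
func setupTest(t *testing.T) *fixtures {
	ctrl := gomock.NewController(t)
	t.Cleanup(ctrl.Finish)
	return &fixtures{
		ctrl: ctrl,
		tag:  "repo/image:v1.0",
	}
}
```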
#507: Follow-up for observability.
PR Description
Problem
Clients frequently experience "Failed to resolve dependencies in build-index" errors during image pushes, particularly for images with large blob layers. This causes push operations to fail and requires manual retries.
Root Cause
Race condition between manifest upload and dependency resolution:
When a Docker client pushes an image manifest, the putTag handler is triggered immediately (via nginx/registry) and build-index starts resolving the manifest's dependencies, potentially before the manifest upload to the origin cluster has completed.
Timeline from production logs: the manifest was being uploaded successfully, but build-index queried for it before the upload completed.
Solution
Implement exponential backoff retry mechanism for manifest downloads during dependency resolution to gracefully handle the upload race condition.
Changes Made
1. build-index/tagtype/docker_resolver.go - Add retry logic: use backoff.Retry() to retry manifest downloads with exponential backoff.
2. build-index/tagtype/map.go - Configure the backoff parameters (500ms initial interval, 2x multiplier, 5 retries).
Retry schedule:
- Attempt 1: immediate
- Attempt 2: +500ms (total: ~0.5s)
- Attempt 3: +1s (total: ~1.5s)
- Attempt 4: +2s (total: ~3.5s)
- Attempt 5: +4s (total: ~7.5s)
- Attempt 6: +8s (total: ~15.5s)
Total time: ~15.5s across 6 attempts (the initial attempt plus 5 retries).
This provides ~15 seconds for the origin cluster to complete the manifest upload and replication.
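As a sanity check on the schedule above (ignoring the library's randomization jitter), the cumulative wait before the final attempt is a geometric sum:

$$\sum_{k=0}^{4} 0.5\,\mathrm{s} \cdot 2^{k} = 0.5\,\mathrm{s}\,(2^{5}-1) = 15.5\,\mathrm{s}$$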
Impact
Before
After