Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@hweawer
Copy link
Collaborator

@hweawer hweawer commented Nov 24, 2025

PR Description

Problem

Clients frequently experience "Failed to resolve dependencies in build-index" errors during image pushes, particularly for images with large blob layers. This causes push operations to fail and requires manual retries.

Root Cause

Race condition between manifest upload and dependency resolution:

When a Docker client pushes an image manifest:

  1. Manifest blob begins uploading to the origin cluster
  2. Build-index putTag handler is triggered immediately (via nginx/registry)
  3. Build-index attempts to download the manifest to resolve layer dependencies
  4. Manifest blob hasn't finished uploading yet → returns 404 Not Found
  5. Build-index fails immediately without retry

Timeline from production logs:

12:43:49 UTC - origin (phx60-c37): "Blob not found in backend" - initiating download
12:43:49 UTC - origin (phx60-ayz): Starting cluster upload
12:43:50 UTC - build-index: "Putting tag" → attempts dependency resolution
12:43:50 UTC - build-index: "Failed to resolve dependencies" ← FAILS
12:43:50 UTC - origin (phx60-c3w): Successfully committed upload ← TOO LATE

The manifest was being uploaded successfully, but build-index queried for it before the upload completed.

Solution

Implement exponential backoff retry mechanism for manifest downloads during dependency resolution to gracefully handle the upload race condition.

Changes Made

1. build-index/tagtype/docker_resolver.go - Add Retry Logic

  • Use backoff.Retry() to retry manifest downloads with exponential backoff
  • Reset buffer between retry attempts to prevent data corruption
  • Add retry attempt logging for observability
  • Skip retries on permanent errors (parse errors, non-retryable HTTP errors)
  • Return detailed error messages with attempt count

2. build-index/tagtype/map.go - Configure Backoff Parameters

Retry schedule:
Attempt 1: immediate (0ms)
Attempt 2: +500ms (total: 0.5s)
Attempt 3: +1s (total: 1.5s)
Attempt 4: +2s (total: 3.5s)
Total time: ~3.5s with 4 attempts

This provides ~15 seconds for origin cluster to complete manifest upload and replication.

Impact

Before

  • ❌ ~10% error rate on large image pushes during peak traffic
  • ❌ No retry on transient upload race conditions
  • ❌ Immediate failure on 404 responses
  • ❌ Users forced to manually retry entire push operation
  • ❌ Poor user experience during origin cluster replication delays

After

  • <1% expected error rate (only on genuine failures)
  • ✅ Automatic retry with exponential backoff
  • ✅ Handles upload race conditions gracefully
  • ✅ Transparent recovery from transient failures
  • ✅ Better observability via retry logging

Copilot AI review requested due to automatic review settings November 24, 2025 19:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds retry logic with exponential backoff for manifest downloads during dependency resolution to handle transient failures in distributed systems. The changes also improve error messages throughout the affected code by using error wrapping with %w instead of %s.

Key changes:

  • Implemented retry mechanism in dockerResolver.downloadManifest() with configurable exponential backoff (5 retries, up to 30s max interval)
  • Added error wrapping for better error context propagation across multiple layers
  • Improved code organization with const grouping

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
utils/dockerutil/dockerutil.go Groups manifest type constants and improves error wrapping in GetManifestReferences
build-index/tagtype/map.go Configures exponential backoff settings (500ms initial, 2x multiplier, 5 retries) for docker resolver initialization
build-index/tagtype/docker_resolver.go Implements core retry logic with backoff for manifest downloads, handling retryable errors, network errors, and ErrBlobNotFound; adds retry attempt logging
build-index/tagserver/server.go Enhances error messages in putTagHandler with consistent "put tag handler" prefix for better error traceability

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

replicate, err := strconv.ParseBool(httputil.GetQueryArg(r, "replicate", "false"))
if err != nil {
return handler.Errorf("parse query arg `replicate`: %s", err)
return fmt.Errorf("put tag handler: parse query arg `replicate`: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we making this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has handler.Errorf at the same time same thing happens in handler.Wrap - all the hadlers wrapped with this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a backoff the correct solution for this problem?

If we zoom out, Kraken uses 2 levels of write-back caches for image uploads:

  1. The user uploads (the manifest/blobs) to proxy, after which proxy returns 200. The proxy is then responsible for writing-back to the origin cluster.
  2. The origin cluster is responsible for writing-back to GCS, after receiving the blobs from proxy.

From what I understand, the error occurs, when the blobs are still in proxy (and not in the origins), but build-index tries to query for them in the origin cluster, which fails, as they are neither there, nor in GCS.

Ideally, Kraken would check its own write-back cache (proxy) before checking the DB (origins/GCS), but perhaps that adds more complexity than necessary (BI doesn't call proxy atm). A backoff is simpler.

We can't remove the check either, as it's part of the docker registry v2 API spec. Have we considered making the first cache layer (proxy) write-through, instead of write-back? I.e. proxy wouldn't return 200 until the blobs are written to the origin cluster. What's the downside of that approach?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that it is more complex right?

Maybe we can just fix this problem with retries?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tradeoff with the Write through approach is that you are bottlenecked by how fast proxy can write blobs to origin.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see how retries impact the existing failures, I have filed a follow up Issue here #505 in order to investigate the Write through behaviour.

Copilot AI review requested due to automatic review settings November 25, 2025 16:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings November 25, 2025 17:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hweawer hweawer requested a review from Pennywise007 November 26, 2025 14:48
for i, dep := range deps {
for _, dep := range deps {
if _, err := s.localOriginClient.Stat(tag, dep); err == blobclient.ErrBlobNotFound {
log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing these logs, are they noisy ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it because the log is anyway written here

if err := s.putTag(tag, d, deps); err != nil {
		log.With("tag", tag, "digest", d.String(), "error", err).Error("Failed to put tag")
		return err
	}

So it will be duplicated

Copy link
Collaborator

@gkeesh7 gkeesh7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with some comments, I want to assess how much improvements do we get with retries. Please add some observability around number of manifest downloads which succeed with additional retries, as a follow up.

log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
return handler.Errorf("cannot upload tag, missing dependency %s", dep)
return fmt.Errorf("cannot upload tag, missing dependency %s", dep)
} else if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing these logs, are they noisy ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as here #502 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tradeoff with the Write through approach is that you are bottlenecked by how fast proxy can write blobs to origin.

sr = &subResolver{re, &dockerResolver{originClient}}
backoffConfig := httputil.ExponentialBackOffConfig{
Enabled: true,
InitialInterval: 500 * time.Millisecond,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we export these magic numbers to constants instead ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to supply them from configuration file as well. Filing an Issue for this as well #506

Copy link
Collaborator Author

@hweawer hweawer Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general yes, but seems it is a very rare case, most of the pushes success without retries. I think we need some general policy of retries to dependencies and analyse all the places where it is worth doing this, then reuse config across all the calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see how retries impact the existing failures, I have filed a follow up Issue here #505 in order to investigate the Write through behaviour.

return err
}

if err := backoff.Retry(retryFunc, r.backoffConfig.Build()); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the caller timeout/hang up by the time we are doing retries ? Can you comment on what timeouts we have from the caller side while downloading the manifests ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed build-index can timeout before the retries, increased the timeout for the build-index

Comment on lines 245 to 253
require := require.New(t)
ctrl := gomock.NewController(t)
defer ctrl.Finish()

resolver, mockOrigin := newTestDockerResolver(ctrl)

tag := "repo/image:v1.0"
layers := core.DigestListFixture(3)
manifest, manifestBytes := dockerutil.ManifestFixture(layers[0], layers[1], layers[2])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are repeated over and over, try exporting them to a helper/setup function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gkeesh7
Copy link
Collaborator

gkeesh7 commented Nov 27, 2025

#507 Follow up for observability.

Copilot AI review requested due to automatic review settings November 28, 2025 09:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hweawer hweawer merged commit 9213176 into master Nov 28, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants