fix(build-index): Add retry logic for manifest download during dependency resolution #502
Conversation
Pull request overview
This PR adds retry logic with exponential backoff for manifest downloads during dependency resolution to handle transient failures in distributed systems. The changes also improve error messages throughout the affected code by using error wrapping with %w instead of %s.
Key changes:
- Implemented retry mechanism in dockerResolver.downloadManifest() with configurable exponential backoff (5 retries, up to 30s max interval)
- Added error wrapping for better error context propagation across multiple layers
- Improved code organization with const grouping
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| utils/dockerutil/dockerutil.go | Groups manifest type constants and improves error wrapping in GetManifestReferences |
| build-index/tagtype/map.go | Configures exponential backoff settings (500ms initial, 2x multiplier, 5 retries) for docker resolver initialization |
| build-index/tagtype/docker_resolver.go | Implements core retry logic with backoff for manifest downloads, handling retryable errors, network errors, and ErrBlobNotFound; adds retry attempt logging |
| build-index/tagserver/server.go | Enhances error messages in putTagHandler with consistent "put tag handler" prefix for better error traceability |
build-index/tagserver/server.go (outdated diff)
```diff
 	replicate, err := strconv.ParseBool(httputil.GetQueryArg(r, "replicate", "false"))
 	if err != nil {
-		return handler.Errorf("parse query arg `replicate`: %s", err)
+		return fmt.Errorf("put tag handler: parse query arg `replicate`: %w", err)
```
Why are we making this change?
The code has handler.Errorf, but the same thing already happens in handler.Wrap - all the handlers are wrapped with it.
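For context, a minimal standalone illustration of why %w matters (not Kraken code; errBlobNotFound here is just a stand-in sentinel): wrapping with %w keeps the original error in the chain, so callers can still match it with errors.Is, whereas %s flattens it to a string.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlobNotFound is a stand-in sentinel error for this illustration.
var errBlobNotFound = errors.New("blob not found")

func main() {
	wrapped := fmt.Errorf("put tag handler: %w", errBlobNotFound)
	flattened := fmt.Errorf("put tag handler: %s", errBlobNotFound)

	fmt.Println(errors.Is(wrapped, errBlobNotFound))   // true: the cause is preserved
	fmt.Println(errors.Is(flattened, errBlobNotFound)) // false: only the message survives
}
```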
Is a backoff the correct solution for this problem?
If we zoom out, Kraken uses 2 levels of write-back caches for image uploads:
- The user uploads (the manifest/blobs) to proxy, after which proxy returns 200. The proxy is then responsible for writing-back to the origin cluster.
- The origin cluster is responsible for writing-back to GCS, after receiving the blobs from proxy.
From what I understand, the error occurs when the blobs are still in proxy (and not in the origins), but build-index tries to query for them in the origin cluster, which fails, as they are neither there nor in GCS.
Ideally, Kraken would check its own write-back cache (proxy) before checking the DB (origins/GCS), but perhaps that adds more complexity than necessary (BI doesn't call proxy atm). A backoff is simpler.
We can't remove the check either, as it's part of the docker registry v2 API spec. Have we considered making the first cache layer (proxy) write-through, instead of write-back? I.e. proxy wouldn't return 200 until the blobs are written to the origin cluster. What's the downside of that approach?
Seems that it is more complex, right?
Maybe we can just fix this problem with retries?
The tradeoff with the write-through approach is that you are bottlenecked by how fast proxy can write blobs to origin.
I would like to see how retries impact the existing failures. I have filed a follow-up issue, #505, to investigate the write-through behaviour.
```diff
-	for i, dep := range deps {
+	for _, dep := range deps {
 		if _, err := s.localOriginClient.Stat(tag, dep); err == blobclient.ErrBlobNotFound {
-			log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
```
Why are we removing these logs? Are they noisy?
I removed it because the log is already written here:

```go
if err := s.putTag(tag, d, deps); err != nil {
	log.With("tag", tag, "digest", d.String(), "error", err).Error("Failed to put tag")
	return err
}
```

So it would be duplicated.
gkeesh7 left a comment:
Approved with some comments. I want to assess how much improvement we get with retries. Please add some observability around the number of manifest downloads that succeed after additional retries, as a follow-up.
```diff
-			log.With("tag", tag, "digest", d.String(), "missing_dependency", dep.String(), "dependency_index", i).Error("Missing dependency blob")
-			return handler.Errorf("cannot upload tag, missing dependency %s", dep)
+			return fmt.Errorf("cannot upload tag, missing dependency %s", dep)
 		} else if err != nil {
```
Why are we removing these logs? Are they noisy?
Same as here #502 (comment)
build-index/tagtype/map.go (outdated diff)
```diff
-		sr = &subResolver{re, &dockerResolver{originClient}}
+		backoffConfig := httputil.ExponentialBackOffConfig{
+			Enabled:         true,
+			InitialInterval: 500 * time.Millisecond,
```
Can we export these magic numbers to constants instead?
It would be better to supply them from the configuration file as well. Filing an issue for this too: #506.
In general yes, but it seems to be a very rare case; most of the pushes succeed without retries. I think we need a general policy for retrying calls to dependencies, an analysis of all the places where it is worth doing this, and then a config that is reused across all those calls.
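For reference, a rough sketch of how the values in this PR (500ms initial interval, 2x multiplier, 5 retries, 30s max interval) map onto the underlying cenkalti/backoff library. The real change goes through Kraken's httputil.ExponentialBackOffConfig wrapper, so the function name and field names below follow the upstream library, not necessarily the wrapper.

```go
package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff"
)

// newManifestDownloadBackOff mirrors the settings described in this PR
// (illustrative only; the name is not from the actual change).
func newManifestDownloadBackOff() backoff.BackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 500 * time.Millisecond
	b.Multiplier = 2
	b.MaxInterval = 30 * time.Second
	b.RandomizationFactor = 0            // disable jitter so the printed schedule is deterministic
	return backoff.WithMaxRetries(b, 5)  // at most 5 retries after the initial attempt
}

func main() {
	b := newManifestDownloadBackOff()
	for d := b.NextBackOff(); d != backoff.Stop; d = b.NextBackOff() {
		fmt.Println(d) // ~500ms, ~1s, ~2s, ~4s, ~8s
	}
}
```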
```go
		return err
	}

	if err := backoff.Retry(retryFunc, r.backoffConfig.Build()); err != nil {
```
Will the caller time out/hang up by the time we are doing retries? Can you comment on what timeouts we have from the caller side while downloading the manifests?
Indeed, build-index could time out before the retries complete; I increased the timeout for build-index.
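To make the retry shape concrete, here is a self-contained sketch of the pattern under review; downloadManifest, isRetryable, and errBlobNotFound are simplified stand-ins for the real helpers and for blobclient.ErrBlobNotFound. The key point is that only transient failures (network errors, or blob-not-found while the upload may still be in flight) are retried, and anything else is wrapped in backoff.Permanent so the retry loop stops immediately.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"

	"github.com/cenkalti/backoff"
)

// errBlobNotFound stands in for blobclient.ErrBlobNotFound.
var errBlobNotFound = errors.New("blob not found")

// downloadManifest stands in for the real origin-cluster download.
func downloadManifest() ([]byte, error) {
	return nil, errBlobNotFound
}

// isRetryable treats network errors and missing blobs (the upload may still
// be in flight) as transient; everything else is permanent.
func isRetryable(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) || errors.Is(err, errBlobNotFound)
}

func downloadManifestWithRetry(b backoff.BackOff) ([]byte, error) {
	var manifest []byte
	attempt := 0
	op := func() error {
		attempt++
		m, err := downloadManifest()
		if err != nil {
			if isRetryable(err) {
				fmt.Printf("attempt %d failed, will retry: %v\n", attempt, err)
				return err // backoff.Retry schedules another attempt
			}
			return backoff.Permanent(err) // stop retrying immediately
		}
		manifest = m
		return nil
	}
	if err := backoff.Retry(op, b); err != nil {
		return nil, fmt.Errorf("download manifest: %w", err)
	}
	return manifest, nil
}

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 100 * time.Millisecond
	_, err := downloadManifestWithRetry(backoff.WithMaxRetries(b, 2))
	fmt.Println("final error:", err)
}
```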
```go
	require := require.New(t)
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	resolver, mockOrigin := newTestDockerResolver(ctrl)

	tag := "repo/image:v1.0"
	layers := core.DigestListFixture(3)
	manifest, manifestBytes := dockerutil.ManifestFixture(layers[0], layers[1], layers[2])
```
These lines are repeated over and over; try extracting them into a helper/setup function.
Done
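As a generic illustration of the suggested pattern (types and names here are hypothetical, not the actual helper added in this PR), a setup function can own the gomock controller and the shared fixtures so each test body only contains what is specific to that test:

```go
package example

import (
	"testing"

	"github.com/golang/mock/gomock"
)

// fixtures bundles setup shared by the resolver tests. In the real code this
// would also hold the resolver, mock origin client, and manifest fixtures.
type fixtures struct {
	ctrl *gomock.Controller
	tag  string
}

// setupTest creates the shared fixtures and registers cleanup on t, replacing
// the repeated NewController/Finish/fixture lines in each test.
func setupTest(t *testing.T) *fixtures {
	ctrl := gomock.NewController(t)
	t.Cleanup(ctrl.Finish)
	return &fixtures{
		ctrl: ctrl,
		tag:  "repo/image:v1.0",
	}
}
```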
#507: Follow-up for observability.
PR Description
Problem
Clients frequently experience "Failed to resolve dependencies in build-index" errors during image pushes, particularly for images with large blob layers. This causes push operations to fail and requires manual retries.
Root Cause
Race condition between manifest upload and dependency resolution:
When a Docker client pushes an image manifest, the putTag handler is triggered immediately (via nginx/registry) and build-index starts resolving the manifest's dependencies, potentially before the manifest upload to the origin cluster has completed.
Timeline from production logs: the manifest was being uploaded successfully, but build-index queried for it before the upload completed.
Solution
Implement exponential backoff retry mechanism for manifest downloads during dependency resolution to gracefully handle the upload race condition.
Changes Made
1. build-index/tagtype/docker_resolver.go - Add retry logic: use backoff.Retry() to retry manifest downloads with exponential backoff.
2. build-index/tagtype/map.go - Configure the backoff parameters (500ms initial interval, 2x multiplier, 5 retries).
Retry schedule:
- Attempt 1: immediate
- Attempt 2: +500ms (total: ~0.5s)
- Attempt 3: +1s (total: ~1.5s)
- Attempt 4: +2s (total: ~3.5s)
- Attempt 5: +4s (total: ~7.5s)
- Attempt 6: +8s (total: ~15.5s)
Total time: ~15.5s across 6 attempts (the initial attempt plus 5 retries).
This provides ~15 seconds for the origin cluster to complete the manifest upload and replication.
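As a sanity check on the schedule above (ignoring the library's randomization jitter), the cumulative wait before the final attempt is a geometric sum:

$$\sum_{k=0}^{4} 0.5\,\mathrm{s} \cdot 2^{k} = 0.5\,\mathrm{s}\,(2^{5}-1) = 15.5\,\mathrm{s}$$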
Impact
Before
After