fix: retry middleware corrupts exponential backoff state #690

Open
Abzaek wants to merge 1 commit into ThreeDotsLabs:master from Abzaek:fix/retry-backoff-corruption

Abzaek commented May 17, 2026

Bug

The Retry middleware in message/router/middleware/retry.go corrupts the exponential backoff state machine by calling expBackoff.NextBackOff() inside the operation function.

Root Cause

The operation function passed to backoff.Retry() was calling expBackoff.NextBackOff() for two purposes:

  1. To get the delay value for the ShouldRetry callback
  2. To get the delay value for the OnRetryHook callback

Each call to NextBackOff() advances the backoff's internal state. When backoff.Retry() itself then calls NextBackOff() to determine the actual sleep duration, it receives a later interval than the one it should have used.

Impact

Retry delays were double what they should be. With InitialInterval=10ms, Multiplier=2.0:

  • Expected: 10ms, 20ms, 40ms, 80ms
  • Actual (buggy): 20ms, 40ms, 80ms, 160ms

Fix

Replaced NextBackOff() calls with a new pure function computeBackoffDelay_atRetryNum() that computes the approximate delay from the config parameters without mutating any state. This lets backoff.Retry() continue to drive actual sleep durations correctly.

Test Results

All existing tests pass (go test ./... — all green).

Fixes #224 (partial)

… in operation function

The Retry middleware's operation function called expBackoff.NextBackOff()
inside ShouldRetry and OnRetryHook callbacks, which advanced the
exponential backoff's internal state machine. When backoff.Retry then
called NextBackOff() to determine the actual sleep duration, it received
a future interval, effectively skipping one or more intermediate delays.

For example, with InitialInterval=10ms, Multiplier=2.0, the sequence
of actual sleep durations was: 20ms, 40ms, 80ms, 160ms instead of
the correct: 10ms, 20ms, 40ms, 80ms.

The fix computes the approximate delay from the retry configuration
without mutating any backoff state, preserving the correct exponential
backoff sequence used by the backoff.Retry framework.

Signed-off-by: Abdulazez A. <[email protected]>

Development

Successfully merging this pull request may close these issues.

[watermill-sql] Nacked message stays within one consumer
