[filebeat][streaming] - Improved websocket exponential backoff logic to produce a smoother backoff curve #44069

Merged
ShourieG merged 14 commits into elastic:main from ShourieG:websocket/enhancement_backoff
May 26, 2025

Conversation

@ShourieG
Contributor

@ShourieG ShourieG commented Apr 25, 2025

Type of change

  • Enhancement

Proposed commit message

The previous waitTime calculation for the exponential backoff strategy
produced an extremely sharp curve: depending on the values of waitMin and
waitMax and the number of attempts, waitMax (the cap) was reached after the
first couple of attempts, so the wait time stopped growing long before the
attempts ran out. Put simply, the waitTime growth curve hit the cap and
flattened out after 1-2 retry attempts because the jitter was uncapped. This
change makes the waitTime growth curve rise more gradually with the number of
attempts, giving a smoother backoff function.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Existing Backoff Curve: [image: graph_before]

New Backoff Curve: [image: new_graph_after]

Logs

@ShourieG ShourieG self-assigned this Apr 25, 2025
@ShourieG ShourieG requested a review from a team as a code owner April 25, 2025 10:43
@elasticmachine
Contributor

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@botelastic botelastic Bot added and then removed the needs_team label (indicates that the issue/PR needs a Team:* label) Apr 25, 2025
@mergify
Contributor

mergify Bot commented Apr 25, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@ShourieG ShourieG requested a review from efd6 April 28, 2025 03:14
Contributor

@efd6 efd6 left a comment

This seems ok, but is it solving a non-aesthetics problem?

Comment thread x-pack/filebeat/input/streaming/websocket.go Outdated
@ShourieG
Contributor Author

This seems ok, but is it solving a non-aesthetics problem?

So this is purely an optimisation to the existing growth curve that was present. The existing logic is functional and works well, but I personally find it a bit too harsh in how it reaches the wait max cap after just 1-2 attempts. The uncapped jitter is the issue. I think I can make the jitter grow more steadily than the current calculation does. Would that align more with what you would expect?

@efd6
Contributor

efd6 commented Apr 28, 2025

So this is purely an optimisation to the existing growth curve that was present.

OK, so I'm wondering what the objective function for the optimisation is.

However, if you think it's worth it, this is a sound function that does essentially what you want. https://go.dev/play/p/IegHOG0qsyX

func wait(min, max time.Duration, spread float64, i, n int) time.Duration {
	if min >= max {
		panic("nope")
	}
	l := logistic(i, n)
	return min + time.Duration(float64(max-min)*(l+spread*jitter(l)))
}

func logistic(i, n int) float64 {
	return 1 / (1 + math.Exp(float64(n)/2-float64(i)))
}

func jitter(f float64) float64 {
	return (rand.Float64() - 0.5) * f * (1 - f)
}

If spread is greater than one, then clamping will be required.

This is the basis curve, jitter uses co-proportional scaling to keep it sensible.

Spread is the jitter spread, n is the number of iterations to reach max and i is the iter we're on. The rest should be self-explanatory.

@ShourieG
Contributor Author

@efd6, just for clarity: the objective is for the wait time to increase steadily with the number of attempts, rather than hitting the cap (wait max) after 1-2 retry attempts as it does right now.

@efd6
Contributor

efd6 commented Apr 28, 2025

Yeah, got that. I'm just wondering why. I'm happy to have nice maths here, but I don't think the machines care.

@ShourieG
Contributor Author

ShourieG commented Apr 28, 2025

Yeah, got that. I'm just wondering why. I'm happy to have nice maths here, but I don't think the machines care.

Yeah, just a personal preference here; the current behaviour was rubbing me the wrong way.

@efd6
Contributor

efd6 commented Apr 28, 2025

That's what I thought. I'm fine with that. Are you going to use the logistic backoff I posted or keep what you have? If you want to keep what you have, I'll need to look at it again tomorrow.

@ShourieG
Contributor Author

ShourieG commented Apr 28, 2025

That's what I thought. I'm fine with that. Are you going to use the logistic backoff I posted or keep what you have? If you want to keep what you have, I'll need to look at it again tomorrow.

I need to test out some scenarios with the code you posted and check the growth curve it produces with some real-world values (I like the base curve that you linked); at the same time I wanted to keep the math simple so it is easy to read and decipher. I will make a decision on this soon. There's no rush here, so I'll finish off a couple of other pending tasks and come back to this.

@efd6
Contributor

efd6 commented Apr 28, 2025

at the same time I wanted to keep the math simple so it is easy to read and decipher

The basis for the code is well known. It relies on two features: logistic growth, and the fact that x(1-x) lies below both x and (1-x) for x in [0, 1]. math.Exp is almost certainly cheaper than math.Pow, since Pow calls Exp. If you want, I can write a comment for it.

@ShourieG
Contributor Author

Yea comments would be great.

@efd6
Contributor

efd6 commented Apr 29, 2025

// wait returns a logistic backoff duration with jitter. The duration increases
// from min to max in n steps, with i indicating the step. Jitter is added around
// the logistic based on the value of spread. Zero spread results in no jitter,
// and unit spread is maximal. Spread values above one may result in durations
// outside [min, max]. min must be less than max.
func wait(min, max time.Duration, i, n int, spread float64) time.Duration {
	if min >= max {
		panic("nope")
	}
	l := logistic(i, n-1) // n-1 because of fence posts.
	return min + time.Duration(float64(max-min)*(l+spread*jitter(l)))
}

// logistic returns the ith value of n of the logistic function shifted
// n/2 right. The returned value is in (0, 1) for all sensible values.
func logistic(i, n int) float64 {
	return 1 / (1 + math.Exp(float64(n)/2-float64(i)))
}

// jitter returns a random offset to apply around f, drawn uniformly from
// [-eps/2, eps/2), where eps is f(1-f). f must be in [0, 1].
func jitter(f float64) float64 {
	return (rand.Float64() - 0.5) * f * (1 - f)
}

@ShourieG
Contributor Author

@efd6, I've updated the backoff logic with your suggestions

Comment thread x-pack/filebeat/input/streaming/websocket.go Outdated
@ShourieG
Contributor Author

@efd6, I updated the spread to 1.0 and this seems good. Thanks for helping out on this; it was very nice to get a deeper mathematical perspective.

@ShourieG
Contributor Author

/test

@ShourieG ShourieG merged commit a684f00 into elastic:main May 26, 2025
29 of 33 checks passed
@ShourieG ShourieG deleted the websocket/enhancement_backoff branch May 26, 2025 10:41