suggestion: without exponential backoff #1970
Conversation
threadworker.go (outdated)

```go
}

// wait a bit and try again
time.Sleep(time.Millisecond * 250)
```
Isn't it better to use an exponential backoff strategy here? 😅
makes sense, I'll add a minimal version
Thinking about it, I'm not sure an exponential backoff will help with either script failures when watching or external resource failures. In both cases the time-to-resolution would probably be in the range of seconds.
I'm not against keeping it, though.
For external services, exponential backoff prevents them from being flooded with requests when they come back up.
Sometimes (see the recent AWS issues), these outages take hours to be fixed.
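For illustration, a minimal sketch of the kind of capped exponential backoff being suggested here; the helper name, starting delay, and cap are assumptions for the example, not code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// nextBackoff doubles the previous delay and caps it, so a long outage
// (e.g. an external service that is down for hours) does not grow the
// wait unbounded.
func nextBackoff(prev, maxDelay time.Duration) time.Duration {
	if prev <= 0 {
		return 100 * time.Millisecond
	}
	next := prev * 2
	if next > maxDelay {
		return maxDelay
	}
	return next
}

func main() {
	var delay time.Duration
	for i := 0; i < 6; i++ {
		delay = nextBackoff(delay, 5*time.Second)
		fmt.Println(delay) // 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s
	}
}
```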
I think I like this better. Lower complexity, and the failure case (frankenphp_handle_request not reached) is extremely unlikely to be solved by trying again.
I wonder if it's really a good idea to crash the server because a single worker script fails at startup. App booting may fail because an external service is down, for instance, and crashing the server just for that is a bit too much IMHO. Also, apps may run dozens of different worker scripts; preventing the whole server from starting because one of them is crashing (for instance, because a remote API is down) could be unwanted.
This PR doesn't change the logic for startup failures; they will still fail immediately. There aren't many alternatives: workers have to reach a 'ready' state, otherwise the server cannot start accepting requests.
Just failing immediately is the easiest solution from our side IMO. Users can always configure some kind of process/container supervision if they expect random startup failures on deployments (which they should do anyway). If an external service that is needed for startup is down, then the expected behavior should be to fail startup. That being said, it would definitely be possible to start the server in a half-broken state where some workers might be failing. Maybe by marking some workers as 'essential' and others as 'non-essential'. Or by adding a
I agree. Not to mention, nobody should be connecting to (possibly failing) outside resources in their worker startup script.
I don't think we even need that extra complexity. Startup scripts should handle that on their own; we don't need an extra method for it.
True, if an external resource is allowed to fail, the application should actually handle that itself.
@henderkes I've already seen a lot of apps retrieving secrets from HashiCorp Vault, config from etcd, cached data from Redis, feature flags from SaaS like Unleash, or translations from Lokalize when booting. I think it's pretty common, and, while failures can (and should) be handled in userland, it would be nice to be as convenient as possible when a service like that is down. For instance, the non-worker mode will not hard-fail if something like that happens. It will just return an error (likely a 500) until the service is up again. IMHO, it would be nice to have a similar behavior when using the worker mode.
I agree, but I don't see how it would be possible on our side. Should we automatically mark workers as ready even though they've never reached frankenphp_handle_request? FrankenPHP's current behaviour is to fail if a worker fails too often on startup. I can see the value in fetching secrets and the like once on worker bootup, but then they couldn't serve their sites in regular mode at all. So they'd be best advised to retry those calls in the request handling and not rely on them solely at worker boot.
I suppose what we could do is retry booting the worker script on requests, meaning FrankenPHP would stay up even though worker scripts have failed. But then we're not giving users any indication of what's wrong until they look at the logs.
I guess there's an argument to be made if you spam these services with 50 workers on startup 😅. Alright, we can keep the backoff. I'll change it so we keep the error instead of panicking, since that's actually testable.
```go
backoffDuration := time.Duration(handler.failureCount*handler.failureCount*100) * time.Millisecond
if backoffDuration > time.Second {
	backoffDuration = time.Second
}
handler.failureCount++
time.Sleep(backoffDuration)
```
The actual backoff logic is just six lines of code, so probably no module or library is necessary.
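For reference, a self-contained sketch reproducing the capped quadratic backoff quoted above, just to show the delays it produces; the standalone program and the starting count of zero are assumptions for the illustration:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Same capped quadratic backoff as in the diff above: failureCount
	// squared, in hundreds of milliseconds, capped at one second.
	failureCount := 0
	for i := 0; i < 6; i++ {
		backoff := time.Duration(failureCount*failureCount*100) * time.Millisecond
		if backoff > time.Second {
			backoff = time.Second
		}
		failureCount++
		fmt.Println(backoff) // 0s, 100ms, 400ms, 900ms, 1s, 1s
	}
}
```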
Maybe there's merit in revisiting the failure logic at some point. This branch only makes the failure testable instead of panicking, so I'll merge it into #1955 for now.
Suggestion for #1955: this is what a solution without backoff would look like. Startup failures will immediately return an error on Init().
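Below is a hedged sketch of what consuming that error could look like from the embedding Go side; the frankenphp.Init / frankenphp.Shutdown calls and the import path are assumptions based on the library's public Go API, not code from this PR:

```go
package main

import (
	"log"

	"github.com/dunglas/frankenphp"
)

func main() {
	// If a worker script cannot reach a ready state, Init() returns an
	// error instead of panicking, so the embedder decides how to react.
	if err := frankenphp.Init(); err != nil {
		log.Fatalf("frankenphp failed to start: %v", err)
	}
	defer frankenphp.Shutdown()

	// ... register handlers and serve requests ...
}
```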