Conversation

@AlliBalliBaba (Contributor) commented Nov 11, 2025

Suggestion for #1955: this is what a solution without backoff would look like.

Startup failures will immediately return an error on Init().


threadworker.go Outdated
}

// wait a bit and try again
time.Sleep(time.Millisecond * 250)
Member


Isn't it better to use an exponential back off strategy here? 😅

Contributor Author


makes sense, I'll add a minimal version

Contributor Author


Thinking about it, I'm not sure if having an exponential wait backoff will help with either script failures when watching or external resource failures. In both cases the time-to-resolution would probably be in the range of seconds.

I'm not against keeping it though

Member


For external services, exponential backoff prevents those services from being flooded with requests when they come back up.

Sometimes (see the recent AWS issues), these outages take hours to be fixed.
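
To make the flooding concern concrete, here is a minimal, hypothetical sketch (not code from this PR) of a capped exponential backoff with jitter in Go; `backoffDelay` and its parameters are illustrative names only:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the wait before retry number "attempt": it doubles on
// every failure, is capped at maxWait so long outages don't cause unbounded
// sleeps, and adds up to ~25% random jitter so many workers retrying at the
// same time don't flood the service the moment it comes back up.
func backoffDelay(attempt int, base, maxWait time.Duration) time.Duration {
	d := base << attempt // base * 2^attempt
	if d <= 0 || d > maxWait {
		d = maxWait
	}
	return d + time.Duration(rand.Int63n(int64(d)/4+1))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(backoffDelay(attempt, 250*time.Millisecond, 10*time.Second))
	}
}
```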

@henderkes (Contributor)

I think I like this better. Lower complexity, and the failure case (frankenphp_handle_request not reached) is extremely unlikely to be solved by trying again.

@dunglas (Member) commented Nov 12, 2025

I wonder if it's really a good idea to crash the server because a single worker script fails at startup. App booting may fail because an external service is down, for instance, and crashing the server just for that is a bit too much IMHO.

Also, apps may run dozens of different worker scripts; preventing the whole server from starting because one of them is crashing (for instance, because a remote API is down) could be unwanted.

@AlliBalliBaba (Contributor, Author)

This PR doesn't change the logic around startup failures; it will still just fail immediately. There aren't many alternatives: workers have to reach a 'ready' state, otherwise the server cannot start accepting requests.

@AlliBalliBaba (Contributor, Author) commented Nov 12, 2025

Just failing immediately is the easiest solution from our side IMO. Users can always configure some kind of process/container supervision if they expect random startup failures on deployments (which they should do anyway).

If an external service that is needed for startup is down, then the expected behavior should be to fail startup.

That being said, it would definitely be possible to start the server in a half-broken state where some workers might be failing. Maybe by marking some workers as 'essential' and others as 'non-essential', or by adding a frankenphp_set_ready() function. That's something that goes beyond this PR though.
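
As a purely hypothetical illustration of the 'essential' vs. 'non-essential' idea (none of these types or functions exist in FrankenPHP today), a startup loop could look roughly like this:

```go
package worker

import (
	"fmt"
	"log"
)

// Hypothetical sketch only: neither this workerScript type nor the
// "essential" flag exist in FrankenPHP; it just illustrates letting
// non-essential workers fail without blocking server startup.
type workerScript struct {
	name      string
	essential bool // a non-essential worker may fail without aborting startup
	start     func() error
}

func startWorkers(scripts []workerScript) error {
	for _, w := range scripts {
		if err := w.start(); err != nil {
			if w.essential {
				// an essential worker failing still aborts server startup
				return fmt.Errorf("worker %s failed to start: %w", w.name, err)
			}
			// non-essential failures leave the server up, but in a degraded state
			log.Printf("worker %s failed to start, continuing without it: %v", w.name, err)
		}
	}
	return nil
}
```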

@henderkes (Contributor)

> Just failing immediately is the easiest solution from our side IMO. Users can always configure some kind of process/container supervision if they expect random startup failures on deployments (which they should do anyway).

I agree. Not to mention, nobody should be connecting to (possibly failing) outside resources in their worker startup script.

> That being said, it would definitely be possible to start the server in a half-broken state where some workers might be failing. Maybe by marking some workers as 'essential' and others as 'non-essential', or by adding a frankenphp_set_ready() function.

I don't think we even need that extra complexity. Startup scripts should handle that on their own; we don't need an extra method for it.

@AlliBalliBaba (Contributor, Author)

True, if an external resource is allowed to fail, the application should actually handle that itself.

@dunglas (Member) commented Nov 12, 2025

@henderkes I've already seen a lot of apps retrieving secrets from HashiCorp Vault, config from etcd, cached data from Redis, feature flags from SaaS like Unleash, or translations from Lokalize when booting.

I think it's pretty common, and while failures can (and should) be handled in userland, it would be nice to be as convenient as possible when a service like that is down.

For instance, the non-worker mode will not hard-fail if something like that happens. It will just return an error (likely a 500) until the service is up again.

IMHO, it would be nice to have a similar behavior when using the worker mode.

@henderkes (Contributor)

> IMHO, it would be nice to have a similar behavior when using the worker mode.

I agree, but I don't see how it would be possible on our side. Should we automatically mark workers as ready even though they've never reached frankenphp_handle_request, or should we mark them as inactive, not ready to pass requests to, while periodically retrying their initial bootup?

FrankenPHP's current behaviour is to fail if a worker fails too often on startup. I can see the value in getting secrets and the like once on worker bootup, but apps that rely on that couldn't serve their sites in regular (non-worker) mode at all. So they'd be best advised to retry those calls during request handling and not rely on them solely at worker boot.

@henderkes (Contributor)

I suppose what we could do is retry booting the worker script on incoming requests, meaning FrankenPHP would stay up even though worker scripts have failed. But then we're not giving users any indication of what's wrong until they look in the logs.

@AlliBalliBaba (Contributor, Author)

> @henderkes I've already seen a lot of apps retrieving secrets from HashiCorp Vault, config from etcd, cached data from Redis, feature flags from SaaS like Unleash, or translations from Lokalize when booting.

I guess there's an argument to be made if you spam these services with 50 workers on startup 😅. Alright, we can keep the backoff.

I'll change it so we keep the error instead of panicking, since that's actually testable.

Comment on lines +161 to +166
backoffDuration := time.Duration(handler.failureCount*handler.failureCount*100) * time.Millisecond
if backoffDuration > time.Second {
	backoffDuration = time.Second
}
handler.failureCount++
time.Sleep(backoffDuration)
Contributor Author


The actual backoff logic is just 6 lines of code, so probably no extra module or library is necessary.
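
For context, a rough sketch of how those six lines could sit in a worker restart loop; restartLoop, workerHandler and runScript are hypothetical stand-ins for the PR's actual code. With failureCount starting at 0, the waits grow as failureCount² × 100 ms: 0 ms, 100 ms, 400 ms, 900 ms, then capped at one second:

```go
package worker

import "time"

type workerHandler struct {
	failureCount int
}

// Sketch only: restartLoop and runScript are hypothetical stand-ins for the
// PR's actual functions; the backoff in the middle is the quoted logic.
func restartLoop(handler *workerHandler, runScript func() error) {
	for {
		if err := runScript(); err == nil {
			handler.failureCount = 0 // a successful run resets the backoff
			continue
		}
		// wait failureCount² × 100ms, capped at one second, before retrying
		backoffDuration := time.Duration(handler.failureCount*handler.failureCount*100) * time.Millisecond
		if backoffDuration > time.Second {
			backoffDuration = time.Second
		}
		handler.failureCount++
		time.Sleep(backoffDuration)
	}
}
```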

@AlliBalliBaba (Contributor, Author)

Maybe there's merit in revisiting the failure logic at some point. This branch only makes the failure testable instead of panicking, so I'll merge it into #1955 for now.

@AlliBalliBaba merged commit a36547b into refator/cleanup-c on Nov 13, 2025
@AlliBalliBaba deleted the refactor/remove-exponential-backoff branch on November 13, 2025 at 22:38