Enable graceful HTTP shutdown and document default behavior #45833
95775e2 to
9de4232
Compare
ahus1
left a comment
Thank you for this PR, looks great! Please see below for a note about using Quarkus properties in the docs.
docs/guides/server/reverseproxy.adoc
Outdated
[source,properties]
----
quarkus.shutdown.delay-enabled=true
quarkus.shutdown.delay=1s
quarkus.shutdown.timeout=1s
----
@ruchikajha95 / @ryanemerson / @vmuzikar - some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.
So I think we have two routes here:
- We think it would never need to be changed by a user -> then we remove it from the docs
- We think this is something that is supported to be changed -> then we change it to an SPI option or a CLI option. CLI option is something that we decided against in our team, so this would leave the SPI.
Please comment and share your thoughts!
@vmuzikar and I discussed the use of an SPI option on Slack, and the conclusion was that it isn't really applicable here, as we're not actually configuring a Keycloak SPI.
> some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.
As a third route, can we be pragmatic and reconsider this on a case-by-case basis?
> as we're not actually configuring a Keycloak SPI
That hasn't stopped us in other locations from using the SPI configuration mechanism for other things, for example allowed-system-variables.
> As a third route, can we be pragmatic and reconsider this on a case-by-case basis?
We're probably going to be in a similar situation with all the Quarkus ORM properties: having all of those as first-class / CLI options seems like a burden. If there's something more pragmatic that we can agree on, we'll likely reuse that approach.
I'm mostly fine with the idea of differentiating between supported (quarkus.shutdown.*, quarkus.hibernate-orm.*) and unsupported Quarkus options, and then allowing those options to be present in quarkus.properties or properly utilized from the ENV (relates to the comment below about source ordinality).
We can always come back later and add proper keycloak options as needed.
The discussion I remember revolved around community members and customers trying to achieve things with Quarkus parameters that we considered unsupported. It was difficult for them to figure out what was supported and what was not.
In the beginning we also thought of adding a Keycloak CLI/SPI option as an abstraction over what Quarkus is doing, to adjust the behavior as needed and smooth migrations.
I think that is owned by the CND team, so if you want to decide this differently, then OK. Still, I think the situation hasn't changed: people will still be confused about what is allowed and what is not.
Any valid configuration that users of Keycloak should be able to do should be proper Keycloak options, and documented.
That applies to this PR and ORM settings, and anything else.
The only mention we should have on Quarkus options is in https://www.keycloak.org/server/configuration#_format_for_raw_quarkus_properties.
-1 for an SPI option.
SPI options are meant for configuring providers (first- or third-party). We're not configuring an SPI here, but rather Quarkus behavior. From the support perspective, it does not matter whether it's a full-blown CLI option or an SPI option: as long as it's documented, it's supported (unless explicitly stated otherwise).
The question here should be whether this should be just an escape hatch for dev/debugging purposes, or something documented and supported. If the former, we can stick to Quarkus options, but we must not document them. If the latter, I'd vote for CLI options.
@ahus1, you had some examples where users might want to tweak the delay/timeout. Can you share these here with the wider group so that we can try to determine how niche this config actually is?
One approach could be to set the default quarkus property values in this PR and then wait for user feedback to see if this tweaking is actually required by (m)any users before deciding whether to promote these to CLI options or not.
I'd say the 1 second pre-shutdown and 1 second shutdown phases are the maximum delays we could add and still call it a non-breaking change.
When we run our test suite, we want to set the pre-shutdown phase to 0 seconds so that it doesn't slow the suite down.
I'd say users would want to set longer periods depending on their proxy configuration (edge/reencrypt vs. passthrough) and how the proxy gets the information about the shutdown (at the same time as the Pod, well before the Pod, or by polling the Pod).
The longer version:
The 1 second/1 second config might be the right thing when you run in Kubernetes where the load balancer is reconfigured at the same time as the Pod gets the termination signal: it takes about a second (as @slaskawi described) for the load balancer to finish reconfiguration. If the reverse proxy is edge or reencrypt, subsequent requests are routed to the remaining Pods. So the shutdown period is sufficient for the running requests to finish, and KC might finish early when there are no more running requests.
Regular login requests should rarely take longer than a second. An admin who has longer-running requests, for example on the Admin API, might play it safe and set the shutdown period to 10 seconds.
In a proxy setup with TLS passthrough, the connection between the client and the proxy is still established after the proxy reconfiguration, and requests are still sent to the Pod that is about to be terminated. It would be good to wait for the connections to drain from the Pod: with the current Quarkus 3/Vert.x 4 setup, the client only receives an HTTP/1.1 connection close or an HTTP/2 GOAWAY when it sends the next request. So the longer the shutdown period, the higher the probability that the client sends a request and the connection is closed as a result. The longer a connection has been idle, the better the chance that it can simply be closed on shutdown with no request incoming, or that the client closes it voluntarily. In such a setup, I'd say 10 seconds would be OK, and 20-30 seconds would be good. There is still a slight probability of losing some requests. It will be better with Quarkus 4, Vert.x 5 and HTTP/2, as that will send a GOAWAY to those clients out-of-order when pre-shutdown starts, which is preferred (but unfortunately not supported in earlier versions).
In setups where the deployment procedure instructs the proxy ahead of killing the Pod, there is no need to have a pre-shutdown delay. It's quite the opposite: the admin would like to see a pre-shutdown period of 0.
In setups where the proxy determines which Pods should be in the load balancing by polling the readiness probe, it takes the load balancer 1-2 poll cycles to recognize that it should remove the Pod. Assuming a poll interval of 5 seconds and two poll cycles, the pre-shutdown period needs to be 10 seconds.
In the worst case you would have a proxy that polls the status from the Keycloak Pods, and TLS passthrough, and then those two delays add up. Given the examples above, you would have a 10 + 30 = 40 seconds shutdown period.
Looking at this wall of text, should we simplify this in a decision table?
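As a rough illustration of such a decision table, the scenarios above can be condensed into a small lookup. This is only a sketch in Java: the class, enum, and method names are hypothetical and not part of Keycloak, and the values are the ones suggested in this comment (1 second defaults, no pre-shutdown delay when the proxy is instructed ahead of time, two poll cycles when the proxy polls readiness, 30 seconds for TLS passthrough).

```java
import java.time.Duration;

public class ShutdownSizing {

    /** How the proxy learns that a Pod is going away (hypothetical names). */
    enum ProxyNotification { SAME_TIME_AS_POD, AHEAD_OF_POD, POLLS_READINESS }

    /** How TLS is handled at the proxy (hypothetical names). */
    enum TlsMode { EDGE_OR_REENCRYPT, PASSTHROUGH }

    /** A suggested pre-shutdown delay and shutdown timeout. */
    record Plan(Duration preShutdownDelay, Duration shutdownTimeout) {
        Duration total() {
            return preShutdownDelay.plus(shutdownTimeout);
        }
    }

    /** Maps a proxy setup to the periods suggested in the discussion. */
    static Plan plan(ProxyNotification notification, TlsMode tls, Duration pollInterval) {
        Duration delay = switch (notification) {
            case SAME_TIME_AS_POD -> Duration.ofSeconds(1);       // load balancer needs about a second
            case AHEAD_OF_POD -> Duration.ZERO;                   // proxy already knows, no delay needed
            case POLLS_READINESS -> pollInterval.multipliedBy(2); // assume two poll cycles
        };
        Duration timeout = switch (tls) {
            case EDGE_OR_REENCRYPT -> Duration.ofSeconds(1);      // only in-flight requests need to finish
            case PASSTHROUGH -> Duration.ofSeconds(30);           // give client connections time to drain
        };
        return new Plan(delay, timeout);
    }

    public static void main(String[] args) {
        // Worst case from the discussion: polling proxy (5 second interval) plus TLS passthrough.
        Plan worst = plan(ProxyNotification.POLLS_READINESS, TlsMode.PASSTHROUGH, Duration.ofSeconds(5));
        System.out.println(worst.total().toSeconds() + " seconds"); // prints "40 seconds"
    }
}
```

The numbers are starting points taken from the examples above, not measured recommendations; an admin with long-running Admin API requests would raise the timeout further.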
> Looking at this wall of text, should we simplify this in a decision table?
If we do add this config, at least we have half the docs written already 😄
Thanks for the detailed explanation @ahus1
My 2c is that this configuration still feels very niche. We're improving our defaults compared to prior behaviour, and things should improve again once we are able to upgrade to Quarkus 4.
It's easy to add many CLI options, but we know we can't easily remove them except in major releases. Adding more configuration toggles complicates our documentation and adds one more thing for users to consider, potentially causing confusion when in most cases they shouldn't need to care.
Unless we have a body of existing issues related to such proxy setups, I think we should hold off on adding the CLI options until we know this is a problem that affects users and causes a bad experience.
I'd say the only part that might be uncommon is "An admin who has longer-running requests, for example on the Admin API, might play it safe and set the shutdown period to 10 seconds."
All other scenarios of draining connections are very common, and are also present in our blueprints, as they use TLS passthrough. Our blueprint is actually over-simplified (or even invalid, depending on your perspective), as it doesn't log client IP addresses, see keycloak/keycloak-benchmark#910, which is usually solved by a different proxy configuration. We've been ignoring this for a while, and people might consider that a security auditing problem.
Once you look outside of Kubernetes, all those ways to configure proxy load distribution are equally common.
docs/guides/server/reverseproxy.adoc
Outdated
These values can be set in application.properties file.

NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
This is general Quarkus configuration behavior.
Suggested change:
- These values can be set in application.properties file.
- NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
- This is general Quarkus configuration behavior.
+ These values can be set in the quarkus.properties file.
+ NOTE: When Quarkus properties are defined directly in quarkus.properties, environment variables may not override them.
We should revisit whether we want to properly document that environment variables can be used for quarkus properties - certainly users are already doing that - and adjust the source ordinals accordingly.
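For reference, the environment variable names that users already rely on follow the usual MicroProfile Config mapping: every character that is not a letter, digit, or underscore becomes an underscore, and the result is upper-cased. The Java sketch below only illustrates that naming rule; the class and method names are hypothetical, and it says nothing about whether the environment variable wins over the properties file, which depends on the source ordinals discussed here.

```java
import java.util.Locale;

public class EnvVarNames {

    /** Derives the environment variable name for a property, per the MicroProfile Config naming rule. */
    static String toEnvVarName(String property) {
        StringBuilder sb = new StringBuilder(property.length());
        for (char c : property.toCharArray()) {
            boolean keep = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_';
            // Dots, dashes, and anything else outside [A-Za-z0-9_] turn into underscores.
            sb.append(keep ? c : '_');
        }
        return sb.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvVarName("quarkus.shutdown.delay"));   // prints "QUARKUS_SHUTDOWN_DELAY"
        System.out.println(toEnvVarName("quarkus.shutdown.timeout")); // prints "QUARKUS_SHUTDOWN_TIMEOUT"
    }
}
```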
ahus1
left a comment
Thank you for this pull request, see below for some changes needed to the docs.
docs/guides/server/reverseproxy.adoc
Outdated
[source,properties]
----
quarkus.shutdown.delay-enabled=true
@ruchikajha95 - regarding the things that people would reconfigure: for now I would assume that they change the delay and the timeout, but not delay-enabled. delay-enabled is also a build-time option, which would need additional explanation.
So I suggest removing "delay-enabled" here.
docs/guides/server/reverseproxy.adoc
Outdated
== Graceful HTTP shutdown

When running {project_name} behind a reverse proxy or load balancer , it is important to allow in-flight requests to complete during server shutdown.

{project_name} enables graceful HTTP shutdown by default using Quarkus runtime configuration.

=== Default behavior

By default {project_name} configures Quarkus with short pre-shutdown delay and a bounded shutdown timeout:
Thank you for this first set of docs. Please add the following information:
- Explain the concepts: There is a pre-shutdown and a shutdown period. Explain what they are from the perspective of an administrator of Keycloak, and how Keycloak behaves in each period on a reasonably high level addressing a Keycloak admin. When you do, include the new additional readiness probe that marks the service "down" already during the pre-shutdown period. You can align the wording with the upstream Quarkus docs, but we wouldn't link to those docs as they don't take the Keycloak perspective, and our docs should be self-contained. Actually don't even mention Quarkus, as a Keycloak admin should not care about the fact that Keycloak runs Quarkus underneath.
- Describe the default behavior in plain English, not by listing Quarkus properties.
- When you describe how to configure the behavior, state in which file people would need to add those properties
- Also list the matching environment variables that people can use - those would be named AFAIK QUARKUS_SHUTDOWN_DELAY and QUARKUS_SHUTDOWN_TIMEOUT. People on Kubernetes usually prefer environment variables
Please double-check with @ryanemerson how this can be configured via the Keycloak CR. I have the suspicion that Quarkus properties might not be available from the Keycloak CR.
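To make the two periods concrete, here is a simplified Java model of the timeline as described in this thread: after the termination signal the server enters pre-shutdown (readiness already reports "down"), then a shutdown period that ends when in-flight requests finish or the timeout expires, whichever comes first. All names are hypothetical, and the model assumes a fixed "time until the last in-flight request completes"; it is not how Keycloak or Quarkus implement the lifecycle.

```java
import java.time.Duration;

public class ShutdownPhases {

    enum Phase { RUNNING, PRE_SHUTDOWN, SHUTDOWN, STOPPED }

    /**
     * Returns the phase at a given time after the termination signal.
     *
     * @param elapsed       time since the termination signal (negative means before the signal)
     * @param delay         length of the pre-shutdown period
     * @param timeout       maximum length of the shutdown period
     * @param inFlightUntil time since the signal at which the last in-flight request completes
     */
    static Phase phaseAt(Duration elapsed, Duration delay, Duration timeout, Duration inFlightUntil) {
        if (elapsed.isNegative()) {
            return Phase.RUNNING;
        }
        if (elapsed.compareTo(delay) < 0) {
            return Phase.PRE_SHUTDOWN;
        }
        // The shutdown period cannot end before it starts, even if all requests finished earlier.
        Duration drained = inFlightUntil.compareTo(delay) < 0 ? delay : inFlightUntil;
        Duration hardStop = delay.plus(timeout);
        // Stop early once requests are drained, otherwise at the timeout.
        Duration stopAt = drained.compareTo(hardStop) < 0 ? drained : hardStop;
        return elapsed.compareTo(stopAt) < 0 ? Phase.SHUTDOWN : Phase.STOPPED;
    }

    /** The readiness probe reports "up" only while the server is fully running. */
    static boolean readinessUp(Phase phase) {
        return phase == Phase.RUNNING;
    }

    public static void main(String[] args) {
        Duration delay = Duration.ofSeconds(1);
        Duration timeout = Duration.ofSeconds(1);
        Duration inFlightUntil = Duration.ofMillis(1500);
        for (long ms : new long[] {-100, 500, 1200, 1500}) {
            Phase phase = phaseAt(Duration.ofMillis(ms), delay, timeout, inFlightUntil);
            System.out.println(ms + " ms: " + phase + ", ready=" + readinessUp(phase));
        }
    }
}
```

With the 1 second/1 second defaults and a request that finishes 1.5 seconds after the signal, the model stops half a second before the timeout, which matches the "might finish early" behavior mentioned earlier in the thread.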
stianst
left a comment
We should NOT document the use of Quarkus properties. They are not supported.
If we believe these are options worth documenting they should be turned into Keycloak properties.
Hi @ahus1, I have a Keycloak cluster running 3 replicas. I tried draining the connections for a graceful shutdown of the ongoing requests with the following parameters: even with those configured, the Keycloak replica currently draining (let's say Keycloak no. 1) continues processing jobs and requests even though traffic is no longer routed through its endpoint. How is that possible? Is it because the jobs and requests are assigned via the ispn cache? I also want to know whether this PR addresses this issue or whether it's something else. Thank you
@wcote-kz - please describe your setup so we have more context.
Note that job processing is out of scope for this; this is only about processing incoming HTTP requests. The current release of Quarkus is also not very good at draining connections.
@ahus1 Thank you for your reply, here's the setup I tested (3 replicas of keycloak in a k8s cluster):
QUARKUS_SHUTDOWN_DELAY_ENABLED: "true"
QUARKUS_SHUTDOWN_DELAY: "300" # 5 min delay
QUARKUS_SHUTDOWN_TIMEOUT: "600" # 10 min timeout
I observed that keycloak/quarkus is correctly following the above shutdown schedule when performing a statefulset restart (to emulate a rolling upgrade). So my conclusion is that:
My loadbalancer in k8s is traefik with tls termination and I have the same behavior with nginx. Thank you.
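The values "300" and "600" above work as seconds because, as far as the Quarkus configuration reference describes it, a bare number is read as seconds, a number followed by "ms" as milliseconds, and other values get "PT" prepended and are parsed as an ISO-8601 duration (so "1s" becomes "PT1S"). The Java sketch below mirrors that convention under this assumption; it is not the Quarkus converter itself, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.Locale;

public class ShutdownDurations {

    /** Parses a duration value following the convention assumed above (not the actual Quarkus code). */
    static Duration parse(String value) {
        String v = value.trim();
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty duration");
        }
        if (v.matches("\\d+")) {
            // Bare number: seconds.
            return Duration.ofSeconds(Long.parseLong(v));
        }
        if (v.matches("\\d+ms")) {
            // Number followed by "ms": milliseconds.
            return Duration.ofMillis(Long.parseLong(v.substring(0, v.length() - 2)));
        }
        if (Character.isDigit(v.charAt(0))) {
            // Shorthand such as "1s" or "5m": prepend "PT" to get an ISO-8601 duration.
            v = "PT" + v;
        }
        return Duration.parse(v.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(parse("300").toMinutes()); // prints 5
        System.out.println(parse("600").toMinutes()); // prints 10
        System.out.println(parse("1s").toMillis());   // prints 1000
    }
}
```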
@wcote-kz - the current Quarkus setup is not very good at connection draining:
I would have hoped that nginx and traefik would no longer route requests ... can you see that they reconfigure once the Pod is about to shut down?
@ahus1 Yes, I can see that the ingress controller (traefik or nginx) that points to the keycloak k8s svc is not routing traffic, since the endpoints list goes from 3 keycloak IPs:8080 to 2 keycloak IPs:8080 when one is reporting not ready and is in the shutdown period. So I'm confident that the networking part is correct. I just haven't figured out yet why, when that keycloak actually gets stopped, I get the issues mentioned in my other comment even though technically no traffic is going through it. That's why I was thinking there may be some sort of "job/request autobalancing" from keycloak to keycloak via the ispn cache, so that an online keycloak that receives a request could send it to a "stopping" keycloak in the cluster. With that said, I don't know whether this PR or the next Quarkus version addresses that behavior. Thank you for your reply
Add two more CLI options, as we require them for proxy configurations. We ruled out SPI options, as this is not about SPIs:
Add documentation stating that we are handling only HTTP at the moment, explaining how the options work, and noting that the functionality will change in the future.
+1 for those two.
...me/src/main/java/org/keycloak/quarkus/runtime/configuration/mappers/HttpPropertyMappers.java
Closes keycloak#45833 Signed-off-by: Ruchika <[email protected]>
Thanks @pruivo for the review. I have made the changes.
Closes keycloak#43589 Signed-off-by: Ruchika <[email protected]>
…r the PR review comments Closes keycloak#43589 Signed-off-by: Ruchika <[email protected]>
Closes keycloak#45833 Signed-off-by: Ruchika <[email protected]>
7a93ae5 to
6ea36b1
Compare
...me/src/main/java/org/keycloak/quarkus/runtime/configuration/mappers/HttpPropertyMappers.java
Signed-off-by: Alexander Schwartz <[email protected]>
@ruchikajha95, the failed test keycloak/quarkus/tests/integration/src/test/java/org/keycloak/it/cli/dist/HealthDistTest.java Line 86 in a2c1055
@ruchikajha95, @pruivo - I'll push a change in a minute, I was just now reviewing it.
Signed-off-by: Alexander Schwartz <[email protected]>
@ruchikajha95 / @pruivo - Thank you for the updated PR, it looks good to me. I've updated the docs around the feature, see 89a0311:
Please review my latest change in case I mixed something up. If all is good, this should be ready for merging.
Signed-off-by: Alexander Schwartz <[email protected]>
Signed-off-by: Alexander Schwartz <[email protected]>
@ruchikajha95 - can you please add a test for the newly added CLI parameters to
@ahus1, what tests do you have in mind?
Sorry for not being specific. I'd like to see a test that checks whether the CLI options are accepted or return errors. Additional tests around the shutdown functionality are probably out of scope and difficult to test. Unless you have ideas for that...
Closes keycloak#45381 Signed-off-by: Ruchika <[email protected]>
Signed-off-by: Alexander Schwartz <[email protected]>
ahus1
left a comment
Thank you for this change, @ruchikajha95, and everyone who helped reviewing and contributed!
Closes #43589