Conversation

@ruchikajha95
Contributor

Closes #43589

@ruchikajha95 ruchikajha95 force-pushed the feature-43589/HTTP_graceful_shutdown branch from 95775e2 to 9de4232 on January 28, 2026 14:41
Member

@ahus1 ahus1 left a comment


Thank you for this PR - it looks great! Please see below for a note about using Quarkus properties in the docs.

Comment on lines 292 to 297
[source,properties]
----
quarkus.shutdown.delay-enabled=true
quarkus.shutdown.delay=1s
quarkus.shutdown.timeout=1s
----
Member

@ruchikajha95 / @ryanemerson / @vmuzikar - some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.

So I think we have two routes here:

  • We think it would never need to be changed by a user -> then we remove it from the docs
  • We think this is something that users are supported to change -> then we change it to an SPI option or a CLI option. A CLI option is something that we decided against in our team, so this would leave the SPI.

Please comment and share your thoughts!

Contributor

@vmuzikar and I discussed the use of an SPI option on Slack, and the conclusion was that it wasn't really applicable here, as we're not actually configuring a Keycloak SPI.

some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.

As a third route, can we be pragmatic and reconsider this on a case-by-case basis?

Contributor

as we're not actually configuring a Keycloak SPI

That hasn't stopped us in other locations from using the SPI configuration mechanism for other things - for example allowed-system-variables.

As a third route, can we be pragmatic and reconsider this on a case-by-case basis?

We're probably going to be in a similar situation with all the Quarkus ORM properties - having all of those as first-class / CLI options seems like a burden. If there's something we can agree on that's more pragmatic, we'll likely reuse that approach.

I'm mostly fine with the idea of differentiating between supported (quarkus.shutdown.*, quarkus.hibernate-orm.*) and unsupported Quarkus options - then allow those options to be present in either quarkus.properties or properly utilized from the ENV (relates to the comment below about source ordinality).

We can always come back later and add proper keycloak options as needed.

Member

The discussion I remembered was revolving around community members and customers trying to achieve things with Quarkus parameters that we considered unsupported. And it was difficult for them to figure out what was supported and what was not.

In the beginning we also thought of adding a Keycloak CLI/SPI option as an abstraction over what Quarkus is doing, to adjust the behavior as needed and smooth migrations.

I think that is owned by the CND team, so if you want to decide this differently, then OK. Still, I think the situation hasn't changed - people will still be confused about what is allowed and what is not.

Contributor

@stianst stianst Jan 29, 2026


Any valid configuration that users of Keycloak should be able to do should be proper Keycloak options, and documented.

That applies to this PR and ORM settings, and anything else.

The only mention we should have on Quarkus options is in https://www.keycloak.org/server/configuration#_format_for_raw_quarkus_properties.

Contributor

-1 for an SPI option.

SPI options are meant for configuring providers (first- or third-party). We're not configuring an SPI here but rather Quarkus behavior. From the support perspective, it does not matter whether it's a full-blown CLI option or an SPI option – as long as it's documented, it's supported (unless explicitly told otherwise).

The question here should be whether this is just an escape hatch for dev/debugging purposes, or something documented and supported. If the first, we can stick to Quarkus options, but we must not document them. If the latter, I'd vote for CLI options.

Contributor

@ahus1 had some examples where users might want to tweak the delay/timeout - can you share these here with the wider group so that we can try to determine how niche this config actually is?

One approach could be to set the default Quarkus property values in this PR and then wait for user feedback to see if this tweaking is actually required by (m)any users before deciding whether to promote these to CLI options or not.

Member

I'd say the 1 second pre-shutdown and 1 second shutdown phase are the maximum delays we could add while still calling it a non-breaking change.

When we run our test suite, we want to set the pre-shutdown phase to 0 seconds to not slow down our test suite.

I'd say users would want to set it up with longer periods depending on their proxy configuration (edge/reencrypt vs. passthrough) and how the proxy gets the information about the shutdown (at the same time as the Pod, well before the Pod, or by polling it from the Pod).

The longer version:

The 1 second/1 second config might be the right thing when you run in Kubernetes, where the load balancer is reconfigured at the same time as the Pod gets the termination signal: it takes about a second (as @slaskawi described) for the load balancer to finish reconfiguration. If the reverse proxy is edge or reencrypt, the next requests are routed to the remaining Pods. So a shutdown period is sufficient for the running requests to finish, and KC might finish early when there are no more running requests.

Regular login requests should rarely take longer than a second. An admin where there are different longer-running requests, for example on the Admin API, might play it safe and want to set the shutdown period to 10 seconds.

In a proxy setup with TLS passthrough, the connection between the client and the proxy is still established after the proxy reconfiguration, and requests are still sent to the to-be-terminated Pod. It would be good to wait for the connections to drain from the Pod: with the current Quarkus 3/Vert.x 4 setup, the client only receives an HTTP/1.1 connection close or an HTTP/2 GOAWAY when it sends the next request. So the longer the shutdown period, the higher the probability of the client sending a request and therefore closing the connection. The longer a connection has been idle, the better the chance that it is simply closed on shutdown with no request currently incoming, or that the client closes it voluntarily. In such a setup, I'd say 10 seconds would be OK, and 20-30 seconds would be good. There is still a slight probability of losing some requests. It will be better with Quarkus 4, Vert.x 5, and HTTP/2, as that will send a GOAWAY to those clients out-of-order when pre-shutdown starts, which is preferred (but unfortunately not supported in earlier versions).

In setups where the deployment procedure instructs the proxy ahead of killing the Pod, there is no need to have a pre-shutdown delay. It's quite the opposite: the admin would like to see a pre-shutdown period of 0.

In setups where the proxy decides which Pods should be in the load balancing by polling the readiness probe, it takes 1-2 poll cycles for the load balancer to recognize that the Pod should be removed. Assuming a poll interval of 5 seconds and two poll cycles, the pre-shutdown period needs to be 10 seconds.

In the worst case, you would have a proxy that polls the status from the Keycloak Pods plus TLS passthrough, and then those two delays add up. Given the examples above, you would have a 10 + 30 = 40 second shutdown period.
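As a sketch, the worst-case combination above would translate into something like the following (illustrative values, not defaults; note the later discussion in this PR about whether the passthrough draining belongs in the delay or the timeout):

    # proxy polls readiness every 5 seconds -> cover the poll cycles
    quarkus.shutdown.delay=10s
    # TLS passthrough -> give idle connections time to drain
    quarkus.shutdown.timeout=30s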

Looking at this wall of text, should we simplify this in a decision table?

Contributor

@ryanemerson ryanemerson Feb 3, 2026


Looking at this wall of text, should we simplify this in a decision table?

If we do add this config, at least we have half the docs written already 😄

Thanks for the detailed explanation @ahus1

My 2c is that this configuration still feels very niche; we're improving our defaults compared to prior behaviour, and things should improve again once we are able to upgrade to Quarkus 4.

It's easy to add many CLI options, but we know we can't easily remove said options except in major releases. Adding additional configuration toggles complicates our documentation and adds one more thing for users to consider, potentially causing confusion, when in most cases they shouldn't care.

Unless we have a body of existing issues related to such proxy setups, I think we should hold off adding the CLI options until we know this is a problem affecting users (and causing a bad experience).

Member

@ahus1 ahus1 Feb 3, 2026


I'd say the only part that might be non-common is "An admin where there are different longer-running requests, for example on the Admin API, might play it safe and want to set the shutdown period to 10 seconds."

All other scenarios of draining connections are very common, and also present in our blueprints, as they use TLS passthrough. And our blueprint is actually over-simplified (or even invalid, depending on your perspective) as it doesn't log client IP addresses, see keycloak/keycloak-benchmark#910, which is usually solved by a different proxy configuration. We've been ignoring this for a while, and people might consider that a security auditing problem.

Once you look outside of Kubernetes, all those ways to configure proxy load distribution are equally common.

Comment on lines 309 to 312
These values can be set in application.properties file.

NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
This is general Quarkus configuration behavior.
Contributor

Suggested change
These values can be set in application.properties file.
NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
This is general Quarkus configuration behavior.
These values can be set in the quarkus.properties file.
NOTE: When Quarkus properties are defined directly in quarkus.properties, environment variables may not override them.

We should revisit whether we want to properly document that environment variables can be used for quarkus properties - certainly users are already doing that - and adjust the source ordinals accordingly.
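For reference, a minimal sketch of where such raw Quarkus properties would live, assuming the conf/quarkus.properties mechanism from the server configuration guide (values are the defaults proposed in this PR):

    # conf/quarkus.properties
    quarkus.shutdown.delay=1s
    quarkus.shutdown.timeout=1s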

ahus1 previously requested changes Jan 29, 2026
Member

@ahus1 ahus1 left a comment


Thank you for this pull request; see below for some changes needed to the docs.


[source,properties]
----
quarkus.shutdown.delay-enabled=true
Member

@ruchikajha95 - regarding the things that people would reconfigure: for now I would expect them to change the delay and the timeout, but not delay-enabled. delay-enabled is also a build-time option, which would need additional explanation.

So I suggest removing "delay-enabled" here.
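With that removed, the quoted snippet would reduce to the two runtime options (values as in the PR):

[source,properties]
----
quarkus.shutdown.delay=1s
quarkus.shutdown.timeout=1s
----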

Comment on lines 282 to 290
== Graceful HTTP shutdown

When running {project_name} behind a reverse proxy or load balancer, it is important to allow in-flight requests to complete during server shutdown.

{project_name} enables graceful HTTP shutdown by default using Quarkus runtime configuration.

=== Default behavior

By default, {project_name} configures Quarkus with a short pre-shutdown delay and a bounded shutdown timeout:
Member

Thank you for this first set of docs. Please add the following information:

  • Explain the concepts: There is a pre-shutdown and a shutdown period. Explain what they are from the perspective of an administrator of Keycloak, and how Keycloak behaves in each period on a reasonably high level addressing a Keycloak admin. When you do, include the new additional readiness probe that marks the service "down" already during the pre-shutdown period. You can align the wording with the upstream Quarkus docs, but we wouldn't link to those docs as they don't take the Keycloak perspective, and our docs should be self-contained. Actually don't even mention Quarkus, as a Keycloak admin should not care about the fact that Keycloak runs Quarkus underneath.
  • Describe the default behavior in plain English, not by listing Quarkus properties.
  • When you describe how to configure the behavior, state in which file people would need to add those properties
  • Also list the matching environment variables that people can use - those would be named AFAIK QUARKUS_SHUTDOWN_DELAY and QUARKUS_SHUTDOWN_TIMEOUT. People on Kubernetes usually prefer environment variables
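A minimal sketch of that environment-variable form in a Kubernetes Pod spec (variable names as per the AFAIK above; values are illustrative):

    env:
      - name: QUARKUS_SHUTDOWN_DELAY
        value: "10s"
      - name: QUARKUS_SHUTDOWN_TIMEOUT
        value: "30s"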

Please double-check with @ryanemerson how this can be configured via the Keycloak CR. I have the suspicion that Quarkus properties might not be available from the Keycloak CR.

Contributor

@stianst stianst left a comment


We should NOT document the use of Quarkus properties. They are not supported.

If we believe these are options worth documenting they should be turned into Keycloak properties.

@wcote-kz

wcote-kz commented Feb 3, 2026

Hi @ahus1, I have a Keycloak cluster running 3 replicas. I tried draining the connections for a graceful shutdown of the ongoing requests with the following parameters:

quarkus.shutdown.delay-enabled=true
quarkus.shutdown.delay=
quarkus.shutdown.timeout=

even with those configured, the Keycloak replica currently draining (let's say Keycloak no. 1) continues processing jobs and requests even with traffic not routed through its endpoint. How is that possible? Is it because the jobs and requests are assigned via the ispn cache? I also want to know if this PR is addressing this issue or if it's something else.

Thank you

@ahus1
Member

ahus1 commented Feb 4, 2026

@wcote-kz - please describe your setup so we have more context.

  • What values did you pass in those parameters?
  • What is the order of events when you attempted a graceful shutdown? Did you first reconfigure the load balancer, or did you first trigger a shutdown in Keycloak?
  • What kind of proxy are you using? Does it do TLS termination, or TLS passthrough?

Note that job processing is out of scope for this; this is only about processing incoming HTTP requests. The current release of Quarkus is also not very good at draining connections.

@wcote-kz

wcote-kz commented Feb 4, 2026

@ahus1 Thank you for your reply. Here's the setup I tested (3 replicas of Keycloak in a k8s cluster):

    QUARKUS_SHUTDOWN_DELAY_ENABLED: "true"
    QUARKUS_SHUTDOWN_DELAY: "300" # 5 min delay
    QUARKUS_SHUTDOWN_TIMEOUT: "600" # 10 min timeout

I observed that Keycloak/Quarkus is correctly following the above shutdown schedule when performing a StatefulSet restart (to emulate a rolling upgrade).
During the first 5 min, I can see that the first Keycloak is writing in the logs that it is initiating the shutdown and is reporting "not ready" to the k8s load balancer. At this point, the first Keycloak is removed from the endpoints list of the service and shouldn't receive any new HTTP requests.
Here I expect the first Keycloak to shut down after the delay without any issue, since I feel 5 min is more than enough to drain, but the whole time I can see what looks like work/jobs being done in the logs of the first Keycloak. When the first Keycloak is shutting down after the delay, I get some errors from running API calls (that I run in a loop during my test), and in the admin console I get the "you need to refresh the page" error banner.

So my conclusion is that:

  • connections are still reaching the first Keycloak somehow, even though the load balancer won't route new traffic to it.

My load balancer in k8s is traefik with TLS termination, and I have the same behavior with nginx.

Thank you.

@ahus1
Member

ahus1 commented Feb 6, 2026

@wcote-kz - the current Quarkus setup is not very good at connection draining:

  • The current version doesn't yet send an HTTP/2 GOAWAY or an HTTP/1.1 connection close, so any client with an HTTP connection pool will probably not drain its connections unless it closes them voluntarily. This will change with the next release. But as you are using TLS termination, I am surprised that both nginx and traefik continue to route requests to the node that is shutting down.
  • The graceful shutdown doesn't affect jobs running in the background for now, so that's expected.

I would have hoped that nginx and traefik would no longer route requests ... can you see that they reconfigure once the Pod is about to shut down?

@wcote-kz

wcote-kz commented Feb 6, 2026

@ahus1 Yes, I can see that the ingress controller (traefik or nginx) that points to the Keycloak k8s svc is not routing traffic, since I can see the endpoints list go from 3 Keycloak IPs:8080 to 2 Keycloak IPs:8080 when one is reporting not ready and is in the shutdown period. So I'm confident that the networking part is correct.

I just haven't figured out yet why, when the Keycloak instance actually gets stopped, I get the issues mentioned in my other comment even if technically no traffic is going through it.

That's why I was thinking maybe there's some sort of "job/request autobalancing" from Keycloak to Keycloak via the ispn cache, done after an online Keycloak receives a request, that could send it to a "stopping" Keycloak in the cluster.

With that said, I don't know whether this PR or the next Quarkus version addresses that behavior.

Thank you for your reply

@ahus1
Member

ahus1 commented Feb 9, 2026

Add two more CLI options, as we require them for proxy configurations. We ruled out SPI options, as this is not about SPIs:

--shutdown-delay: ...
--shutdown-timeout: ... 

Add documentation stating that we are handling HTTP at the moment, explaining how the options work, and noting that the functionality will change in the future.
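A hypothetical invocation with those two options (the duration values are illustrative, not defaults):

    bin/kc.sh start --shutdown-delay=10s --shutdown-timeout=30s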

@vmuzikar
Contributor

Add two more CLI options

+1 for those two.

ruchikajha95 added a commit to ruchikajha95/keycloak that referenced this pull request Feb 12, 2026
@ruchikajha95
Contributor Author

Thanks @pruivo for the review. I have made the changes.

@ruchikajha95 ruchikajha95 force-pushed the feature-43589/HTTP_graceful_shutdown branch from 7a93ae5 to 6ea36b1 on February 12, 2026 12:44
@ruchikajha95 ruchikajha95 requested review from ahus1 and pruivo February 12, 2026 12:49
pruivo previously requested changes Feb 12, 2026
Signed-off-by: Alexander Schwartz <[email protected]>
@pruivo
Member

pruivo commented Feb 12, 2026

@ruchikajha95, the failed test HealthDistTest needs to be updated. The check count is 3.

$ curl --insecure https://keycloak:9000/health/ready
{
    "status": "UP",
    "checks": [
        {
            "name": "Graceful Shutdown",
            "status": "UP"
        },
        {
            "name": "Keycloak cluster health check",
            "status": "UP"
        },
        {
            "name": "Keycloak database connections async health check",
            "status": "UP"
        }
    ]
}

@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95, @pruivo - I'll push a change in a minute; I was just now reviewing it.

Signed-off-by: Alexander Schwartz <[email protected]>
@ahus1 ahus1 dismissed stale reviews from pruivo, stianst, and themself February 12, 2026 16:38

outdated

@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95 / @pruivo - Thank you for the updated PR, it looks good to me.

I've updated the docs around the feature, see 89a0311:

  • The "Load balancer polls readiness probe" was not so specific as it IMHO should be: With 2 poll cycles of 5 seconds, you need to wait 3 cycles, as at the moment of shutdown, the previous one might have just finished. And then you add the time for the proxy to reconfigure.
  • The value of 10-30 seconds for TLS passthrough ended up in the Timeout column, while it should IMHO end up in the Delay column.
  • Due to that, the last combined scenario is off, so I updated it as well.
  • I cleared the cells that are not relevant to the respective example.
  • I've updated the example configurations to match the new values of the table above.

Please review my latest change in case I mixed something up. If all is good, this should be ready for merging.

Signed-off-by: Alexander Schwartz <[email protected]>
Signed-off-by: Alexander Schwartz <[email protected]>
@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95 - can you please add a test for the newly added CLI parameters to HttpDistTest.java? Given @pruivo's last comment, I now realize that I missed that in the earlier review. Thanks!

@ruchikajha95
Contributor Author

@ahus1 Thanks for the further review. I will add the test changes.
@pruivo Thanks for the further review.

@pruivo
Member

pruivo commented Feb 12, 2026

@ahus1, what tests do you have in mind?

@ahus1
Member

ahus1 commented Feb 12, 2026

what tests do you have in mind?

Sorry for not being specific. I'd like to see a test that the CLI options are accepted, or that they return errors when invalid. Additional tests around the shutdown functionality are probably out of scope and difficult to write. Unless you have ideas for that...
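For illustration, a manual version of such a check could look like this (hypothetical invocations and values; the actual assertions would live in HttpDistTest.java):

    # accepted: the server starts with both options set
    bin/kc.sh start-dev --shutdown-delay=2s --shutdown-timeout=5s

    # rejected: an invalid duration should fail with a clear parse error
    bin/kc.sh start-dev --shutdown-delay=not-a-duration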

@ahus1 ahus1 marked this pull request as ready for review February 13, 2026 13:28
@ahus1 ahus1 requested review from a team as code owners February 13, 2026 13:28
@ahus1 ahus1 enabled auto-merge (squash) February 13, 2026 13:29
Member

@ahus1 ahus1 left a comment


Thank you for this change, @ruchikajha95, and everyone who helped review and contribute!

@ahus1 ahus1 self-assigned this Feb 13, 2026