Conversation

@ruchikajha95
Contributor

Closes #43589

@ruchikajha95 ruchikajha95 force-pushed the feature-43589/HTTP_graceful_shutdown branch from 95775e2 to 9de4232 on January 28, 2026 14:41
Member

@ahus1 ahus1 left a comment


Thank you for this PR - it looks great! Please see below for a note about using Quarkus properties in the docs.

Comment on lines 292 to 297
[source,properties]
----
quarkus.shutdown.delay-enabled=true
quarkus.shutdown.delay=1s
quarkus.shutdown.timeout=1s
----
Member

@ruchikajha95 / @ryanemerson / @vmuzikar - some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.

So I think we have two routes here:

  • We think it would never need to be changed by a user -> then we remove it from the docs
  • We think this is something that users are supported to change -> then we change it to an SPI option or a CLI option. A CLI option is something that we decided against in our team, so this would leave the SPI.

Please comment and share your thoughts!

Contributor

@vmuzikar and I discussed the use of an SPI option on Slack, and the conclusion was that it wasn't really applicable here, as we're not actually configuring a Keycloak SPI.

some time ago we discussed that setting Quarkus options in Keycloak config is not supported, and therefore we restrained ourselves from naming any Quarkus properties in our docs.

As a third route, can we be pragmatic and reconsider this on a case-by-case basis?

Contributor

as we're not actually configuring a Keycloak SPI

That hasn't stopped us in other locations from using the SPI configuration mechanism for other things - for example allowed-system-variables.

As a third route, can we be pragmatic and reconsider this on a case-by-case basis?

We're probably going to be in a similar situation with all the Quarkus ORM properties - having all of those as first-class / CLI options seems like a burden. If there's something we can agree on that's more pragmatic, we'll likely reuse that approach.

I'm mostly fine with the idea of differentiating between supported (quarkus.shutdown.*, quarkus.hibernate-orm.*) and unsupported Quarkus options - then allow those options to be present in either quarkus.properties or properly utilized from the ENV (relates to the comment below about source ordinality).

We can always come back later and add proper keycloak options as needed.

Member

The discussion I remembered was revolving around community members and customers trying to achieve things with Quarkus parameters that we considered unsupported. And it was difficult for them to figure out what was supported and what was not.

In the beginning we also thought of adding a Keycloak CLI/SPI option as an abstraction over what Quarkus is doing, to adjust the behavior as needed and smooth migrations.

I think that is owned by the CND team, so if you want to decide this differently, then OK. Still, I think the situation hasn't changed - people will still be confused about what is allowed and what is not.

Contributor

@stianst stianst Jan 29, 2026


Any valid configuration that users of Keycloak should be able to do should be proper Keycloak options, and documented.

That applies to this PR and ORM settings, and anything else.

The only mention we should have on Quarkus options is in https://www.keycloak.org/server/configuration#_format_for_raw_quarkus_properties.

Contributor

-1 for an SPI option.

SPI options are meant for configuring providers (first- or third-party). We're not configuring an SPI here but rather Quarkus behavior. From the support perspective, it does not matter whether it's a full-blown CLI option or an SPI option – as long as it's documented, it's supported (unless explicitly told otherwise).

The question here should be whether this is just an escape hatch for dev/debugging purposes, or something documented and supported. If the first, we can stick to Quarkus options, but we must not document them. If the latter, I'd vote for CLI options.

Contributor

@ahus1 had some examples where users might want to tweak the delay/timeout - can you share these here with the wider group so that we can try to determine how niche this config actually is?

One approach could be to set the default Quarkus property values in this PR and then wait for user feedback to see if this tweaking is actually required by (m)any users before deciding whether to promote these to CLI options or not.

Member

I'd say the 1 second pre-shutdown and 1 second shutdown phase are the maximum delays we could add while still calling it a non-breaking change.

When we run our test suite, we want to set the pre-shutdown phase to 0 seconds to not slow down our test suite.

I'd say users would want to set it up with longer periods depending on their proxy configuration (edge/reencrypt vs. passthrough) and how the proxy gets the information about the shutdown (at the same time as the Pod, well before the Pod, or by polling it from the Pod).

The longer version:

The 1 second/1 second config might be the right thing when you run in Kubernetes, where the load balancer is reconfigured at the same time as the Pod gets the termination signal: it takes about a second (as @slaskawi described) for the load balancer to finish reconfiguration. If the reverse proxy is edge or reencrypt, the next requests are routed to the remaining Pods. So a shutdown period is sufficient for the running requests to finish, and KC might finish early when there are no more running requests.

Regular login requests should rarely take longer than a second. An admin where there are different longer-running requests, for example on the Admin API, might play it safe and want to set the shutdown period to 10 seconds.

In a proxy setup with TLS passthrough, the connection between the client and the proxy is still established after the proxy reconfiguration, and requests are still sent to the to-be-terminated Pod. It would be good to wait for the connections to drain from the Pod: with the current Quarkus 3/Vert.x 4 setup, the client only receives an HTTP/1.1 connection close or an HTTP/2 GOAWAY when it sends the next request. So the longer the shutdown period, the higher the probability of the client sending a request and therefore closing the connection. The longer a connection has been idle, the better the chance that it is simply closed on shutdown with no request currently incoming, or that the client closes it voluntarily. In such a setup, I'd say 10 seconds would be OK, and 20-30 seconds would be good. There is still a slight probability of losing some requests. It will be better with Quarkus 4, Vert.x 5, and HTTP/2, as that will send a GOAWAY to those clients out-of-order when pre-shutdown starts, which is preferred (but unfortunately not supported in earlier versions).

In setups where the deployment procedure instructs the proxy ahead of killing the Pod, there is no need to have a pre-shutdown delay. It's quite the opposite: the admin would like to see a pre-shutdown period of 0.

In setups where the proxy decides which Pods should be in the load balancing by polling the readiness probe, it takes 1-2 poll cycles for the load balancer to recognize that the Pod should be removed. Assuming a poll interval of 5 seconds and two poll cycles, the pre-shutdown period needs to be 10 seconds.

In the worst case, you would have a proxy that polls the status from the Keycloak Pods plus TLS passthrough, and then those two delays add up. Given the examples above, you would have a 10 + 30 = 40 second shutdown period.
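As a sketch, the worst-case combination above would translate into something like the following (illustrative values, not defaults; note the later discussion in this PR about whether the passthrough draining belongs in the delay or the timeout):

    # proxy polls readiness every 5 seconds -> cover the poll cycles
    quarkus.shutdown.delay=10s
    # TLS passthrough -> give idle connections time to drain
    quarkus.shutdown.timeout=30s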

Looking at this wall of text, should we simplify this in a decision table?

Contributor

@ryanemerson ryanemerson Feb 3, 2026


Looking at this wall of text, should we simplify this in a decision table?

If we do add this config, at least we have half the docs written already 😄

Thanks for the detailed explanation @ahus1

My 2c is that this configuration still feels very niche; we're improving our defaults compared to prior behaviour, and things should improve again once we are able to upgrade to Quarkus 4.

It's easy to add many CLI options, but we know we can't easily remove said options except in major releases. Adding additional configuration toggles complicates our documentation and adds one more thing for users to consider, potentially causing confusion, when in most cases they shouldn't care.

Unless we have a body of existing issues related to such proxy setups, I think we should hold off adding the CLI options until we know this is a problem affecting users (and causing a bad experience).

Member

@ahus1 ahus1 Feb 3, 2026


I'd say the only part that might be non-common is "An admin where there are different longer-running requests, for example on the Admin API, might play it safe and want to set the shutdown period to 10 seconds."

All other scenarios of draining connections are very common, and also present in our blueprints, as they use TLS passthrough. And our blueprint is actually over-simplified (or even invalid, depending on your perspective) as it doesn't log client IP addresses, see keycloak/keycloak-benchmark#910, which is usually solved by a different proxy configuration. We've been ignoring this for a while, and people might consider that a security auditing problem.

Once you look outside of Kubernetes, all those ways to configure proxy load distribution are equally common.

Comment on lines 309 to 312
These values can be set in application.properties file.

NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
This is general Quarkus configuration behavior.
Contributor

Suggested change
These values can be set in application.properties file.
NOTE: When Quarkus properties are defined directly in application.properties, environment variables may not override them.
This is general Quarkus configuration behavior.
These values can be set in the quarkus.properties file.
NOTE: When Quarkus properties are defined directly in quarkus.properties, environment variables may not override them.

We should revisit whether we want to properly document that environment variables can be used for quarkus properties - certainly users are already doing that - and adjust the source ordinals accordingly.
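For reference, a minimal sketch of where such raw Quarkus properties would live, assuming the conf/quarkus.properties mechanism from the server configuration guide (values are the defaults proposed in this PR):

    # conf/quarkus.properties
    quarkus.shutdown.delay=1s
    quarkus.shutdown.timeout=1s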

ahus1 previously requested changes Jan 29, 2026
Member

@ahus1 ahus1 left a comment


Thank you for this pull request; see below for some changes needed to the docs.


[source,properties]
----
quarkus.shutdown.delay-enabled=true
Member

@ruchikajha95 - regarding the things that people would reconfigure: for now I would expect them to change the delay and the timeout, but not delay-enabled. delay-enabled is also a build-time option, which would need additional explanation.

So I suggest removing "delay-enabled" here.
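With that removed, the quoted snippet would reduce to the two runtime options (values as in the PR):

[source,properties]
----
quarkus.shutdown.delay=1s
quarkus.shutdown.timeout=1s
----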

Comment on lines 282 to 290
== Graceful HTTP shutdown

When running {project_name} behind a reverse proxy or load balancer, it is important to allow in-flight requests to complete during server shutdown.

{project_name} enables graceful HTTP shutdown by default using Quarkus runtime configuration.

=== Default behavior

By default, {project_name} configures Quarkus with a short pre-shutdown delay and a bounded shutdown timeout:
Member

Thank you for this first set of docs. Please add the following information:

  • Explain the concepts: There is a pre-shutdown and a shutdown period. Explain what they are from the perspective of an administrator of Keycloak, and how Keycloak behaves in each period on a reasonably high level addressing a Keycloak admin. When you do, include the new additional readiness probe that marks the service "down" already during the pre-shutdown period. You can align the wording with the upstream Quarkus docs, but we wouldn't link to those docs as they don't take the Keycloak perspective, and our docs should be self-contained. Actually don't even mention Quarkus, as a Keycloak admin should not care about the fact that Keycloak runs Quarkus underneath.
  • Describe the default behavior in plain English, not by listing Quarkus properties.
  • When you describe how to configure the behavior, state in which file people would need to add those properties
  • Also list the matching environment variables that people can use - those would be named AFAIK QUARKUS_SHUTDOWN_DELAY and QUARKUS_SHUTDOWN_TIMEOUT. People on Kubernetes usually prefer environment variables
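A minimal sketch of that environment-variable form in a Kubernetes Pod spec (variable names as per the AFAIK above; values are illustrative):

    env:
      - name: QUARKUS_SHUTDOWN_DELAY
        value: "10s"
      - name: QUARKUS_SHUTDOWN_TIMEOUT
        value: "30s"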

Please double-check with @ryanemerson how this can be configured via the Keycloak CR. I have the suspicion that Quarkus properties might not be available from the Keycloak CR.

Contributor

@stianst stianst left a comment


We should NOT document the use of Quarkus properties. They are not supported.

If we believe these are options worth documenting they should be turned into Keycloak properties.

@wcote-kz

wcote-kz commented Feb 3, 2026

Hi @ahus1, I have a Keycloak cluster running 3 replicas. I tried draining the connections for a graceful shutdown of the ongoing requests with the following parameters:

quarkus.shutdown.delay-enabled=true
quarkus.shutdown.delay=
quarkus.shutdown.timeout=

even with those configured, the Keycloak replica currently draining (let's say Keycloak no. 1) continues processing jobs and requests even with traffic not routed through its endpoint. How is that possible? Is it because the jobs and requests are assigned via the ispn cache? I also want to know if this PR is addressing this issue or if it's something else.

Thank you

@ahus1
Member

ahus1 commented Feb 4, 2026

@wcote-kz - please describe your setup so we have more context.

  • What values did you pass in those parameters?
  • What is the order of events when you attempted a graceful shutdown? Did you first reconfigure the load balancer, or did you first trigger a shutdown in Keycloak?
  • What kind of proxy are you using? Does it do TLS termination, or TLS passthrough?

Note that job processing is out of scope for this; this is only about processing incoming HTTP requests. The current release of Quarkus is also not very good at draining connections.

@wcote-kz

wcote-kz commented Feb 4, 2026

@ahus1 Thank you for your reply. Here's the setup I tested (3 replicas of Keycloak in a k8s cluster):

    QUARKUS_SHUTDOWN_DELAY_ENABLED: "true"
    QUARKUS_SHUTDOWN_DELAY: "300" # 5 min delay
    QUARKUS_SHUTDOWN_TIMEOUT: "600" # 10 min timeout

I observed that Keycloak/Quarkus is correctly following the above shutdown schedule when performing a StatefulSet restart (to emulate a rolling upgrade).
During the first 5 min, I can see that the first Keycloak is writing in the logs that it is initiating the shutdown and is reporting "not ready" to the k8s load balancer. At this point, the first Keycloak is removed from the endpoints list of the service and shouldn't receive any new HTTP requests.
Here I expect the first Keycloak to shut down after the delay without any issue, since I feel 5 min is more than enough to drain, but the whole time I can see what looks like work/jobs being done in the logs of the first Keycloak. When the first Keycloak is shutting down after the delay, I get some errors from running API calls (that I run in a loop during my test), and in the admin console I get the "you need to refresh the page" error banner.

So my conclusion is that:

  • connections are still reaching the first Keycloak somehow, even though the load balancer won't route new traffic to it.

My load balancer in k8s is traefik with TLS termination, and I have the same behavior with nginx.

Thank you.

@ahus1
Member

ahus1 commented Feb 6, 2026

@wcote-kz - the current Quarkus setup is not very good at connection draining:

  • The current version doesn't yet send an HTTP/2 GOAWAY or an HTTP/1.1 connection close, so any client with an HTTP connection pool will probably not drain its connections unless it closes them voluntarily. This will change with the next release. But as you are using TLS termination, I am surprised that both nginx and traefik continue to route requests to the node that is shutting down.
  • The graceful shutdown doesn't affect jobs running in the background for now, so that's expected.

I would have hoped that nginx and traefik would no longer route requests ... can you see that they reconfigure once the Pod is about to shut down?

@wcote-kz

wcote-kz commented Feb 6, 2026

@ahus1 Yes, I can see that the ingress controller (traefik or nginx) that points to the Keycloak k8s svc is not routing traffic, since I can see the endpoints list go from 3 Keycloak IPs:8080 to 2 Keycloak IPs:8080 when one is reporting not ready and is in the shutdown period. So I'm confident that the networking part is correct.

I just haven't figured out yet why, when the Keycloak instance actually gets stopped, I get the issues mentioned in my other comment even if technically no traffic is going through it.

That's why I was thinking maybe there's some sort of "job/request autobalancing" from Keycloak to Keycloak via the ispn cache, done after an online Keycloak receives a request, that could send it to a "stopping" Keycloak in the cluster.

With that said, I don't know whether this PR or the next Quarkus version addresses that behavior.

Thank you for your reply

@ahus1
Member

ahus1 commented Feb 9, 2026

Add two more CLI options, as we require them for proxy configurations. We ruled out SPI options, as this is not about SPIs:

--shutdown-delay: ...
--shutdown-timeout: ... 

Add documentation stating that we are handling HTTP at the moment, explaining how the options work, and noting that the functionality will change in the future.
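A hypothetical invocation with those two options (the duration values are illustrative, not defaults):

    bin/kc.sh start --shutdown-delay=10s --shutdown-timeout=30s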

@vmuzikar
Contributor

Add two more CLI options

+1 for those two.

ruchikajha95 added a commit to ruchikajha95/keycloak that referenced this pull request Feb 12, 2026
@ruchikajha95
Contributor Author

Thanks @pruivo for the review. I have made the changes.

@ruchikajha95 ruchikajha95 force-pushed the feature-43589/HTTP_graceful_shutdown branch from 7a93ae5 to 6ea36b1 on February 12, 2026 12:44
@ruchikajha95 ruchikajha95 requested review from ahus1 and pruivo February 12, 2026 12:49
pruivo previously requested changes Feb 12, 2026
Signed-off-by: Alexander Schwartz <[email protected]>
@pruivo
Member

pruivo commented Feb 12, 2026

@ruchikajha95, the failed test HealthDistTest needs to be updated. The check count is 3.

$ curl --insecure https://keycloak:9000/health/ready
{
    "status": "UP",
    "checks": [
        {
            "name": "Graceful Shutdown",
            "status": "UP"
        },
        {
            "name": "Keycloak cluster health check",
            "status": "UP"
        },
        {
            "name": "Keycloak database connections async health check",
            "status": "UP"
        }
    ]
}

@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95, @pruivo - I'll push a change in a minute; I was just now reviewing it.

Signed-off-by: Alexander Schwartz <[email protected]>
@ahus1 ahus1 dismissed stale reviews from pruivo, stianst, and themself February 12, 2026 16:38

outdated

@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95 / @pruivo - Thank you for the updated PR, it looks good to me.

I've updated the docs around the feature, see 89a0311:

  • The "Load balancer polls readiness probe" was not so specific as it IMHO should be: With 2 poll cycles of 5 seconds, you need to wait 3 cycles, as at the moment of shutdown, the previous one might have just finished. And then you add the time for the proxy to reconfigure.
  • The value of 10-30 seconds for TLS passthrough ended up in the Timeout column, while it should IMHO end up in the Delay column.
  • Due to that, the last combined scenario is off, so I updated it as well.
  • I cleared the cells that are not relevant to the respective example.
  • I've updated the example configurations to match the new values of the table above.

Please review my latest change in case I mixed something up. If all is good, this should be ready for merging.

Signed-off-by: Alexander Schwartz <[email protected]>
Signed-off-by: Alexander Schwartz <[email protected]>
@ahus1
Member

ahus1 commented Feb 12, 2026

@ruchikajha95 - can you please add a test for the newly added CLI parameters to HttpDistTest.java? Given @pruivo's last comment, I now realize that I missed that in the earlier review. Thanks!

@ruchikajha95
Contributor Author

@ahus1 Thanks for the further review. I will add the test changes.
@pruivo Thanks for the further review.

@pruivo
Member

pruivo commented Feb 12, 2026

@ahus1, what tests do you have in mind?

@ahus1
Member

ahus1 commented Feb 12, 2026

what tests do you have in mind?

Sorry for not being specific. I'd like to see a test that the CLI options are accepted, or that they return errors when invalid. Additional tests around the shutdown functionality are probably out of scope and difficult to write. Unless you have ideas for that...
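For illustration, a manual version of such a check could look like this (hypothetical invocations and values; the actual assertions would live in HttpDistTest.java):

    # accepted: the server starts with both options set
    bin/kc.sh start-dev --shutdown-delay=2s --shutdown-timeout=5s

    # rejected: an invalid duration should fail with a clear parse error
    bin/kc.sh start-dev --shutdown-delay=not-a-duration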

@ahus1 ahus1 marked this pull request as ready for review February 13, 2026 13:28
@ahus1 ahus1 requested review from a team as code owners February 13, 2026 13:28
@ahus1 ahus1 enabled auto-merge (squash) February 13, 2026 13:29
Member

@ahus1 ahus1 left a comment


Thank you for this change, @ruchikajha95, and everyone who helped review and contribute!

@ahus1 ahus1 self-assigned this Feb 13, 2026