Add configurable HTTP retry mechanism for OCSP validation #42535

chance-coleman · 2025-09-10T20:30:56Z

Description

Related Issue #42401

This PR adds two key improvements to Keycloak:

General-Purpose HTTP Retry Mechanism: Implements a general-purpose HTTP retry mechanism that allows automatic retrying of HTTP requests when they fail due to transient issues such as network timeouts, temporary server errors, or other recoverable conditions.
OCSP Retry UI Configuration: Adds UI configuration inputs for OCSP retry count and timeout settings in the X.509 client certificate authentication flow. This allows administrators to configure retry behavior for OCSP certificate validation through the Keycloak admin UI.

The implementation extends the existing HttpClientProvider interface to provide retriable HTTP clients that can be used throughout the codebase. This approach provides a consistent way to handle retries across different components, including OCSP certificate validation.

Motivation

In environments with unreliable network connections or when interacting with services that may experience temporary outages, HTTP requests can fail unnecessarily. This is particularly problematic for critical operations like OCSP certificate validation, where a temporary network issue could prevent a user from authenticating.

By implementing an opt-in general-purpose retry mechanism, we can improve the reliability of Keycloak in these environments without changing default behavior or duplicating retry logic across different components.

Implements a configurable retry mechanism for OCSP certificate validation with UI configuration in the X.509 client certificate authenticator. Also adds a general-purpose HTTP retry mechanism with exponential backoff and jitter that can be used throughout the codebase. The implementation is opt-in by default (0 retries) to maintain backward compatibility while allowing configuration when needed.

Closes keycloak#42401

Signed-off-by: UnicornChance <[email protected]>

cleanup

Signed-off-by: UnicornChance <[email protected]>

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

Signed-off-by: UnicornChance <[email protected]>

slaskawi

LGTM with a small nitpick, which is up to you to implement.

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

Signed-off-by: UnicornChance <[email protected]>

rmartinc

Thanks @chance-coleman for the PR!

I have added two comments, but I want @mposolda to check this as well. So do not start changing code until he has also reviewed the PR and given his opinion.

rmartinc · 2025-09-12T16:31:03Z

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

+            }
+
+            @Override
+            public CloseableHttpClient getRetriableHttpClient(RetryConfig retryConfig) {


When I thought about this my initial idea was adding just one retriable configuration for everything. So this method would be getRetriableHttpClient() and the retry options are just defined in this factory as options (similar to other options for proxy or similar in the normal client). This would simplify the configuration and would remove the need to add the UI options for the places that we decide to use the retry. This way adding this to CRL for example is just changing the method without adding any new configuration options. Besides it avoids a lot of the burden in this PR. WDYT?

The same configuration should be used for both clients, but the retriable one adds the specifics for the retry. There are only two clients defined, and each is initialized only once. By default (if no retry is configured) the same client is used.

Ahh, I can see how this would be simpler and more consistent. The only downside I can think of here is less flexibility in how retries would be approached. I'm open to this implementation if that is what you all would prefer.

@chance-coleman has a point here in my opinion. Depending on the area of the code, we may want to use different settings. In other words, what will work correctly for OCSP might not be a good fit for Identity Providers or other functionalities.

Yep, this is just a matter of what we prefer. Let's see what other people think about this.

+1 for just getRetriableHttpClient() . If we ever need different configuration for different "use-cases" of HTTP client, we can add new method if needed.

I can imagine that we may want different configurations for different use-cases, however this might mean also different configurations for other aspects than just retry configuration. For more flexibility, maybe the best is to have the method like:

@Override public CloseableHttpClient getHttpClient(String context) {

where context is for example something like "ocsp" or "facebook-idp". Then in the configuration of the HTTP client provider, we can add namespaces for various configuration options. For example we can have ocsp--proxy-mappings if we want to have a different proxy-mappings configuration for the case when the HTTP client is called in the context of "ocsp". This will also allow different retry configurations for "ocsp" and for other things. The Config.Scope has some support for namespaces in itself AFAIK.

I do not insist on this, but would prefer something like this instead of introducing getRetriableHttpClient(RetryConfig) , which IMO does not have much flexibility and at the same time, it is not used anywhere right now.

The getRetriableHttpClient() is also less ideal than namespaces IMO, but maybe good compromise if others prefer this instead of introducing "namespaces" .

@mposolda I can see the benefits in this approach. The context-based approach is more flexible and maintainable in the long run. The main trade-off is slightly more complex configuration. It's definitely a matter of balance and preference. Ultimately it's your codebase, so I'm happy to make changes either way!
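To make the "namespaces" idea from this thread concrete, here is a minimal sketch of a lookup that prefers a context-specific option and falls back to the global one. It is only an illustration: the class `NamespacedConfigSketch` and its `getInt` method are hypothetical, a plain `Map` stands in for Keycloak's `Config.Scope`, and the `--` separator and option names are taken from the examples in the comments above rather than from any merged code.

```java
import java.util.Map;

/** Hypothetical sketch; a plain Map stands in for the provider configuration. */
final class NamespacedConfigSketch {
    private final Map<String, String> options;

    NamespacedConfigSketch(Map<String, String> options) {
        this.options = options;
    }

    /**
     * Looks up "<context>--<name>" first and falls back to the
     * un-prefixed "<name>", then to the supplied default.
     */
    int getInt(String context, String name, int defaultValue) {
        String value = context == null ? null : options.get(context + "--" + name);
        if (value == null) {
            value = options.get(name);
        }
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        NamespacedConfigSketch config = new NamespacedConfigSketch(Map.of(
                "default-max-retries", "0",
                "ocsp--default-max-retries", "3"));
        // The "ocsp" context has its own value; any other context uses the global one.
        System.out.println(config.getInt("ocsp", "default-max-retries", 0));
        System.out.println(config.getInt("facebook-idp", "default-max-retries", 0));
    }
}
```

With this shape, adding retries for another use case would mean adding one prefixed option rather than a new method or UI field, which is the trade-off the thread is weighing.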

rmartinc · 2025-09-12T16:35:24Z

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

+                                            (Math.random() * retryConfig.getJitterFactor() * 2.0);
+                                    delay = (long)(baseDelay * jitter);
+                                }
+                                Thread.sleep(delay);


I really don't like this sleep. What is the reason for this? Normal DefaultHttpRequestRetryHandler is not enough for you?

Hmmm, good question. My understanding was that DefaultHttpRequestRetryHandler in Apache HttpClient 4.5.14 does not implement any backoff or delay between retries. The retry logic simply checks whether we should retry and immediately returns the decision.

The sleep in the code was necessary to implement the exponential backoff with jitter. Without it, the retries would happen in rapid succession. I believe backoff with jitter is the more robust solution, although it is more complex.

There is also ServiceUnavailableRetryStrategy . Could not that one be used instead of implementing the retry logic in our own codebase? For example see the 2nd post from https://stackoverflow.com/questions/48541329/timeout-between-request-retries-apache-httpclient .

While ServiceUnavailableRetryStrategy is great for fine-tuning the handling of HTTP error responses, it's limited to HTTP-layer issues and doesn't handle lower-level network problems. Our current solution handles the full spectrum of failures, from TLS errors and dropped connections to server errors, in one straightforward implementation. This single-point approach ensures reliable retries whether the issue is in the network stack or the HTTP layer, without the complexity of managing multiple retry strategies.
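For readers following the `Thread.sleep` discussion, the delay computation under debate can be sketched in isolation. This is not the PR's code: `BackoffSketch` and its parameters are hypothetical, and the jitter formula is only assumed to match the partially visible hunk (`Math.random() * jitterFactor * 2.0`), with the delay spread uniformly around an exponentially growing base.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; not part of Keycloak or Apache HttpClient. */
final class BackoffSketch {

    /**
     * Delay before retry number {@code attempt} (1-based).
     * base = initialMillis * 2^(attempt - 1), capped at maxMillis.
     * The result lies in [base * (1 - jitterFactor), base * (1 + jitterFactor)].
     * {@code random01} is a value in [0, 1); it is a parameter so the
     * computation can be checked deterministically.
     */
    static long delayMillis(int attempt, long initialMillis, long maxMillis,
                            double jitterFactor, double random01) {
        // Cap the shift so typical millisecond values cannot overflow a long.
        int shift = Math.min(Math.max(attempt - 1, 0), 30);
        long base = Math.min(initialMillis << shift, maxMillis);
        double jitter = 1.0 - jitterFactor + random01 * jitterFactor * 2.0;
        return (long) (base * jitter);
    }

    /** Same computation with a fresh random value per call. */
    static long delayMillis(int attempt, long initialMillis, long maxMillis,
                            double jitterFactor) {
        return delayMillis(attempt, initialMillis, maxMillis, jitterFactor,
                ThreadLocalRandom.current().nextDouble());
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("retry " + attempt + " waits "
                    + delayMillis(attempt, 1000, 10_000, 0.2) + " ms");
        }
    }
}
```

Whether the wait itself happens in a blocking sleep or inside a library-provided strategy is exactly the open question in this thread; the sketch only covers how long to wait.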

mposolda

@chance-coleman @rmartinc @slaskawi Added some additional review comments. WDYT?

mposolda · 2025-09-15T08:29:39Z

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

                .defaultValue(HttpClientProvider.DEFAULT_MAX_CONSUMED_RESPONSE_SIZE)
                .add()
+                .property()
+                .name("http-client.default-max-retries")


I am not sure why those configuration options have the prefix http-client here. Other config options of the provider do not have this prefix.

Also there is one confusing aspect: when someone reads the documentation of this provider, they would see http-client-default-max-retries configured to 3, which implies that this is used by default by the HTTP client. But AFAIK, this is used only if getRetriableHttpClient() is called, right? By default, getHttpClient() and the other methods do not use this retry configuration at all and requests are not retried, which is not clear from these docs.

It might be better if we have something like .name("default-max-retries") (which would be 0 by default, so retries are disabled) and .name("ocsp--default-max-retries") (which would be 3, so retries are enabled for OCSP by default). I proposed in the other comment to use "namespaces" for the configurations. This will likely be harder to implement, but seems to me like a clearer configuration.

slaskawi · 2025-09-29T09:22:24Z

@chance-coleman @rmartinc @slaskawi Added some additional review comments. WDYT?

@mposolda LGTM! (I already added my approval, and together with @chance-coleman we addressed my comments before this PR was created.)

slaskawi · 2025-10-06T05:31:47Z

Hey @mposolda 👋🏻

Could I ask you to have a look at this one? It seems @chance-coleman addressed all the comments and I think we should be close to getting it in.

stianst

Not convinced this should be configured on the X509 authenticator specifically. The issue explicitly mentions OCSP. However, if network traffic is unstable, then this will affect any outgoing HTTP requests from Keycloak.

I would argue we should instead consider having config options for this for the HTTP client provider in general.

One thing to bear in mind here, though: if there are a few retries and the timeout is large, the original incoming HTTP request to Keycloak would likely have timed out in the meantime. To me that is another argument to make it a server-wide configuration option, as the incoming request timeout should be higher than the total of retries + timeouts for outgoing requests.

@ahus1 @shawkins WDYT?

stianst

Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests

slaskawi · 2025-10-15T18:22:51Z

Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests

@stianst Could you please tell us more about the reasoning behind this?

When we were discussing this with @chance-coleman, we were thinking about minimizing the impact of introducing this configuration (so the exact opposite of your thoughts). The main argument here was that Keycloak can have multiple Realms and Orgs, and each one of them might use OCSP responders with different availability SLAs. Some of them might be flaky by nature (and we should retry on them) and some are expected to be rock solid (and retrying them just increases the storm of requests and generally doesn't help bring them back online).

So I wanted to double check if we really would like to make this a server-level setting affecting all Realms, Orgs and all the egress traffic from Keycloak.

stianst · 2025-10-16T08:17:27Z

What about client backchannel logout, brokering requests, etc.? They can also be flaky. Also, I really don't think network settings are something a realm admin should care about; that's more for the server admin (and the person who manages the environment).

I'd start with just a global setting, then perhaps add some sort of regex where you can configure which URLs should have retries and which should not. Again, I really think this is a server/environment config thing, and not something that should be managed by realm admins.
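The "global setting plus a regex" suggestion could look roughly like the following. This is a sketch of the idea only: `RetryUrlFilterSketch`, its constructor arguments, and the example hosts are all hypothetical and do not correspond to any option that exists in Keycloak.

```java
import java.net.URI;
import java.util.regex.Pattern;

/** Hypothetical sketch of a server-wide retry count narrowed by a URL pattern. */
final class RetryUrlFilterSketch {
    private final int maxRetries;
    private final Pattern retryableUrls; // null means "retry every URL"

    RetryUrlFilterSketch(int maxRetries, String retryableUrlRegex) {
        this.maxRetries = maxRetries;
        this.retryableUrls = retryableUrlRegex == null ? null : Pattern.compile(retryableUrlRegex);
    }

    /** Returns how many retries apply to a request for the given target. */
    int retriesFor(URI target) {
        if (maxRetries <= 0) {
            return 0; // retries are disabled globally
        }
        if (retryableUrls == null) {
            return maxRetries; // no filter configured
        }
        return retryableUrls.matcher(target.toString()).matches() ? maxRetries : 0;
    }

    public static void main(String[] args) {
        RetryUrlFilterSketch filter =
                new RetryUrlFilterSketch(3, "https://ocsp\\.example\\.com/.*");
        System.out.println(filter.retriesFor(URI.create("https://ocsp.example.com/status")));
        System.out.println(filter.retriesFor(URI.create("https://idp.example.org/token")));
    }
}
```

Keeping both values in server configuration matches the point made here that retry policy belongs to the server or environment administrator rather than to realm admins.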

Refactored retry behavior from per-feature configuration to a simpler server-wide approach based on reviewer feedback. This eliminates UI complexity and provides consistent retry behavior across all HTTP clients.

Key changes:
- Removed OCSP-specific retry configuration from X.509 authenticator UI
- Simplified HttpClientProvider interface (removed parameterized method)
- Consolidated retry configuration to DefaultHttpClientFactory
- Updated documentation to reflect server-wide configuration
- Renamed timeout properties to avoid conflicts with existing settings

Configuration is now opt-in (max-retries=0 by default) and applies to all outgoing HTTP requests including OCSP validation, identity provider communication, and other external calls.

Signed-off-by: UnicornChance <[email protected]>

server-spi-private/src/main/java/org/keycloak/connections/httpclient/HttpClientProvider.java

server-spi-private/src/main/java/org/keycloak/utils/OCSPProvider.java

.../keycloak/authentication/authenticators/x509/AbstractX509ClientCertificateAuthenticator.java

...ak/authentication/authenticators/x509/AbstractX509ClientCertificateAuthenticatorFactory.java

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

…ation Refactored HTTP client retry implementation based on reviewer feedback to use a single HTTP client with built-in retry behavior instead of separate retriable and non-retriable clients.

Key changes:
- Removed getRetriableHttpClient() method from HttpClientProvider interface
- Merged retry functionality into the main httpClient in DefaultHttpClientFactory
- Extracted retry configuration to configureRetries() helper method for cleaner code
- Renamed retry-on-io-exception property to retry-on-error (more generic)
- Removed per-feature retry configuration from X.509 authenticator
- Updated OCSPProvider to use standard getHttpClient() method
- Removed unused OCSP_CONNECT_TIMEOUT constant from OCSPProvider
- Removed retry-specific timeout properties (using standard socket-timeout-millis)

Signed-off-by: UnicornChance <[email protected]>

slaskawi

The code LGTM! Requested only doc changes and a few nit fixes.

docs/documentation/server_admin/topics/authentication/x509.adoc

services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java

slaskawi

LGTM! Thanks for all the changes @chance-coleman !

@stianst This one is ready for re-review.

slaskawi · 2025-10-21T11:10:14Z

@chance-coleman I've just spotted - the DCO job is failing. You probably need to add a sign-off

Signed-off-by: UnicornChance <[email protected]>

stianst · 2025-10-24T06:45:49Z

docs/guides/server/outgoinghttp.adoc

 *disable-trust-manager*::
 If an outgoing request requires HTTPS and this configuration option is set to true, you do not have to specify a truststore. This setting should be used only during development and *never in production* because it will disable verification of SSL certificates. Default: false.

+== Configuring retry behavior for outgoing HTTP requests


I would add something around the fact that outgoing request retries should not exceed the timeout for incoming requests

I've added a callout like this:

IMPORTANT: Do not let outgoing retry duration exceed the caller’s timeout. Otherwise, the caller may time out and see an error while {project_name} continues retrying in the background.
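The constraint in this callout can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: `RetryBudgetSketch` and its parameters are hypothetical, the per-attempt timeout stands in for whatever socket or connect timeout applies, and the backoff terms only matter if the client waits between attempts (jitter is ignored).

```java
/** Hypothetical helper for estimating the worst-case duration of a retried call. */
final class RetryBudgetSketch {

    /**
     * Worst case = every attempt runs into its timeout, plus the waits
     * between attempts. There are maxRetries + 1 attempts in total.
     */
    static long worstCaseMillis(int maxRetries, long perAttemptTimeoutMillis,
                                long initialBackoffMillis, double backoffMultiplier) {
        long total = (maxRetries + 1L) * perAttemptTimeoutMillis;
        double backoff = initialBackoffMillis;
        for (int i = 0; i < maxRetries; i++) {
            total += (long) backoff;
            backoff *= backoffMultiplier;
        }
        return total;
    }

    /** True if the worst case still finishes before the caller gives up. */
    static boolean fitsWithin(long callerTimeoutMillis, int maxRetries,
                              long perAttemptTimeoutMillis, long initialBackoffMillis,
                              double backoffMultiplier) {
        return worstCaseMillis(maxRetries, perAttemptTimeoutMillis,
                initialBackoffMillis, backoffMultiplier) < callerTimeoutMillis;
    }

    public static void main(String[] args) {
        // 3 retries, 5 s per attempt, waits of 1 s, 2 s and 4 s: 4 * 5000 + 7000 = 27000 ms.
        System.out.println(worstCaseMillis(3, 5000, 1000, 2.0));
        System.out.println(fitsWithin(30_000, 3, 5000, 1000, 2.0));
    }
}
```

In this example a caller timeout of 30 seconds leaves a small margin, while raising the per-attempt timeout to 10 seconds would push the worst case to 47 seconds and past the caller's limit.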

chance-coleman marked this pull request as ready for review September 11, 2025 12:31

chance-coleman requested review from a team as code owners September 11, 2025 12:31

keycloak-github-bot bot added team/cloud-native team/core-clients team/core-iam labels Sep 11, 2025

slaskawi reviewed Sep 11, 2025

refactor retriable client to use cached cclient to be more optimized

021fee7

Signed-off-by: UnicornChance <[email protected]>

slaskawi approved these changes Sep 12, 2025

implement equals override to simplify equality of httpclients

fb07239

Signed-off-by: UnicornChance <[email protected]>

rmartinc reviewed Sep 12, 2025

mposolda reviewed Sep 15, 2025

mposolda self-assigned this Sep 15, 2025

stianst reviewed Oct 15, 2025

stianst requested changes Oct 15, 2025

slaskawi reviewed Oct 17, 2025

chance-coleman added 2 commits October 17, 2025 07:28

slaskawi reviewed Oct 20, 2025

slaskawi approved these changes Oct 21, 2025

chance-coleman added 2 commits October 21, 2025 07:27

address feedback for more documentation and cleanup

397a50b

Signed-off-by: UnicornChance <[email protected]>

fix whitespace

9d1b248

Signed-off-by: UnicornChance <[email protected]>

chance-coleman force-pushed the retry-httpclient-implementation branch from e865770 to 9d1b248 October 21, 2025 13:28

Merge branch 'main' into retry-httpclient-implementation

6b9eac6

stianst reviewed Oct 24, 2025

add doc callout about the danger of not configuring the timeout appropriately

e202cdf

Signed-off-by: UnicornChance <[email protected]>
