@chance-coleman

Description

Related Issue #42401

This PR adds two key improvements to Keycloak:

  1. General-Purpose HTTP Retry Mechanism: Implements a general-purpose HTTP retry mechanism that allows automatic retrying of HTTP requests when they fail due to transient issues such as network timeouts, temporary server errors, or other recoverable conditions.

  2. OCSP Retry UI Configuration: Adds UI configuration inputs for OCSP retry count and timeout settings in the X.509 client certificate authentication flow. This allows administrators to configure retry behavior for OCSP certificate validation through the Keycloak admin UI.

The implementation extends the existing HttpClientProvider interface to provide retriable HTTP clients that can be used throughout the codebase. This approach provides a consistent way to handle retries across different components, including OCSP certificate validation.

Motivation

In environments with unreliable network connections or when interacting with services that may experience temporary outages, HTTP requests can fail unnecessarily. This is particularly problematic for critical operations like OCSP certificate validation, where a temporary network issue could prevent a user from authenticating.

By implementing an opt-in general-purpose retry mechanism, we can improve the reliability of Keycloak in these environments without changing default behavior or duplicating retry logic across different components.

Implements a configurable retry mechanism for OCSP certificate validation with UI configuration in the X.509 client certificate authenticator.

Also adds a general-purpose HTTP retry mechanism with exponential backoff and jitter that can be used throughout the codebase.

Retries are opt-in: the default is 0 retries, which maintains backward compatibility while allowing configuration when needed.

Closes keycloak#42401

Signed-off-by: UnicornChance <[email protected]>

cleanup

Signed-off-by: UnicornChance <[email protected]>

cleanup

Signed-off-by: UnicornChance <[email protected]>

cleanup

Signed-off-by: UnicornChance <[email protected]>

cleanup

Signed-off-by: UnicornChance <[email protected]>
Contributor

@slaskawi slaskawi left a comment

LGTM with a small nitpick, which is up to you to implement.

Contributor

@rmartinc rmartinc left a comment

Thanks @chance-coleman for the PR!

I have added two comments, but I would like @mposolda to check this as well. So do not start changing code until he has also reviewed the PR and given his opinion.

}

@Override
public CloseableHttpClient getRetriableHttpClient(RetryConfig retryConfig) {
Contributor

When I thought about this, my initial idea was to add just one retriable configuration for everything. So this method would be getRetriableHttpClient(), and the retry options would be defined in this factory as options (similar to the proxy options for the normal client). This would simplify the configuration and would remove the need to add UI options in each place where we decide to use the retry. This way, adding this to CRL, for example, is just a matter of changing the method, without adding any new configuration options. Besides, it avoids a lot of the burden in this PR. WDYT?

The same configuration should be used for both clients, but the retriable one adds the specifics for the retry. Only two clients are defined, and each is initialized only once. By default (if no retry is configured), the same client is used.

Author

Ahh, I can see how this would be simpler and more consistent. The only downside I can think of here is less flexibility in how retries would be approached. I'm open to this implementation if that is what you all would prefer.

Contributor

@chance-coleman has a point here, in my opinion. Depending on the area of the code, we may want to use different settings. In other words, what works correctly for OCSP might not be a good fit for Identity Providers or other functionality.

Contributor

Yep, this is just a matter of what we prefer. Let's see what other people think about this.

Contributor

@mposolda mposolda Sep 15, 2025

+1 for just getRetriableHttpClient() . If we ever need different configuration for different "use-cases" of HTTP client, we can add new method if needed.

I can imagine that we may want different configurations for different use-cases, however this might mean also different configurations for other aspects than just retry configuration. For more flexibility, maybe the best is to have the method like:

            @Override
            public CloseableHttpClient getHttpClient(String context) {

where context is for example something like "ocsp" or "facebook-idp". Then in the configuration of the HTTP client provider, we can add namespaces for various configuration options. For example, we can have ocsp--proxy-mappings if we want a different proxy-mappings configuration for the case when the HTTP client is called in the context of "ocsp". This will also allow different retry configurations for "ocsp" and for other things. The Config.Scope has some support for namespaces in itself AFAIK.

I do not insist on this, but would prefer something like this instead of introducing getRetriableHttpClient(RetryConfig) , which IMO does not have much flexibility and at the same time, it is not used anywhere right now.

The getRetriableHttpClient() is also less ideal than namespaces IMO, but maybe good compromise if others prefer this instead of introducing "namespaces" .
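
To make the namespace idea concrete, here is a minimal sketch of how a context-specific option could fall back to a global one. It is not Keycloak's Config.Scope API: the class NamespacedConfigSketch, the plain Map backing it, and the option name "max-retries" are all assumptions made for illustration. Only the "<context>--<option>" naming shape comes from the comment above.

```java
import java.util.Map;

/** Hypothetical sketch: context-namespaced options with a global fallback. */
final class NamespacedConfigSketch {

    private final Map<String, String> options;

    NamespacedConfigSketch(Map<String, String> options) {
        this.options = options;
    }

    /** Looks up "<context>--<key>" first, then plain "<key>", then the default. */
    int getInt(String context, String key, int defaultValue) {
        String value = options.get(context + "--" + key);
        if (value == null) {
            value = options.get(key); // fall back to the provider-wide option
        }
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        // Assumed option names, for illustration only.
        NamespacedConfigSketch config = new NamespacedConfigSketch(Map.of(
                "max-retries", "0",
                "ocsp--max-retries", "3"));
        System.out.println("ocsp: " + config.getInt("ocsp", "max-retries", 0));
        System.out.println("facebook-idp: " + config.getInt("facebook-idp", "max-retries", 0));
    }
}
```

With this shape, a call site such as getHttpClient("ocsp") would pick up the OCSP-specific value, while any context without its own entry keeps the global default.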

Author

@mposolda I can see the benefits in this approach. The context-based approach is more flexible and maintainable in the long run. The main trade-off is slightly more complex configuration. It's definitely a balance and partly a matter of preference. Ultimately it's your codebase, so I'm happy to make changes either way!

(Math.random() * retryConfig.getJitterFactor() * 2.0);
delay = (long)(baseDelay * jitter);
}
Thread.sleep(delay);
Contributor

I really don't like this sleep. What is the reason for this? Normal DefaultHttpRequestRetryHandler is not enough for you?

Author

Hmmm good question, my understanding was that Apache HttpClient 4.5.14 DefaultHttpRequestRetryHandler does not implement any backoff or delay between retries. The retry logic simply checks if we should retry and immediately returns the decision.

The sleep in the code was necessary to implement the exponential backoff with jitter. Without it, the retries would happen in rapid succession. I believe backoff with jitter is the more robust solution, although it is more complex.
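
For readers unfamiliar with the technique, the following is a minimal sketch of an exponential-backoff delay with jitter. It is not the PR's code: the class BackoffSketch and its parameter names are hypothetical, and because the start of the jitter expression is cut off in the snippet under review, the `1.0 - jitterFactor` term is an assumption. The random value is passed in so that the calculation stays deterministic and testable.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: computes the pause before a given retry attempt. */
final class BackoffSketch {

    /**
     * @param attempt       1-based retry attempt number
     * @param initialMillis delay before the first retry
     * @param multiplier    growth factor per attempt (2.0 doubles the delay)
     * @param jitterFactor  0.0 disables jitter; 0.5 spreads the delay over +/-50%
     * @param random        value in [0, 1), injected so the result is testable
     */
    static long delayMillis(int attempt, long initialMillis, double multiplier,
                            double jitterFactor, double random) {
        double baseDelay = initialMillis * Math.pow(multiplier, attempt - 1);
        // Assumed form: scale by a factor in [1 - jitterFactor, 1 + jitterFactor).
        double jitter = 1.0 - jitterFactor + (random * jitterFactor * 2.0);
        return (long) (baseDelay * jitter);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            long delay = delayMillis(attempt, 1000, 2.0, 0.5,
                    ThreadLocalRandom.current().nextDouble());
            System.out.println("retry " + attempt + " waits " + delay + " ms");
        }
    }
}
```

The jitter matters when many requests fail at once: without it, every caller would retry at the same instants and hit the recovering service in synchronized waves.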

Contributor

There is also ServiceUnavailableRetryStrategy. Couldn't that one be used instead of implementing the retry logic in our own codebase? For example, see the 2nd post from https://stackoverflow.com/questions/48541329/timeout-between-request-retries-apache-httpclient .

Author

@chance-coleman chance-coleman Sep 22, 2025

While ServiceUnavailableRetryStrategy is great for fine-tuning the handling of HTTP error responses, it's limited to HTTP-layer issues and doesn't handle lower-level network problems. Our current solution handles the full spectrum of failures, from TLS errors and dropped connections to server errors, in one straightforward implementation. This single-point approach ensures reliable retries whether the issue is in the network stack or the HTTP layer, without the complexity of managing multiple retry strategies.
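
The "single point" described above can be pictured as one loop that treats an I/O failure and a 5xx response the same way. The sketch below is an illustration under assumptions, not the PR's implementation and not an Apache HttpClient API: UnifiedRetrySketch, the Attempt interface, and the fixed pause are hypothetical, and a real client would pass a request executor and a backoff policy instead.

```java
import java.io.IOException;

/** Hypothetical sketch: one retry loop for network errors and 5xx responses. */
final class UnifiedRetrySketch {

    /** One outgoing call: returns an HTTP status code or fails with an I/O error. */
    @FunctionalInterface
    interface Attempt {
        int run() throws IOException;
    }

    static int execute(Attempt attempt, int maxRetries, long pauseMillis)
            throws IOException, InterruptedException {
        for (int tries = 0; ; tries++) {
            try {
                int status = attempt.run();
                // Success, a client error, or no retries left: hand the status back.
                if (status < 500 || tries >= maxRetries) {
                    return status;
                }
            } catch (IOException e) {
                // Network-level failure (timeout, reset, TLS error): rethrow when out of retries.
                if (tries >= maxRetries) {
                    throw e;
                }
            }
            Thread.sleep(pauseMillis); // fixed pause here; backoff with jitter would slot in
        }
    }
}
```

A caller would wrap the actual request in the Attempt lambda, so both failure kinds share the same retry budget.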

Contributor

@mposolda mposolda left a comment

@chance-coleman @rmartinc @slaskawi Added some additional review comments. WDYT?

.defaultValue(HttpClientProvider.DEFAULT_MAX_CONSUMED_RESPONSE_SIZE)
.add()
.property()
.name("http-client.default-max-retries")
Contributor

I am not sure why these configuration options have the prefix http-client here. Other config options of the provider do not have this prefix.

There is also one confusing aspect: when someone reads the documentation of this provider, they would see http-client-default-max-retries configured to 3, which implies that this is used by default by the HTTP client. But AFAIK, this is used only if getRetriableHttpClient() is called, right? By default, getHttpClient() and the other methods do not use this retry configuration at all, and requests are not retried. That is not clear from these docs...

It might be better to have something like .name("default-max-retries") (which would be 0 by default, and hence retries disabled) and .name("ocsp--default-max-retries") (which would be 3, and hence retries enabled for OCSP by default). I proposed in the other comment to use "namespaces" for the configurations. This will likely be harder to implement, but it seems to me like a clearer configuration.

@mposolda mposolda self-assigned this Sep 15, 2025
@slaskawi
Contributor

@chance-coleman @rmartinc @slaskawi Added some additional review comments. WDYT?

@mposolda LGTM! (I already added my approval, and together with @chance-coleman we addressed my comments before this PR was created.)

@slaskawi
Contributor

slaskawi commented Oct 6, 2025

Hey @mposolda 👋🏻

Could I ask you to have a look at this one? It seems @chance-coleman addressed all the comments and I think we should be close to getting it in.

Contributor

@stianst stianst left a comment

I'm not convinced this should be configured on the X509 authenticator specifically. Looking at the issue, it explicitly mentions OCSP. However, if network traffic is unstable, this will affect any outgoing HTTP request from Keycloak.

I would argue we should instead consider having config options for this for the HTTP client provider in general.

One thing to bear in mind here, though: if there are a few retries and the timeout is large, the original incoming HTTP request to Keycloak would likely have timed out in the meantime. To me that is another argument for making it a server-wide configuration option, as the incoming request timeout should be higher than the total of retries + timeouts for outgoing requests.
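
The relationship between retries and the incoming timeout can be checked with simple arithmetic. The sketch below is a rough upper bound under stated assumptions: RetryBudgetSketch and its parameters are hypothetical names, every attempt is assumed to run into the socket timeout, and jitter and connect timeouts are ignored.

```java
/** Hypothetical helper for sizing retries against the incoming request timeout. */
final class RetryBudgetSketch {

    /**
     * Worst case for one outgoing call: every attempt runs into the socket
     * timeout, and every retry first waits its full backoff delay.
     */
    static long worstCaseMillis(int maxRetries, long socketTimeoutMillis,
                                long initialBackoffMillis, double multiplier) {
        // One initial attempt plus maxRetries retries, each allowed to time out.
        long total = (maxRetries + 1L) * socketTimeoutMillis;
        double backoff = initialBackoffMillis;
        for (int retry = 0; retry < maxRetries; retry++) {
            total += (long) backoff;
            backoff *= multiplier;
        }
        return total;
    }

    public static void main(String[] args) {
        long worstCase = worstCaseMillis(3, 5_000, 1_000, 2.0);
        long incomingTimeoutMillis = 30_000; // assumed value for illustration
        System.out.println("worst case: " + worstCase + " ms, fits: "
                + (worstCase < incomingTimeoutMillis));
    }
}
```

With 3 retries, a 5 s socket timeout, and a 1 s backoff that doubles each time, the worst case is 4 × 5 s + (1 + 2 + 4) s = 27 s, so an incoming timeout of 30 s would still cover it, while a fourth retry would not.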

@ahus1 @shawkins WDYT?

Contributor

@stianst stianst left a comment

Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests

@slaskawi
Contributor

Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests

@stianst Could you please tell us more about the reasoning behind this?

When we were discussing this with @chance-coleman, we were thinking about minimizing the impact of introducing this configuration (so exactly the opposite of your thoughts). The main argument here was that Keycloak can have multiple Realms and Orgs, and each one of them might use OCSPs with different availability SLAs. Some of them might be flaky by nature (and we should retry on them) and some are expected to be rock solid (and retrying them just adds to a storm of requests and generally doesn't help bring them back online).

So I wanted to double check if we really would like to make this a server-level setting affecting all Realms, Orgs and all the egress traffic from Keycloak.

@stianst
Contributor

stianst commented Oct 16, 2025

Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests

@stianst Could you please tell us more about the reasoning behind this?

When we were discussing this with @chance-coleman, we were thinking about minimizing the impact of introducing this configuration (so exactly the opposite of your thoughts). The main argument here was that Keycloak can have multiple Realms and Orgs, and each one of them might use OCSPs with different availability SLAs. Some of them might be flaky by nature (and we should retry on them) and some are expected to be rock solid (and retrying them just adds to a storm of requests and generally doesn't help bring them back online).

So I wanted to double check if we really would like to make this a server-level setting affecting all Realms, Orgs and all the egress traffic from Keycloak.

What about client backchannel logout, brokering requests, etc.? They can also be flaky. Also, I really don't think network settings are something a realm admin should care about; that's more for the server admin (and the person who manages the environment).

I'd start with just a global setting; then perhaps add some sort of regex where you can configure which URLs should have retries and which should not. Again, I really think this is a server/environment config thing, and not something that should be managed by realm admins.
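
As an illustration of the regex idea, the sketch below decides per target URL whether retries apply. Nothing like this exists in the PR: RetryUrlFilterSketch, its constructor argument, and the example hosts are assumptions, and a full-string match is only one possible rule.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch: server-wide retry setting narrowed by URL patterns. */
final class RetryUrlFilterSketch {

    private final List<Pattern> retryable;

    RetryUrlFilterSketch(List<String> regexes) {
        // Compile once at startup; the patterns would come from server configuration.
        this.retryable = regexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    /** Returns true when the whole target URL matches at least one configured pattern. */
    boolean shouldRetry(URI target) {
        String url = target.toString();
        return retryable.stream().anyMatch(p -> p.matcher(url).matches());
    }

    public static void main(String[] args) {
        RetryUrlFilterSketch filter =
                new RetryUrlFilterSketch(List.of("https://ocsp\\.example\\.com/.*"));
        System.out.println(filter.shouldRetry(URI.create("https://ocsp.example.com/status")));
        System.out.println(filter.shouldRetry(URI.create("https://idp.example.org/token")));
    }
}
```

An empty pattern list disables retries everywhere in this sketch, which keeps the opt-in default while leaving the decision with the server admin.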

Refactored retry behavior from per-feature configuration to a simpler
server-wide approach based on reviewer feedback. This eliminates UI
complexity and provides consistent retry behavior across all HTTP clients.

Key changes:
- Removed OCSP-specific retry configuration from X.509 authenticator UI
- Simplified HttpClientProvider interface (removed parameterized method)
- Consolidated retry configuration to DefaultHttpClientFactory
- Updated documentation to reflect server-wide configuration
- Renamed timeout properties to avoid conflicts with existing settings

Configuration is now opt-in (max-retries=0 by default) and applies to
all outgoing HTTP requests including OCSP validation, identity provider
communication, and other external calls.

Signed-off-by: UnicornChance <[email protected]>
…ation

Refactored HTTP client retry implementation based on reviewer feedback to
use a single HTTP client with built-in retry behavior instead of separate
retriable and non-retriable clients.

Key changes:
- Removed getRetriableHttpClient() method from HttpClientProvider interface
- Merged retry functionality into the main httpClient in DefaultHttpClientFactory
- Extracted retry configuration to configureRetries() helper method for cleaner code
- Renamed retry-on-io-exception property to retry-on-error (more generic)
- Removed per-feature retry configuration from X.509 authenticator
- Updated OCSPProvider to use standard getHttpClient() method
- Removed unused OCSP_CONNECT_TIMEOUT constant from OCSPProvider
- Removed retry-specific timeout properties (using standard socket-timeout-millis)

Signed-off-by: UnicornChance <[email protected]>
Contributor

@slaskawi slaskawi left a comment

The code LGTM! Requested only doc changes and a few nit fixes.

Contributor

@slaskawi slaskawi left a comment

LGTM! Thanks for all the changes @chance-coleman !

@stianst This one is ready for re-review.

@slaskawi
Contributor

@chance-coleman I've just spotted - the DCO job is failing. You probably need to add a sign-off

@chance-coleman chance-coleman force-pushed the retry-httpclient-implementation branch from e865770 to 9d1b248 Compare October 21, 2025 13:28
*disable-trust-manager*::
If an outgoing request requires HTTPS and this configuration option is set to true, you do not have to specify a truststore. This setting should be used only during development and *never in production* because it will disable verification of SSL certificates. Default: false.

== Configuring retry behavior for outgoing HTTP requests
Contributor

I would add something around the fact that outgoing request retries should not exceed the timeout for incoming requests

Author

I've added a callout like this:

IMPORTANT: Do not let outgoing retry duration exceed the caller’s timeout. Otherwise, the caller may time out and see an error while {project_name} continues retrying in the background.
