Add configurable HTTP retry mechanism for OCSP validation #42535
Implements a configurable retry mechanism for OCSP certificate validation with UI configuration in the X.509 client certificate authenticator. Also adds a general-purpose HTTP retry mechanism with exponential backoff and jitter that can be used throughout the codebase. The implementation is opt-in by default (0 retries) to maintain backward compatibility while allowing configuration when needed.
Closes keycloak#42401
Signed-off-by: UnicornChance <[email protected]>
cleanup
Signed-off-by: UnicornChance <[email protected]>
cleanup
Signed-off-by: UnicornChance <[email protected]>
cleanup
Signed-off-by: UnicornChance <[email protected]>
cleanup
Signed-off-by: UnicornChance <[email protected]>
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
Signed-off-by: UnicornChance <[email protected]>
LGTM with a small nitpick, which is up to you to implement.
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
Signed-off-by: UnicornChance <[email protected]>
Thanks @chance-coleman for the PR!
I have added two comments, but I want @mposolda to check this as well. So do not start changing code until he has also reviewed the PR and given his opinion.
    }

    @Override
    public CloseableHttpClient getRetriableHttpClient(RetryConfig retryConfig) {
When I thought about this my initial idea was adding just one retriable configuration for everything. So this method would be getRetriableHttpClient() and the retry options are just defined in this factory as options (similar to other options for proxy or similar in the normal client). This would simplify the configuration and would remove the need to add the UI options for the places that we decide to use the retry. This way adding this to CRL for example is just changing the method without adding any new configuration options. Besides it avoids a lot of the burden in this PR. WDYT?
The same configuration should be used for both clients, but the retriable one adds the specifics for the retry. There are only two clients defined, and each is initialized only once. By default (if no retry is configured) the same client is used.
ahh i can see how this would be simpler and more consistent. the only downside i can think of here is less flexibility in how retries would be approached. I'm open to this implementation if that is what you all would prefer.
@chance-coleman has a point here in my opinion. Depending on the area of the code, we may want to use different settings. In other words, what works correctly for OCSP might not be a good fit for Identity Providers or other functionality.
Yep, this is just a matter of what we prefer. Let's see what other people think about this.
+1 for just getRetriableHttpClient() . If we ever need different configuration for different "use-cases" of HTTP client, we can add new method if needed.
I can imagine that we may want different configurations for different use-cases, however this might mean also different configurations for other aspects than just retry configuration. For more flexibility, maybe the best is to have the method like:
@Override
public CloseableHttpClient getHttpClient(String context) {
where context is for example something like "ocsp" or "facebook-idp". Then in the configuration of the Http-client provider, we can add namespaces for various configuration options. For example we can have ocsp--proxy-mappings if we want to have a different proxy-mappings configuration for the case when the HTTP client is called in the context of "ocsp". This will also allow different retry configurations for "ocsp" and for other things. The Config.Scope has some support for namespaces in itself AFAIK.
I do not insist on this, but would prefer something like this instead of introducing getRetriableHttpClient(RetryConfig) , which IMO does not have much flexibility and at the same time, it is not used anywhere right now.
The getRetriableHttpClient() is also less ideal than namespaces IMO, but maybe good compromise if others prefer this instead of introducing "namespaces" .
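As a rough illustration of the namespace idea discussed above, the sketch below resolves an option by first trying a context-prefixed key such as "ocsp--max-retries" and then falling back to the plain key. The class name, the "max-retries" option name, and the use of a plain Map in place of Keycloak's Config.Scope are all assumptions made for the example; this is not how the PR implements configuration.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for the provider's real configuration scope.
final class NamespacedHttpClientConfig {
    private final Map<String, String> options;

    NamespacedHttpClientConfig(Map<String, String> options) {
        this.options = options;
    }

    /** Tries "<context>--<name>" first, then "<name>", then the supplied default. */
    int getInt(String context, String name, int defaultValue) {
        String scoped = context == null ? null : options.get(context + "--" + name);
        String value = scoped != null ? scoped : options.get(name);
        return value != null ? Integer.parseInt(value) : defaultValue;
    }
}
```

With options {"max-retries": "0", "ocsp--max-retries": "3"}, a lookup in the "ocsp" context would yield 3, while any other context would fall back to 0, which matches the "disabled by default, enabled for OCSP" behaviour suggested in the thread.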
@mposolda i can see the benefits in this approach. The context-based approach is more flexible and maintainable in the long run. The main trade-off is slightly more complex configuration. It's definitely a balance and some preference, ultimately it's your codebase so i'm happy to make changes either way!
        (Math.random() * retryConfig.getJitterFactor() * 2.0);
    delay = (long)(baseDelay * jitter);
}
Thread.sleep(delay);
I really don't like this sleep. What is the reason for this? Normal DefaultHttpRequestRetryHandler is not enough for you?
Hmmm good question, my understanding was that Apache HttpClient 4.5.14 DefaultHttpRequestRetryHandler does not implement any backoff or delay between retries. The retry logic simply checks if we should retry and immediately returns the decision.
The sleep in the code was necessary to implement exponential backoff with jitter. Without it, the retries would happen in rapid succession. I believe backoff with jitter is the more robust solution, although it is more complex.
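For readers unfamiliar with the technique, here is a minimal sketch of an exponential backoff delay with jitter. It assumes the jitter term is a random factor in [1 - jitterFactor, 1 + jitterFactor), which is consistent with the diff fragment quoted in this thread but is not confirmed by it; the method and parameter names are hypothetical.

```java
import java.util.function.DoubleSupplier;

// Illustrative only; the PR's actual delay computation is only partially visible in the diff.
final class BackoffSketch {
    /**
     * Delay before retry number `attempt` (1-based): initialMillis * multiplier^(attempt - 1),
     * scaled by a random factor in [1 - jitterFactor, 1 + jitterFactor).
     * `random` must return values in [0, 1), for example Math::random.
     */
    static long delayMillis(int attempt, long initialMillis, double multiplier,
                            double jitterFactor, DoubleSupplier random) {
        double base = initialMillis * Math.pow(multiplier, attempt - 1);
        double jitter = 1.0 - jitterFactor + random.getAsDouble() * jitterFactor * 2.0;
        return (long) (base * jitter);
    }
}
```

Passing the random source as a parameter keeps the calculation testable; a caller would typically pass `Math::random` and then sleep for the returned number of milliseconds.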
There is also ServiceUnavailableRetryStrategy . Could not that one be used instead of implementing the retry logic in our own codebase? For example see the 2nd post from https://stackoverflow.com/questions/48541329/timeout-between-request-retries-apache-httpclient .
While ServiceUnavailableRetryStrategy is great for fine-tuning HTTP error responses, it's limited to HTTP-layer issues and doesn't handle lower-level network problems. Our current solution handles the full spectrum of failures; from TLS errors and dropped connections to server errors, in one straightforward implementation. This single-point approach ensures reliable retries whether the issue is in the network stack or the HTTP layer, without the complexity of managing multiple retry strategies.
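The "single-point" idea described above can be sketched without any HTTP library: one loop treats both a thrown IOException and a 5xx status as retriable. The Attempt interface and all names below are hypothetical stand-ins, and a fixed pause replaces the backoff calculation for brevity; this is not the PR's code.

```java
import java.io.IOException;

// Illustrative retry loop; not tied to Apache HttpClient.
final class RetryLoopSketch {
    /** Hypothetical stand-in for "send the request and return the HTTP status code". */
    interface Attempt {
        int run() throws IOException;
    }

    /**
     * Runs the attempt up to maxRetries + 1 times. An IOException and a 5xx status are
     * both treated as retriable; the last outcome is returned or rethrown.
     */
    static int execute(Attempt attempt, int maxRetries, long pauseMillis)
            throws IOException, InterruptedException {
        for (int tries = 0; ; tries++) {
            try {
                int status = attempt.run();
                if (status < 500 || tries >= maxRetries) {
                    return status;
                }
            } catch (IOException e) {
                if (tries >= maxRetries) {
                    throw e;
                }
            }
            Thread.sleep(pauseMillis);
        }
    }
}
```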
@chance-coleman @rmartinc @slaskawi Added some additional review comments. WDYT?
.defaultValue(HttpClientProvider.DEFAULT_MAX_CONSUMED_RESPONSE_SIZE)
.add()
.property()
.name("http-client.default-max-retries")
I am not sure why those configuration options have prefix http-client here? Other config options of the provider do not have this prefix.
Also there is one confusing aspect: when someone reads the documentation of this provider, they would see http-client-default-max-retries configured to 3, which implies that this is used by default by the HTTP client. But AFAIK, this is used only if getRetriableHttpClient() is called, right? By default, getHttpClient() and the other methods do not use this retry configuration at all and requests are not retried. This is not clear from the docs...
It might be better to have something like .name("default-max-retries") (which would be 0 by default and hence disable retries) and .name("ocsp--default-max-retries") (which would be 3 and hence enable retries for OCSP by default). I proposed in the other comment to use "namespaces" for the configurations. This will likely be harder to implement, but it seems to me like a clearer configuration.
@mposolda LGTM! (I already added my approval and together with @chance-coleman we addressed my comments before this PR has been created).
Hey @mposolda 👋🏻 Could I ask you to have a look at this one? It seems @chance-coleman addressed all the comments and I think we should be close to getting it in.
I'm not convinced this should be configured on the X509 authenticator specifically. The issue explicitly mentions OCSP. However, if network traffic is unstable, this will affect any outgoing HTTP request from Keycloak.
I would argue we should instead consider having config options for this on the HTTP client provider in general.
One thing to bear in mind, though: if there are a few retries and the timeout is large, the original incoming HTTP request to Keycloak would likely have timed out in the meantime. To me that is another argument for making it a server-wide configuration option, as the incoming request timeout should be higher than the total of retries + timeouts for outgoing requests.
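The timeout concern above comes down to simple arithmetic. The sketch below estimates a worst-case duration for one outgoing call, assuming every attempt runs into the socket timeout and backoff grows exponentially without jitter; the method and parameter names are hypothetical, and connect timeouts and jitter are ignored, so a real deployment should leave extra headroom.

```java
// Back-of-the-envelope estimate only; not part of the PR.
final class RetryBudgetSketch {
    /** Worst case: all maxRetries + 1 attempts time out and every backoff pause is taken. */
    static long worstCaseMillis(int maxRetries, long socketTimeoutMillis,
                                long initialBackoffMillis, double multiplier) {
        long total = (maxRetries + 1L) * socketTimeoutMillis;
        for (int retry = 0; retry < maxRetries; retry++) {
            total += (long) (initialBackoffMillis * Math.pow(multiplier, retry));
        }
        return total;
    }
}
```

For example, 3 retries with a 5000 ms socket timeout, a 1000 ms initial backoff and a multiplier of 2 give 4 × 5000 + (1000 + 2000 + 4000) = 27000 ms, so the incoming request timeout would need to be comfortably above 27 seconds under these assumptions.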
Thinking more about this, it really isn't something we should make configurable as fine-grained as within the X509 authenticator, but just make a server wide config option for outgoing http requests
@stianst Could you please tell us more about the reasoning behind this? When we were discussing this with @chance-coleman, we were thinking about minimizing the impact of introducing this configuration (so exactly the opposite of your thoughts). The main argument here was that Keycloak can have multiple Realms and Orgs, and each one of them might use OCSP responders with different availability SLAs. Some of them might be flaky by nature (and we should retry on them) and some are expected to be rock solid (and retrying them just adds to a storm of requests and generally doesn't help bring them back online). So I wanted to double check whether we really would like to make this a server-level setting affecting all Realms, Orgs and all the egress traffic from Keycloak.
What about client backchannel logout, brokering requests, etc.? They can also be flaky. Also, I really don't think network settings are something a realm admin should care about; that's more for the server admin (and the person who manages the environment). I'd start with just a global setting, then perhaps add some sort of regex where you can configure which URLs should have retries and which should not. Again, I really think this is a server/environment config thing, and not something that should be managed by realm admins.
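The regex idea floated above could look roughly like the following. This is purely a hypothetical sketch: no such option exists in the PR, and the class name, the pattern, and the example hosts are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical URL filter: retries apply only to URLs matching a server-level pattern.
final class RetryUrlFilterSketch {
    private final Pattern retriable;

    RetryUrlFilterSketch(String regex) {
        this.retriable = Pattern.compile(regex);
    }

    /** Returns the global retry count if the whole URL matches the pattern, otherwise 0. */
    int maxRetriesFor(String url, int globalMaxRetries) {
        return retriable.matcher(url).matches() ? globalMaxRetries : 0;
    }
}
```

A server admin could then enable retries for a flaky OCSP responder with a pattern such as `https://ocsp\.example\.com/.*` while leaving all other egress traffic without retries.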
Refactored retry behavior from per-feature configuration to a simpler server-wide approach based on reviewer feedback. This eliminates UI complexity and provides consistent retry behavior across all HTTP clients.
Key changes:
- Removed OCSP-specific retry configuration from X.509 authenticator UI
- Simplified HttpClientProvider interface (removed parameterized method)
- Consolidated retry configuration to DefaultHttpClientFactory
- Updated documentation to reflect server-wide configuration
- Renamed timeout properties to avoid conflicts with existing settings
Configuration is now opt-in (max-retries=0 by default) and applies to all outgoing HTTP requests including OCSP validation, identity provider communication, and other external calls.
Signed-off-by: UnicornChance <[email protected]>
server-spi-private/src/main/java/org/keycloak/connections/httpclient/HttpClientProvider.java
server-spi-private/src/main/java/org/keycloak/utils/OCSPProvider.java
.../keycloak/authentication/authenticators/x509/AbstractX509ClientCertificateAuthenticator.java
...ak/authentication/authenticators/x509/AbstractX509ClientCertificateAuthenticatorFactory.java
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
…ation Refactored HTTP client retry implementation based on reviewer feedback to use a single HTTP client with built-in retry behavior instead of separate retriable and non-retriable clients. Signed-off-by: UnicornChance <[email protected]>
…ation Refactored HTTP client retry implementation based on reviewer feedback to use a single HTTP client with built-in retry behavior instead of separate retriable and non-retriable clients.
Key changes:
- Removed getRetriableHttpClient() method from HttpClientProvider interface
- Merged retry functionality into the main httpClient in DefaultHttpClientFactory
- Extracted retry configuration to configureRetries() helper method for cleaner code
- Renamed retry-on-io-exception property to retry-on-error (more generic)
- Removed per-feature retry configuration from X.509 authenticator
- Updated OCSPProvider to use standard getHttpClient() method
- Removed unused OCSP_CONNECT_TIMEOUT constant from OCSPProvider
- Removed retry-specific timeout properties (using standard socket-timeout-millis)
Signed-off-by: UnicornChance <[email protected]>
The code LGTM! Requested only doc changes and a few nit fixes.
docs/documentation/server_admin/topics/authentication/x509.adoc
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
services/src/main/java/org/keycloak/connections/httpclient/DefaultHttpClientFactory.java
LGTM! Thanks for all the changes @chance-coleman !
@stianst This one is ready for re-review.
@chance-coleman I've just spotted - the DCO job is failing. You probably need to add a sign-off
Signed-off-by: UnicornChance <[email protected]>
Signed-off-by: UnicornChance <[email protected]>
Force-pushed from e865770 to 9d1b248
*disable-trust-manager*::
If an outgoing request requires HTTPS and this configuration option is set to true, you do not have to specify a truststore. This setting should be used only during development and *never in production* because it will disable verification of SSL certificates. Default: false.

== Configuring retry behavior for outgoing HTTP requests
I would add something around the fact that outgoing request retries should not exceed the timeout for incoming requests
I've added a callout like this:
IMPORTANT: Do not let outgoing retry duration exceed the caller’s timeout. Otherwise, the caller may time out and see an error while {project_name} continues retrying in the background.
…priately Signed-off-by: UnicornChance <[email protected]>
Description
Related Issue #42401
This PR adds two key improvements to Keycloak:
1. General-Purpose HTTP Retry Mechanism: Implements a general-purpose HTTP retry mechanism that allows automatic retrying of HTTP requests when they fail due to transient issues such as network timeouts, temporary server errors, or other recoverable conditions.
2. OCSP Retry UI Configuration: Adds UI configuration inputs for OCSP retry count and timeout settings in the X.509 client certificate authentication flow. This allows administrators to configure retry behavior for OCSP certificate validation through the Keycloak admin UI.
The implementation extends the existing HttpClientProvider interface to provide retriable HTTP clients that can be used throughout the codebase. This approach provides a consistent way to handle retries across different components, including OCSP certificate validation.
Motivation
In environments with unreliable network connections or when interacting with services that may experience temporary outages, HTTP requests can fail unnecessarily. This is particularly problematic for critical operations like OCSP certificate validation, where a temporary network issue could prevent a user from authenticating.
By implementing an opt-in general-purpose retry mechanism, we can improve the reliability of Keycloak in these environments without changing default behavior or duplicating retry logic across different components.