
Conversation

@ryanemerson
Contributor

@ryanemerson ryanemerson commented Jun 12, 2025

Closes #39429
Closes #40472

Test failures introduced by #39126

Previously, model/infinispan/src/main/java/org/keycloak/connections/infinispan/DefaultInfinispanConnectionProviderFactory.java ensured that the caches were created with 2 owners if sessionsOwners was not explicitly configured. Since those changes, however, the test-ispn.xml configuration is loaded, which uses num_owners=1 for the offlineSessions cache.

When looking up an offlineSession that is not present in the cache, the InfinispanUserSessionProvider calls getUserSessionEntityFromPersistenceProvider, which:

  1. Loads the persistent session from the UserSessionPersisterProvider
  2. Imports it into memory using importUserSession
    2a. Writes to the cache via session.sessions().importUserSessions
    2b. Retrieves the UserSessionEntity by reading it back from the cache and returns it

The test has become flaky as 2b is a cache miss if a rebalance occurs between 2a and 2b.
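As a simplified sketch of the pattern in steps 2a/2b (not the actual Keycloak code, just the shape of the race against the plain Infinispan Cache API):

import org.infinispan.Cache;

// With num_owners=1 the entry written in 2a has exactly one copy. If the owning node
// leaves and a rebalance completes before 2b, the read-back returns null and the
// session appears to be missing even though the import itself succeeded.
static Object importAndReadBack(Cache<String, Object> offlineSessions, String id, Object entity) {
    offlineSessions.put(id, entity);  // 2a: import writes the only copy
    // <-- a node may leave and a rebalance may run here -->
    return offlineSessions.get(id);   // 2b: cache miss if the single owner is gone
}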

Sample log entries:

3131:10:00:33,666 DEBUG [org.keycloak.models.sessions.infinispan.changes.InfinispanChangelogBasedTransaction] (pool-4-thread-1) key c08a915e-41ba-414f-a3f4-26ed82e08e44 not found in updates
3134:10:00:33,666 DEBUG [org.keycloak.models.sessions.infinispan.changes.InfinispanChangelogBasedTransaction] (pool-4-thread-1) key c08a915e-41ba-414f-a3f4-26ed82e08e44 not found in cache
3135:10:00:33,666 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) getUserSessionEntity id=c08a915e-41ba-414f-a3f4-26ed82e08e44 entityWrapper==null
3136:10:00:33,666 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) Offline user-session not found in infinispan, attempting UserSessionPersisterProvider lookup for sessionId=c08a915e-41ba-414f-a3f4-26ed82e08e44
3145:10:00:33,668 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) Attempting to import user-session for sessionId=c08a915e-41ba-414f-a3f4-26ed82e08e44 offline=true
3147:10:00:33,668 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) importUserSession stream: c08a915e-41ba-414f-a3f4-26ed82e08e44 | true
3150:10:00:33,668 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) key=c08a915e-41ba-414f-a3f4-26ed82e08e44 value=SessionEntityWrapper{version=eab6587e-b99c-496a-bce7-ef8806826137, entity=UserSessionEntity [id=c08a915e-41ba-414f-a3f4-26ed82e08e44, realm=727c9dce-19ac-44d6-bc6c-468466451ba4, lastSessionRefresh=1749718831, clients=[]], localMetadata={}}
3151:10:00:33,668 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) importSessionsWithExpiration put id=c08a915e-41ba-414f-a3f4-26ed82e08e44
3155:10:00:33,669 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) importSessionsWithExpiration after id=c08a915e-41ba-414f-a3f4-26ed82e08e44
3158:10:00:33,669 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) after importUserSessions sessionId=c08a915e-41ba-414f-a3f4-26ed82e08e44 offline=true
3159:10:00:33,669 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) user-session imported, trying another lookup for sessionId=c08a915e-41ba-414f-a3f4-26ed82e08e44 offline=true
3160:10:00:33,670 DEBUG [org.keycloak.models.sessions.infinispan.changes.InfinispanChangelogBasedTransaction] (pool-4-thread-1) key c08a915e-41ba-414f-a3f4-26ed82e08e44 not found in updates
3162:10:00:33,670 WARN  [org.infinispan.CLUSTER] (pool-4-thread-2) [Context=offlineSessions] ISPN000312: Lost data because of graceful leaver node-3
3170:10:00:33,671 INFO  [org.infinispan.LIFECYCLE] () [Context=offlineSessions] ISPN100002: Starting rebalance with members [node-4, node-2], phase READ_OLD_WRITE_ALL, topology id 11
3171:10:00:33,671 INFO  [org.infinispan.LIFECYCLE] () [Context=offlineSessions] ISPN100002: Starting rebalance with members [node-4, node-2], phase READ_OLD_WRITE_ALL, topology id 11
3172:10:00:33,671 INFO  [org.infinispan.LIFECYCLE] (non-blocking-thread-node-2-p17-t17) [Context=offlineSessions] ISPN100010: Finished rebalance with members [node-4, node-2], topology id 11
3173:10:00:33,671 INFO  [org.infinispan.LIFECYCLE] (non-blocking-thread-node-4-p12-t38) [Context=offlineSessions] ISPN100010: Finished rebalance with members [node-4, node-2], topology id 11
3174:10:00:33,671 DEBUG [org.keycloak.models.sessions.infinispan.changes.InfinispanChangelogBasedTransaction] (pool-4-thread-1) key c08a915e-41ba-414f-a3f4-26ed82e08e44 not found in cache
3175:10:00:33,671 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) getUserSessionEntity id=c08a915e-41ba-414f-a3f4-26ed82e08e44 entityWrapper==null
3176:10:00:33,672 DEBUG [org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider] (pool-4-thread-1) user-session could not be found after import for sessionId=c08a915e-41ba-414f-a3f4-26ed82e08e44 offline=true

The solution is to ensure that num_owners=2 is always defined for volatile-session tests, as we recommend to users for their own configurations.

I have executed the test 100 times consecutively without failure locally; without the fix it would consistently fail before the 20th iteration.

@pruivo
Member

pruivo commented Jun 12, 2025

@ahus1 we changed the defaults when we enabled the persistent user sessions feature here:

af53af1#diff-6003597969d1c5eba2b531cb775d0fa212a6e1ca68a9dd5ae4a2ae79cdf7cf22

Based on Ryan's findings, it seems to be an unsafe configuration. If a user is using volatile sessions, they will end up with caches with num_owners=1 and incur data loss.

Instead of creating a new XML for the test, we would need to change the default back to 2. This also affects KC 26.2.

@ahus1
Contributor

ahus1 commented Jun 12, 2025

Thank you for the analysis. I didn't realize back then that one owner would be problematic during rebalancing.

When we implemented it, we added the following statement to https://www.keycloak.org/server/caching:

Change owners attribute of the distributed-cache tag to 2 or more

This test is obviously not doing it, and we also make it very complicated for users to do the right thing.

We already have a check that prints a warning here: "Number of owners is one for cache %s, and no persistence is configured."

if (builder.memory().maxCount() == 10000 && (name.equals(USER_SESSION_CACHE_NAME) || name.equals(CLIENT_SESSION_CACHE_NAME))) {
    logger.warnf("Persistent user sessions disabled and memory limit is set to default value 10000. Ignoring cache limits to avoid losing sessions for cache %s.", name);
    builder.memory().maxCount(-1);
}
if (builder.clustering().hash().attributes().attribute(HashConfiguration.NUM_OWNERS).get() == 1 && builder.persistence().stores().isEmpty()) {
    logger.warnf("Number of owners is one for cache %s, and no persistence is configured. This might be a misconfiguration as you will lose data when a single node is restarted!", name);
}

Instead of having the manual instructions and the extra XML file, I suggest automatically updating one owner to two owners, and removing that manual procedure from the documentation.
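A minimal sketch of that idea, reusing the builder, logger and name from the snippet above (the exact message and placement are assumptions, not an actual implementation):

if (builder.clustering().hash().attributes().attribute(HashConfiguration.NUM_OWNERS).get() == 1
        && builder.persistence().stores().isEmpty()) {
    // Hypothetical: instead of only warning, raise the owners so a second copy of each entry exists.
    logger.warnf("Number of owners is one for cache %s and no persistence is configured. Raising owners to two to avoid data loss.", name);
    builder.clustering().hash().numOwners(2);
}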

Let me know your thoughts. If you think an additional discussion is needed, please schedule a meeting for next week, for example Monday.

@ahus1
Contributor

ahus1 commented Jun 12, 2025

Another thought: writing to the cache and reading back is a strange implementation anyway. Still, I hesitate to change this in the old code. Let's talk next week about whether this should be touched as well.

@pruivo
Member

pruivo commented Jun 13, 2025

I didn't realize back then that one owner would be problematic during rebalancing.

Me neither. The only scenario I have in mind is when the originator has an old topology and sends the read to a node that is no longer the owner. It is worth investigating; I'll try to check.

I suggest automatically updating one owner to two owners, and removing that manual procedure from the documentation.

I'm ok with it, but I bet there will be a user who prioritizes speed over consistency complaining that they can no longer use num_owners=1.

Another thought: writing to the cache and reading back is a strange implementation anyway. Still, I hesitate to change this in the old code. Let's talk next week about whether this should be touched as well.

The importUserSessions method returns void; maybe it can be improved 🤷
The method is deprecated for removal.

@ryanemerson
Contributor Author

Me neither. The only scenario I have in mind is when the originator has an old topology and sends the read to a node that is no longer the owner. It is worth investigating; I'll try to check.

I discussed this with @jabolina. The issue in this case is that even though node-3 is gracefully leaving, we don't transfer its state on graceful leave, so we lose the entry as it's only stored on node-3.

I suggest automatically updating one owner to two owners, and removing that manual procedure from the documentation.

I'm ok with it, but I bet there will be a user who prioritizes speed over consistency complaining that they can no longer use num_owners=1.

+1 to making the configuration always set num_owners=2 with volatile sessions. The user may prioritize speed, but it comes at the expense of correctness, and they may not realise the implications of this configuration. I think we should cater for the majority of users here and see whether "power" users have issues with this before reconsidering allowing num_owners=1 with volatile sessions.

@ryanemerson
Contributor Author

I have pushed a commit so that when volatile sessions are configured, we always configure at least num_owners=2 if no shared persistence store is present.

It's important that we distinguish between a shared and a non-shared store: a non-shared store will not prevent data loss even with a StatefulSet, because we're not using global state, so restarted nodes appear as new cluster members, potentially causing segments to be remapped on rebalance.
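Roughly, the distinction could be expressed like this against a built Infinispan Configuration (a sketch with an invented helper name, not the code in this PR):

import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.StoreConfiguration;

// Hypothetical helper: a cache is only considered safe for volatile sessions if it either
// has a shared store (which survives a node being replaced) or keeps at least two owners.
static boolean safeForVolatileSessions(Configuration config) {
    boolean hasSharedStore = config.persistence().stores().stream()
            .anyMatch(StoreConfiguration::shared);
    return hasSharedStore || config.clustering().hash().numOwners() >= 2;
}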

@ryanemerson ryanemerson force-pushed the 39429/OfflineSessionPersistenceTest_testPersistenceMultipleNodesClientSessionsAtRandomNode branch from 757f168 to d1a1084 Compare June 13, 2025 09:49
@pruivo
Member

pruivo commented Jun 13, 2025

I discussed this with @jabolina. The issue in this case is that even though node-3 is gracefully leaving, we don't transfer its state on graceful leave, so we lose the entry as it's only stored on node-3.

I'm surprised by this.


@keycloak-github-bot keycloak-github-bot bot left a comment

Unreported flaky test detected, please review

@keycloak-github-bot

Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.cluster.ClientScopeInvalidationClusterTest#crudWithFailover

Keycloak CI - Clustering IT

java.lang.RuntimeException: java.lang.IllegalStateException: Keycloak unexpectedly died :(
	at org.keycloak.testsuite.arquillian.containers.KeycloakQuarkusServerDeployableContainer.start(KeycloakQuarkusServerDeployableContainer.java:71)
	at org.jboss.arquillian.container.impl.ContainerImpl.start(ContainerImpl.java:185)
	at org.jboss.arquillian.container.impl.client.container.ContainerLifecycleController$8.perform(ContainerLifecycleController.java:137)
	at org.jboss.arquillian.container.impl.client.container.ContainerLifecycleController$8.perform(ContainerLifecycleController.java:133)
...

Report flaky test

org.keycloak.testsuite.cluster.RealmInvalidationClusterTest#crudWithFailover

Keycloak CI - Clustering IT

java.lang.RuntimeException: java.lang.IllegalStateException: Keycloak unexpectedly died :(
	at org.keycloak.testsuite.arquillian.containers.KeycloakQuarkusServerDeployableContainer.start(KeycloakQuarkusServerDeployableContainer.java:71)
	at org.jboss.arquillian.container.impl.ContainerImpl.start(ContainerImpl.java:185)
	at org.jboss.arquillian.container.impl.client.container.ContainerLifecycleController$8.perform(ContainerLifecycleController.java:137)
	at org.jboss.arquillian.container.impl.client.container.ContainerLifecycleController$8.perform(ContainerLifecycleController.java:133)
...

Report flaky test

@ryanemerson ryanemerson marked this pull request as ready for review June 13, 2025 10:41
@ryanemerson ryanemerson requested a review from a team as a code owner June 13, 2025 10:41
@ryanemerson ryanemerson force-pushed the 39429/OfflineSessionPersistenceTest_testPersistenceMultipleNodesClientSessionsAtRandomNode branch from d1a1084 to ab48883 Compare June 13, 2025 10:55
@ryanemerson
Contributor Author

I have created #40472 to track this work as an enhancement and have updated the commits to reflect this. IMO we should just apply this change from 26.3 onwards, as the required configuration was documented previously.

@jabolina

I've created a small gist [1] that you can run with JBang to try the different configurations. You should be able to run it with jbang https://gist.github.com/jabolina/3b1222441f40a3f2fd6ed5fdc1fbd9a3.

By default, it will create a cluster of 3 nodes with numOwners=1, remove some nodes, and verify whether the entries are still present. You can check the gist for the other CLI options (global state, number of entries, how many nodes to remove, etc.).

[1] https://gist.github.com/jabolina/3b1222441f40a3f2fd6ed5fdc1fbd9a3
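For reference, the same kind of experiment can be reproduced with a few lines of embedded Infinispan (a standalone sketch, not the gist itself; the class and cache names are made up):

import java.util.ArrayList;
import java.util.List;

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

// Three embedded nodes, a distributed cache with a single owner, then one node leaves.
public class OwnerLossDemo {
    public static void main(String[] args) {
        ConfigurationBuilder cacheConfig = new ConfigurationBuilder();
        cacheConfig.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(1);

        List<DefaultCacheManager> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            DefaultCacheManager cm = new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());
            cm.defineConfiguration("offlineSessions", cacheConfig.build());
            cm.getCache("offlineSessions"); // make sure every node joins the cache
            nodes.add(cm);
        }

        Cache<Integer, String> cache = nodes.get(0).getCache("offlineSessions");
        for (int i = 0; i < 1000; i++) {
            cache.put(i, "session-" + i);
        }

        nodes.get(2).stop(); // graceful leave; per the discussion above, its segments are not handed off
        System.out.println("entries left: " + cache.size()); // expect fewer than 1000 with a single owner
    }
}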

@ahus1 ahus1 self-assigned this Jun 13, 2025
@ahus1
Contributor

ahus1 commented Jun 16, 2025

I've approved this change. As I wasn't involved in the discussions last week, I want to give @pruivo the opportunity to comment on this one as well.

I'm all in to merge this for 26.3. Once this is merged, we can have a separate discussion on whether it should be part of 26.2 as well.

@ahus1 ahus1 merged commit 78f575b into keycloak:main Jun 16, 2025
76 checks passed
