Failing node leads to breakdown of cluster #46298

@Madeveda

Description

Before reporting an issue

  • I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

infinispan

Describe the bug

We run a single-cluster deployment using a StatefulSet with 5 replicas (previously 3). When the underlying k8s node of one of the replicas becomes NotReady (possibly because of other workloads on that node), the pod gets stuck in "Terminating". The other Keycloak nodes then experience timeouts when talking to the pod on this node (as expected).

After 3 minutes, the failed node leaves the Keycloak cluster and a new cluster view is created. Instead of the cluster returning to normal operation with the remaining nodes, communication between the remaining nodes breaks down (resulting in timeouts), and they ultimately report as not healthy.

Version

26.5.2

Regression

  • The issue is a regression

Expected behavior

The cluster resumes normal operation after one node leaves.

Actual behavior

Logs after initial node fails:

2026-02-12 16:43:44.719 error ISPN000476: Timed out waiting for responses for request 3264119 from keycloak-k1-3-40412 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12 16:43:44.720 error ISPN000476: Timed out waiting for responses for request 3264119 from keycloak-k1-3-40412 after 15 seconds Uncaught server error
2026-02-12 16:43:48.890 error ISPN000476: Timed out waiting for responses for request 3264126 from keycloak-k1-3-40412 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12 16:43:50.720 error ISPN000476: Timed out waiting for responses for request 3264133 from keycloak-k1-3-40412 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12 16:43:52.955 error ISPN000427: Timeout after 15 seconds waiting for acks ([keycloak-k1-3-40412]). Id=3006, Topology Id=332 ISPN000136: Error executing command PutKeyValueCommand on Cache 'authenticationSessions', writing keys [PqtGNanhtEt4WwzoPbFafyMs]
2026-02-12 16:43:52.956 trace org.infinispan.commons.TimeoutException: ISPN000427: Timeout after 15 seconds waiting for acks ([keycloak-k1-3-40412]). Id=3006, Topology Id=332 Uncaught server error
...

Logs after new cluster view:

...
2026-02-12T15:46:45.806Z	 ISPN000094: Received new cluster view for channel ISPN: [keycloak-k1-0-2128|75] (4) [keycloak-k1-0-2128, keycloak-k1-1-24929, keycloak-k1-2-59615, keycloak-k1-4-56737]
2026-02-12T15:46:45.806Z	 Reloading JGroups Certificate
2026-02-12T15:46:45.823Z	 ISPN100001: Node keycloak-k1-3-40412 left the cluster
2026-02-12T15:46:45.823Z	 ISPN100001: Node keycloak-k1-3-40412 left the cluster
2026-02-12T15:47:00.604Z	ISPN000476: Timed out waiting for responses for request 3266982 from keycloak-k1-2-59615 after 14.25 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'actionTokens', writing keys []
2026-02-12T15:47:00.604Z	ISPN000476: Timed out waiting for responses for request 3266982 from keycloak-k1-2-59615 after 14.25 seconds Uncaught server error
2026-02-12T15:47:00.913Z	ISPN000476: Timed out waiting for responses for request 3267026 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.917Z	ISPN000476: Timed out waiting for responses for request 3267028 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.927Z	ISPN000476: Timed out waiting for responses for request 3267031 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.932Z	ISPN000476: Timed out waiting for responses for request 3267032 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.933Z	ISPN000476: Timed out waiting for responses for request 3267033 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.951Z	ISPN000476: Timed out waiting for responses for request 3267037 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.951Z	ISPN000476: Timed out waiting for responses for request 3267038 from keycloak-k1-2-59615 after 14.96 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.980Z	ISPN000476: Timed out waiting for responses for request 3267040 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.983Z	ISPN000476: Timed out waiting for responses for request 3267041 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.983Z	ISPN000476: Timed out waiting for responses for request 3267042 from keycloak-k1-1-24929 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.986Z	ISPN000476: Timed out waiting for responses for request 3267043 from keycloak-k1-1-24929 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.986Z	ISPN000476: Timed out waiting for responses for request 3267044 from keycloak-k1-1-24929 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.987Z	ISPN000476: Timed out waiting for responses for request 3267045 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.990Z	ISPN000476: Timed out waiting for responses for request 3267046 from keycloak-k1-1-24929 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.992Z	ISPN000476: Timed out waiting for responses for request 3267047 from keycloak-k1-0-2128 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.997Z	ISPN000476: Timed out waiting for responses for request 3267048 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.997Z	ISPN000476: Timed out waiting for responses for request 3267049 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.997Z	ISPN000476: Timed out waiting for responses for request 3267050 from keycloak-k1-0-2128 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
2026-02-12T15:47:00.997Z	ISPN000476: Timed out waiting for responses for request 3267051 from keycloak-k1-0-2128 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []
...
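
To see which members the ISPN000476 timeouts target after the view change, the log lines can be tallied per node. This is only a minimal sketch, assuming the log format shown in the excerpt above; the function name `count_timeouts_by_node` is illustrative and not part of Keycloak or Infinispan.

```python
import re
from collections import Counter

# Captures the node name in "... request <id> from <node> after ...".
# The pattern is an assumption based on the ISPN000476 lines in this report.
TIMEOUT_RE = re.compile(
    r"ISPN000476: Timed out waiting for responses for request \d+ from (\S+) after"
)


def count_timeouts_by_node(lines):
    """Return a Counter mapping target node name -> number of ISPN000476 timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the excerpt above (after the new cluster view).
sample = [
    "2026-02-12T15:47:00.983Z ISPN000476: Timed out waiting for responses for request 3267041 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []",
    "2026-02-12T15:47:00.983Z ISPN000476: Timed out waiting for responses for request 3267042 from keycloak-k1-1-24929 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []",
    "2026-02-12T15:47:00.987Z ISPN000476: Timed out waiting for responses for request 3267045 from keycloak-k1-2-59615 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []",
    "2026-02-12T15:47:00.992Z ISPN000476: Timed out waiting for responses for request 3267047 from keycloak-k1-0-2128 after 15 seconds ISPN000136: Error executing command GetKeyValueCommand on Cache 'clientSessions', writing keys []",
]

if __name__ == "__main__":
    for node, count in count_timeouts_by_node(sample).most_common():
        print(node, count)
```

On the lines above, the tally includes keycloak-k1-0-2128, keycloak-k1-1-24929 and keycloak-k1-2-59615, all of which are members of the new view, which is what distinguishes this phase from the earlier timeouts against keycloak-k1-3-40412.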

Cluster reporting as not healthy after another 40 seconds:

2026-02-12 16:47:40.987 info SRHCK01001: Reporting health down status: {"status":"DOWN","checks":[{"name":"Keycloak cluster health check","status":"DOWN","data":{"Failing since":"2026-02-12 15:47:40,984"}},{"name":"Keycloak database connections async health check","status":"UP"}]}
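
For anyone scanning many of these health lines, the JSON payload can be parsed to list only the checks that are DOWN. A minimal sketch, assuming the payload starts at the first `{` and runs to the end of the line, as in the line above; `failing_checks` is a hypothetical helper name.

```python
import json


def failing_checks(log_line):
    """Return the names of checks reported as DOWN in an SRHCK01001 log line."""
    start = log_line.find("{")
    if start == -1:
        return []  # no JSON payload on this line
    payload = json.loads(log_line[start:])
    return [
        check["name"]
        for check in payload.get("checks", [])
        if check.get("status") == "DOWN"
    ]


# Message part of the health line from this report.
line = (
    'SRHCK01001: Reporting health down status: {"status":"DOWN","checks":'
    '[{"name":"Keycloak cluster health check","status":"DOWN",'
    '"data":{"Failing since":"2026-02-12 15:47:40,984"}},'
    '{"name":"Keycloak database connections async health check","status":"UP"}]}'
)

if __name__ == "__main__":
    print(failing_checks(line))  # prints ['Keycloak cluster health check']
```

Here only the cluster health check is DOWN, while the database connection check stays UP.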

A "rollout restart" of the StatefulSet fixes the cluster state.

How to Reproduce?

Deploy a StatefulSet with multiple replicas (the issue occurred with both 3 and 5).
Environment:
KC_CACHE: ispn
KC_CACHE_STACK: jdbc-ping

This has happened multiple times, but it is not clear how to reproduce it reliably.

Anything else?

No response
