- Bug
- Resolution: Done
- Major
- 5.3.18
In environments where customers deploy Keycloak via AWS Fargate, we see situations where two Keycloak instances in a cluster both register themselves as the Infinispan coordinator in the jgroups_ping table.
The following log excerpts are from a local Docker Compose reproduction.
kc1-1 | 2025-07-18 19:20:34,313 INFO [org.jgroups.protocols.pbcast.GMS] (main) kc1-35658: no members discovered after 2 ms: creating cluster as coordinator
kc2-1 | 2025-07-18 19:20:34,314 INFO [org.jgroups.JChannel] (main) local_addr: 6fcea7e2-3e50-4239-88a2-e3a529cdd27a, name: kc2-31834
kc2-1 | 2025-07-18 19:20:34,321 INFO [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800
kc2-1 | 2025-07-18 19:20:34,324 INFO [org.jgroups.protocols.pbcast.GMS] (main) kc2-31834: no members discovered after 2 ms: creating cluster as coordinator
kc1-1 | 2025-07-18 19:20:34,325 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc1-35658|0] (1) [kc1-35658]
kc1-1 | 2025-07-18 19:20:34,327 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
kc2-1 | 2025-07-18 19:20:34,336 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc2-31834|0] (1) [kc2-31834]
This situation heals itself after a few seconds, or sometimes minutes, when the two single-node clusters merge:
kc1-1 | 2025-07-18 19:21:22,368 INFO [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc1-1 | 2025-07-18 19:21:22,368 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc1-1 | 2025-07-18 19:21:22,372 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc1-1 | 2025-07-18 19:21:22,372 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc2-1 | 2025-07-18 19:21:23,380 INFO [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc2-1 | 2025-07-18 19:21:23,380 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc2-1 | 2025-07-18 19:21:23,385 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
kc2-1 | 2025-07-18 19:21:23,386 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
The situation is worse when 4 nodes are started simultaneously, as we also see "no physical address, dropping message" in the logs.
Proposed solution
Without transactions in the JDBC_PING protocol, it is not possible to fully prevent the above scenarios, but we can reduce the chances of them happening by doing the following:
1. On initial discovery, read from the DB, write the local address, and then re-read the DB table until a coordinator exists or the subsequent ping data is the same as the initial DB query. Two threads cannot interleave these steps without at least one of them reading both coordinator entries during discovery, so we are safe without adding additional database measures.
2. When remove_all_data_on_view_change=true, only remove addresses that are not part of the current view.
3. Call addDiscoveryResponseToCaches on view change to prevent "no physical address, dropping message".
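
Steps 1 and 2 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the JGroups or Keycloak implementation: `PingTable`, `PingRow`, `discover`, and `prune_stale_rows` are hypothetical names, the table is an in-memory dictionary rather than a real database, and the retry count and wait interval are arbitrary.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PingRow:
    """One row of ping data: a member address and its coordinator flag."""
    address: str
    is_coord: bool = False


class PingTable:
    """In-memory stand-in for the jgroups_ping table (illustrative only)."""

    def __init__(self):
        self._rows = {}

    def read(self):
        return frozenset(self._rows.values())

    def write(self, row):
        self._rows[row.address] = row

    def remove(self, address):
        self._rows.pop(address, None)


def _without(rows, address):
    return frozenset(r for r in rows if r.address != address)


def discover(table, local_address, max_rounds=5, wait=lambda: time.sleep(0.1)):
    """Step 1: read, write the local address, then re-read until a coordinator
    exists or the other members' ping data matches the initial read."""
    initial = _without(table.read(), local_address)
    table.write(PingRow(local_address))
    current = table.read()
    for _ in range(max_rounds):
        has_coord = any(r.is_coord for r in current)
        unchanged = _without(current, local_address) == initial
        if has_coord or unchanged:
            break
        wait()  # another node is joining concurrently; give it time to settle
        current = table.read()
    return current


def prune_stale_rows(table, view_members):
    """Step 2: on view change, remove only rows that are not in the current view."""
    for row in table.read():
        if row.address not in view_members:
            table.remove(row.address)
```

In this sketch, a node that starts alone gets back only its own row and may safely become coordinator, while a node that races with another sees the other node's row on the re-read instead of concluding that no members exist. How the real protocol resolves that race after discovery is outside the scope of the sketch.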