JGroups / JGRP-2920

Concurrent starts with JDBC_PING2 lead to a split cluster

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 5.3.19, 5.5.0, 5.4.9
    • 5.3.18
    • None

      In environments where customers deploy Keycloak via AWS Fargate, we see situations where two Keycloak instances in a cluster both register themselves as the Infinispan coordinator in the jgroups_ping table.

      The following log excerpts are from a local Docker Compose setup.

      kc1-1  | 2025-07-18 19:20:34,313 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc1-35658: no members discovered after 2 ms: creating cluster as coordinator
      kc2-1  | 2025-07-18 19:20:34,314 INFO  [org.jgroups.JChannel] (main) local_addr: 6fcea7e2-3e50-4239-88a2-e3a529cdd27a, name: kc2-31834
      kc2-1  | 2025-07-18 19:20:34,321 INFO  [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800
      kc2-1  | 2025-07-18 19:20:34,324 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc2-31834: no members discovered after 2 ms: creating cluster as coordinator
      kc1-1  | 2025-07-18 19:20:34,325 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc1-35658|0] (1) [kc1-35658]
      kc1-1  | 2025-07-18 19:20:34,327 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
      kc2-1  | 2025-07-18 19:20:34,336 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc2-31834|0] (1) [kc2-31834]
       

      This situation heals itself after a few seconds or, sometimes, minutes:

      kc1-1  | 2025-07-18 19:21:22,368 INFO  [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
      kc1-1  | 2025-07-18 19:21:22,368 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
      kc1-1  | 2025-07-18 19:21:22,372 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
      kc1-1  | 2025-07-18 19:21:22,372 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
      kc2-1  | 2025-07-18 19:21:23,380 INFO  [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
      kc2-1  | 2025-07-18 19:21:23,380 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
      kc2-1  | 2025-07-18 19:21:23,385 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
      kc2-1  | 2025-07-18 19:21:23,386 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
      

      The situation is worse when 4 nodes are started simultaneously, as we also see "no physical address, dropping message" in the logs.

      Proposed solution

      Without transactions in the JDBC_PING protocol, it's not possible to fully prevent the above scenarios, but we can reduce the chance of them happening by doing the following:

      1. On initial discovery, read from the DB, write the local address, and then re-read the DB table until a coordinator exists or the re-read ping data is the same as the result of the initial DB query. Two threads cannot interleave these steps without at least one of them reading both coordinator entries during discovery, so we are safe without additional database measures.
      2. When remove_all_data_on_view_change=true, only remove addresses that are not part of the current view.
      3. Call addDiscoveryResponseToCaches on view change to prevent "no physical address, dropping message".
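
      To make steps 1 and 2 concrete, here is a minimal sketch that uses an in-memory map as a stand-in for the jgroups_ping table. It is not the JGroups implementation: the class, the method names (discover, removeStale), the Outcome enum and the maxReads bound are all hypothetical. It also assumes that "the same as the initial DB query" means the re-read rows, ignoring our own row, equal the rows seen on the first read.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a map (address -> "is coordinator") replaces the real table.
class PingDiscoverySketch {

    // Hypothetical result type, not part of JGroups.
    enum Outcome { COORDINATOR_FOUND, STABLE_WITHOUT_COORDINATOR, STILL_CHANGING }

    // Snapshot of all rows except our own, so reads are comparable before and after the write.
    static Map<String, Boolean> readOthers(Map<String, Boolean> table, String self) {
        Map<String, Boolean> copy = new HashMap<>(table);
        copy.remove(self);
        return copy;
    }

    // Step 1: read, write the local address, then re-read until a coordinator
    // shows up or the other rows match the initial read.
    static Outcome discover(Map<String, Boolean> table, String self, int maxReads) {
        Map<String, Boolean> initial = readOthers(table, self); // initial read
        table.putIfAbsent(self, false);                         // write local address
        for (int i = 0; i < maxReads; i++) {
            // A real protocol would pause between reads; omitted here.
            Map<String, Boolean> current = readOthers(table, self);
            if (current.containsValue(true)) {
                return Outcome.COORDINATOR_FOUND;
            }
            if (current.equals(initial)) {
                return Outcome.STABLE_WITHOUT_COORDINATOR;
            }
        }
        return Outcome.STILL_CHANGING;
    }

    // Step 2: on view change, drop only rows whose address is not in the current view.
    static void removeStale(Map<String, Boolean> table, Set<String> viewMembers) {
        table.keySet().retainAll(viewMembers);
    }

    public static void main(String[] args) {
        Map<String, Boolean> table = new ConcurrentHashMap<>();

        System.out.println(discover(table, "kc1", 3)); // STABLE_WITHOUT_COORDINATOR
        table.put("kc1", true);                        // kc1 registers as coordinator

        System.out.println(discover(table, "kc2", 3)); // COORDINATOR_FOUND

        table.put("old-node", false);                  // leftover row from a departed member
        removeStale(table, Set.of("kc1", "kc2"));
        System.out.println(table.keySet().contains("old-node")); // false
    }
}
```

      In this sketch the second node sees the first node's coordinator row on its re-read and joins instead of creating its own cluster. The sketch runs single-threaded, so it does not demonstrate the concurrent interleaving argument from step 1.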

              remerson@redhat.com Ryan Emerson