Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-2870

Cluster is not correctly formed with JDBC_PING2

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Blocker Blocker
    • 5.4.6
    • None
    • None
    • False
    • None
    • False

      This issue can be reproduced within Keycloak cluster tests but it happens very rarely. The following is an artificial test created to reproduce it but with regular keycloak cluster tests this happens ~once in 100 clustering workflow runs.

      The test is doing the following steps.
      1. Before the test node1 and node2 are started
      2. Kill node1
      3. Start node1
      4. Check cluster is formed
      steps 2-4 are then repeated 100 times.

      Here is the source code of the test.
      Here are the settings for JDBC_PING2
      Used database: Postgres

      The test was running with log level set to TRACE for org.jgroups.protocols. The whole log is attached to the Jira.

      From my perspective here is the relevant part of the log:

      Iteration 30 start
      ---------- Killing node
      14:42:41,058 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) 172.17.0.1:7801: connecting to 172.17.0.1:7800
      ...
      2025-03-27 14:42:41,062 INFO  [org.infinispan.CLUSTER] () ISPN000094: Received new cluster view for channel ISPN: [node2|62] (1) [node2]
      ...
      14:42:41,240 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) JGRP000029: node2: failed sending message to node1 (61 bytes): java.net.ConnectException: Connection refused, headers: UNICAST3: ACK, seqno=83, ts=175, TP: [cluster=ISPN]
      14:42:41,440 TRACE [org.jgroups.protocols.UNICAST3] () node2 --> node1: resending(#54)
      ...
      ---------- Node killed
      Wait: 100ms
      ---------- Starting node
      ...
      14:42:42,242 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) JGRP000029: node2: failed sending message to node1 (74 bytes): java.net.ConnectException: Connection refused, headers: RequestCorrelator: corr_id=0, type=RSP, req_id=44, rsp_expected=true, UNICAST3: DATA, seqno=54, conn_id=31, TP: [cluster=ISPN]
      14:42:42,442 TRACE [org.jgroups.protocols.UNICAST3] () node2 --> node1: resending(#54)
      ...
      // node2 is trying to resend the message a few more times
      ...
      14:42:47,739 INFO  [org.jgroups.JChannel] (main) local_addr: 80a3bbef-c294-4250-9f21-b9142b218ca3, name: node1
      14:42:47,746 TRACE [org.jgroups.protocols.JDBC_PING2] () node2: SQL for deletion: wrapped[ DELETE from JGROUPS_PING WHERE address=('uuid://861cd8e9-2c98-4fcd-914c-8ed053306acf') ]
      ...
      14:42:47,747 TRACE [org.jgroups.protocols.JDBC_PING2] () node2: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://861cd8e9-2c98-4fcd-914c-8ed053306acf'), ('node2'), ('ISPN'), ('172.17.0.1:7801'), ('TRUE'::boolean)) ]
      ...
      14:42:47,751 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for reading: wrapped[ SELECT address, name, ip, coord FROM JGROUPS_PING WHERE cluster_name=('ISPN') ]
      ...
      14:42:47,752 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://80a3bbef-c294-4250-9f21-b9142b218ca3'), ('node1'), ('ISPN'), ('172.17.0.1:7800'), ('FALSE'::boolean)) ]
      ...
      14:42:47,754 INFO  [org.jgroups.protocols.pbcast.GMS] (main) node1: no members discovered after 4 ms: creating cluster as coordinator
      ...
      14:42:47,770 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://80a3bbef-c294-4250-9f21-b9142b218ca3'), ('node1'), ('ISPN'), ('172.17.0.1:7800'), ('TRUE'::boolean)) ]
      ...
      14:42:47,774 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [node1|0] (1) [node1]
      

      It seems the following is happening:
      1. node2 starts sending a message to node1
      2. The test kills node1
      3. The message from 1. is not received because node1 is killed but node2 repeats sending the message during next steps
      4. The test starts node1
      5. node2 gives up sending the message from step1 and during ?exception handling? the jdbc ping table is updated in two steps. First it deletes itself from the table
      6. the node1 is starting up and checks for other nodes in the cluster by checking the jdbc ping table and it fails to find existing members so it insert itself as coordinator
      7. node2 insert itself as coordinator
      8. Both nodes think they are coordinators of the cluster

      Note from the attached log file I removed iteration 0-28 because they passed correctly. I included iteration 29 as it may contain some important log messages but the failed iteration is iteration 30.

            [JGRP-2870] Cluster is not correctly formed with JDBC_PING2

            Ryan Emerson added a comment - - edited

            rhn-engineering-bban We've managed to reproduce this just using JGroups channels. I have created a reproducer in the Infinispan repository so that DBs are automatically provisioned, but other than that it's just using JGroups code. Here's my branch: https://github.com/ryanemerson/infinispan/tree/jdbc_ping2_split_brain

            To execute the tests run:

            mvn clean install -DskipTests -am -pl server/tests
            mvn verify -pl server/tests -Dorg.infinispan.test.database.types=postgres -Dit.test=JGroupsJdbcPing2IT#testClusterFormedAfterRestart
            

            Setting remove_all_data_on_view_change=false results in the number of iterations required for the test to fail to increase, but the failure still happens consistently.

            Trace logs are enabled for org.jgroups.protocols when logging to a file and the files can be found in server/tests/target

            Ryan Emerson added a comment - - edited rhn-engineering-bban We've managed to reproduce this just using JGroups channels. I have created a reproducer in the Infinispan repository so that DBs are automatically provisioned, but other than that it's just using JGroups code. Here's my branch: https://github.com/ryanemerson/infinispan/tree/jdbc_ping2_split_brain To execute the tests run: mvn clean install -DskipTests -am -pl server/tests mvn verify -pl server/tests -Dorg.infinispan.test.database.types=postgres -Dit.test=JGroupsJdbcPing2IT#testClusterFormedAfterRestart Setting remove_all_data_on_view_change=false results in the number of iterations required for the test to fail to increase, but the failure still happens consistently. Trace logs are enabled for org.jgroups.protocols when logging to a file and the files can be found in server/tests/target

            Bela Ban added a comment -

            This test should run in plan JGroups, so can you provide me with an XML configuration of JDBC_PING?

            Bela Ban added a comment - This test should run in plan JGroups, so can you provide me with an XML configuration of JDBC_PING ?

              rhn-engineering-bban Bela Ban
              mhajas@redhat.com Michal Hajas
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: