-
Bug
-
Resolution: Unresolved
-
Blocker
-
None
-
None
-
False
-
None
-
False
-
-
This issue can be reproduced within Keycloak cluster tests but it happens very rarely. The following is an artificial test created to reproduce it but with regular keycloak cluster tests this happens ~once in 100 clustering workflow runs.
The test is doing the following steps.
1. Before the test node1 and node2 are started
2. Kill node1
3. Start node1
4. Check cluster is formed
steps 2-4 are then repeated 100 times.
Here is the source code of the test.
Here are the settings for JDBC_PING2
Used database: Postgres
The test was running with log level set to TRACE for org.jgroups.protocols. The whole log is attached to the Jira.
From my perspective here is the relevant part of the log:
Iteration 30 start ---------- Killing node 14:42:41,058 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) 172.17.0.1:7801: connecting to 172.17.0.1:7800 ... 2025-03-27 14:42:41,062 INFO [org.infinispan.CLUSTER] () ISPN000094: Received new cluster view for channel ISPN: [node2|62] (1) [node2] ... 14:42:41,240 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) JGRP000029: node2: failed sending message to node1 (61 bytes): java.net.ConnectException: Connection refused, headers: UNICAST3: ACK, seqno=83, ts=175, TP: [cluster=ISPN] 14:42:41,440 TRACE [org.jgroups.protocols.UNICAST3] () node2 --> node1: resending(#54) ... ---------- Node killed Wait: 100ms ---------- Starting node ... 14:42:42,242 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-5,node2) JGRP000029: node2: failed sending message to node1 (74 bytes): java.net.ConnectException: Connection refused, headers: RequestCorrelator: corr_id=0, type=RSP, req_id=44, rsp_expected=true, UNICAST3: DATA, seqno=54, conn_id=31, TP: [cluster=ISPN] 14:42:42,442 TRACE [org.jgroups.protocols.UNICAST3] () node2 --> node1: resending(#54) ... // node2 is trying to resend the message a few more times ... 14:42:47,739 INFO [org.jgroups.JChannel] (main) local_addr: 80a3bbef-c294-4250-9f21-b9142b218ca3, name: node1 14:42:47,746 TRACE [org.jgroups.protocols.JDBC_PING2] () node2: SQL for deletion: wrapped[ DELETE from JGROUPS_PING WHERE address=('uuid://861cd8e9-2c98-4fcd-914c-8ed053306acf') ] ... 14:42:47,747 TRACE [org.jgroups.protocols.JDBC_PING2] () node2: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://861cd8e9-2c98-4fcd-914c-8ed053306acf'), ('node2'), ('ISPN'), ('172.17.0.1:7801'), ('TRUE'::boolean)) ] ... 14:42:47,751 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for reading: wrapped[ SELECT address, name, ip, coord FROM JGROUPS_PING WHERE cluster_name=('ISPN') ] ... 14:42:47,752 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://80a3bbef-c294-4250-9f21-b9142b218ca3'), ('node1'), ('ISPN'), ('172.17.0.1:7800'), ('FALSE'::boolean)) ] ... 14:42:47,754 INFO [org.jgroups.protocols.pbcast.GMS] (main) node1: no members discovered after 4 ms: creating cluster as coordinator ... 14:42:47,770 TRACE [org.jgroups.protocols.JDBC_PING2] (main) node1: SQL for insertion: wrapped[ INSERT INTO JGROUPS_PING values (('uuid://80a3bbef-c294-4250-9f21-b9142b218ca3'), ('node1'), ('ISPN'), ('172.17.0.1:7800'), ('TRUE'::boolean)) ] ... 14:42:47,774 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [node1|0] (1) [node1]
It seems the following is happening:
1. node2 starts sending a message to node1
2. The test kills node1
3. The message from 1. is not received because node1 is killed but node2 repeats sending the message during next steps
4. The test starts node1
5. node2 gives up sending the message from step1 and during ?exception handling? the jdbc ping table is updated in two steps. First it deletes itself from the table
6. the node1 is starting up and checks for other nodes in the cluster by checking the jdbc ping table and it fails to find existing members so it insert itself as coordinator
7. node2 insert itself as coordinator
8. Both nodes think they are coordinators of the cluster
Note from the attached log file I removed iteration 0-28 because they passed correctly. I included iteration 29 as it may contain some important log messages but the failed iteration is iteration 30.
rhn-engineering-bban We've managed to reproduce this just using JGroups channels. I have created a reproducer in the Infinispan repository so that DBs are automatically provisioned, but other than that it's just using JGroups code. Here's my branch: https://github.com/ryanemerson/infinispan/tree/jdbc_ping2_split_brain
To execute the tests run:
Setting remove_all_data_on_view_change=false results in the number of iterations required for the test to fail to increase, but the failure still happens consistently.
Trace logs are enabled for org.jgroups.protocols when logging to a file and the files can be found in server/tests/target