- Bug
- Resolution: Done
- Major
- 8.2.0.Beta2
The test starts 3 nodes concurrently but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within initialClusterTimeout + 1 second.
However, because of a bug in TEST_PING, the first 2 nodes each see the other as the coordinator and send a JOIN request to each other, and it takes 3 seconds to recover and start the cluster properly.
The bug in TEST_PING is actually a hack introduced for ISPN-5106. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's TEST_PING before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove MERGE3 from the protocol stack, the partitions would never merge and the test would fail with a timeout.
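The race described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class `DiscoveryRaceSketch`, its `Response` type and the `chooseCoordinator` method are hypothetical names, and the selection rule (a flagged coordinator wins, otherwise the lowest UUID) is a simplification, not the actual JGroups or TEST_PING code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical, simplified model of coordinator selection during discovery.
// Not the actual JGroups or TEST_PING code.
final class DiscoveryRaceSketch {

    /** One discovery response: who answered, and whether it reported itself as coordinator. */
    static final class Response {
        final UUID sender;
        final boolean coordinator;

        Response(UUID sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /**
     * A responder flagged as coordinator wins. Otherwise fall back to the
     * lowest UUID among the local node and all responders.
     */
    static UUID chooseCoordinator(UUID self, List<Response> responses) {
        for (Response r : responses) {
            if (r.coordinator) {
                return r.sender;
            }
        }
        List<UUID> candidates = new ArrayList<>();
        candidates.add(self);
        for (Response r : responses) {
            candidates.add(r.sender);
        }
        return Collections.min(candidates);
    }

    public static void main(String[] args) {
        UUID a = new UUID(0L, 2L); // A started first and already installed its own view
        UUID b = new UUID(0L, 1L); // B happens to sort before A

        // Assumption: the response naming A is not flagged as coordinator,
        // because the discovery layer has not seen A's view yet.
        List<Response> seenByB = Collections.singletonList(new Response(a, false));
        System.out.println("B elects itself: " + chooseCoordinator(b, seenByB).equals(b));
    }
}
```

In this model, B elects itself whenever its UUID sorts before A's and the response is not flagged, which leaves A and B as coordinators of two separate partitions.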
I fixed this in TEST_PING by assuming that, when there is a single discovery response, its sender is the coordinator. This worked because all but a few tests start their managers sequentially; however, it sometimes introduces this 3-second delay when nodes start in parallel.
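A minimal sketch of the single-response heuristic shows why it works for sequential starts and misfires for parallel ones. The class `SingleResponseHeuristicSketch` and its method are hypothetical names for illustration. The model assumes no response carries a coordinator flag, and the lowest-UUID fallback is an assumption as well. This is not the actual TEST_PING code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical model of the "single response means coordinator" workaround.
// Not the actual TEST_PING implementation.
final class SingleResponseHeuristicSketch {

    /**
     * If exactly one node answered, assume it is the coordinator.
     * Otherwise fall back to the lowest UUID among self and responders
     * (the fallback rule is an assumption made for this sketch).
     */
    static UUID chooseCoordinator(UUID self, List<UUID> responders) {
        if (responders.size() == 1) {
            return responders.get(0);
        }
        List<UUID> candidates = new ArrayList<>(responders);
        candidates.add(self);
        return Collections.min(candidates);
    }

    public static void main(String[] args) {
        UUID a = new UUID(0L, 2L);
        UUID b = new UUID(0L, 1L);

        // Sequential start: A is already running, B discovers only A and joins it.
        System.out.println("B joins A: " + chooseCoordinator(b, Collections.singletonList(a)).equals(a));

        // Parallel start: A and B discover only each other, so each picks the other.
        UUID pickedByA = chooseCoordinator(a, Collections.singletonList(b));
        UUID pickedByB = chooseCoordinator(b, Collections.singletonList(a));
        System.out.println("Mutual JOIN: " + (pickedByA.equals(b) && pickedByB.equals(a)));
    }
}
```

In the parallel case neither node treats itself as coordinator, so each JOIN request goes to a node that cannot serve it, which matches the mutual JOIN and recovery delay described in this report.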