Infinispan / ISPN-6239

InitialClusterSizeTest.testInitialClusterSizeFail random failures



      The test starts 3 nodes concurrently but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within initialClusterTimeout + 1 second.

      However, because of a bug in TEST_PING, the first 2 nodes each see the other as the coordinator and send a JOIN request to each other. It then takes 3 seconds to recover and start the cluster properly.

      The bug in TEST_PING is actually a hack introduced for ISPN-5106. The problem there was that the first node to start (A) would install a view with itself as the single member, but the second node (B) would start immediately, and the discovery request from B would reach A's TEST_PING before it saw the view. B could then choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests remove MERGE3 from the protocol stack, the partitions would never merge and the test would fail with a timeout.
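
      To make the ISPN-5106 race concrete, here is a minimal, self-contained sketch. It is a toy model, not the TEST_PING or JGroups code: the `Response` class, the `chooseCoordinator` method, and the "lowest address wins when nobody claims to be coordinator" rule are assumptions made for illustration, and the UUIDs are made-up strings.

```java
import java.util.List;
import java.util.TreeSet;

public class DiscoveryRaceSketch {
    // Hypothetical stand-in for a discovery response: who answered, and whether
    // the responder already believed it was the coordinator when it answered.
    public static final class Response {
        final String sender;
        final boolean isCoordinator;

        public Response(String sender, boolean isCoordinator) {
            this.sender = sender;
            this.isCoordinator = isCoordinator;
        }
    }

    // Toy election rule (an assumption, not the real protocol code): trust an
    // explicit coordinator if a response names one, otherwise pick the lowest
    // address among this node and all responders.
    public static String chooseCoordinator(String self, List<Response> responses) {
        for (Response r : responses) {
            if (r.isCoordinator) {
                return r.sender;
            }
        }
        TreeSet<String> candidates = new TreeSet<>();
        candidates.add(self);
        for (Response r : responses) {
            candidates.add(r.sender);
        }
        return candidates.first();
    }

    public static void main(String[] args) {
        // Made-up addresses: B's UUID happens to sort before A's.
        String a = "uuid-2";
        String b = "uuid-1";

        // A answers before its TEST_PING has seen the view, so it does not claim to be coordinator.
        String early = chooseCoordinator(b, List.of(new Response(a, false)));
        // A answers after seeing the view and reports itself as coordinator.
        String late = chooseCoordinator(b, List.of(new Response(a, true)));

        System.out.println("response before view: B picks " + early);
        System.out.println("response after view:  B picks " + late);
    }
}
```

      In the first case B picks itself, which corresponds to the 2-partition start described above. Whether that happens depends only on how the two UUIDs happen to sort, which is why the failure was intermittent.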

      I fixed this in TEST_PING by assuming that, when there is a single discovery response, its sender is the coordinator. This worked because all but a few tests start their managers sequentially. However, it sometimes introduces this 3-second delay when nodes start in parallel.
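
      The single-response assumption and its failure mode with parallel starts can be sketched the same way. Again this is a toy model under stated assumptions, not the actual TEST_PING change: `assumedCoordinator` is a hypothetical helper, and the node names are placeholders.

```java
import java.util.List;

public class SingleResponseHeuristicSketch {
    // Toy version of the workaround: with exactly one discovery response, treat
    // its sender as the coordinator. Returns null when the heuristic does not apply.
    public static String assumedCoordinator(List<String> responders) {
        return responders.size() == 1 ? responders.get(0) : null;
    }

    public static void main(String[] args) {
        // Sequential start: A is already running, B hears only from A. The guess is right.
        System.out.println("sequential: B sends JOIN to " + assumedCoordinator(List.of("A")));

        // Parallel start: A and B each hear only from the other, and neither is
        // a coordinator yet, so both JOIN requests go to the wrong place.
        String targetOfA = assumedCoordinator(List.of("B"));
        String targetOfB = assumedCoordinator(List.of("A"));
        boolean mutualJoin = "B".equals(targetOfA) && "A".equals(targetOfB);
        System.out.println("parallel: A and B JOIN each other = " + mutualJoin);
    }
}
```

      The sketch only shows why the two JOIN requests cross. The 3-second recovery time is what was observed in the test and is not modelled here.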

                • Assignee:
                  dan.berindei Dan Berindei


                  • Created: