Infinispan / ISPN-6239

InitialClusterSizeTest.testInitialClusterSizeFail random failures

    Details

      Description

      The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within initialClusterTimeout + 1 second.

      However, because of a bug in TEST_PING, the first 2 nodes each see the other as coordinator and send a JOIN request to each other, and it takes 3 seconds to recover and start the cluster properly.

      The bug in TEST_PING is actually a hack introduced for ISPN-5106. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's TEST_PING before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove MERGE3 from the protocol stack, the partitions would never merge and the test would fail with a timeout.
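
      The original race can be illustrated with a minimal, self-contained sketch. It is not the TEST_PING code: the class UuidOrderElection, the Response type, and the string identifiers standing in for UUIDs are all hypothetical, and the election rule is a simplification that assumes a node follows any responder flagged as coordinator and otherwise picks the lowest identifier among itself and the responders.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class UuidOrderElection {
    /** Hypothetical stand-in for a discovery response (not a JGroups type). */
    static final class Response {
        final String sender;
        final boolean coordinator;

        Response(String sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /**
     * Simplified rule: follow a responder that already claims to be coordinator;
     * otherwise pick the lowest identifier among this node and all responders.
     */
    static String chooseCoordinator(String self, List<Response> responses) {
        for (Response r : responses) {
            if (r.coordinator) {
                return r.sender;
            }
        }
        TreeSet<String> candidates = new TreeSet<>();
        candidates.add(self);
        for (Response r : responses) {
            candidates.add(r.sender);
        }
        return candidates.first();
    }

    public static void main(String[] args) {
        String a = "uuid-2"; // node A, started first (assumed identifier)
        String b = "uuid-1"; // node B, whose identifier sorts before A's

        // A starts alone and becomes its own coordinator.
        String followedByA = chooseCoordinator(a, Collections.emptyList());
        // B's discovery is answered before A is known as coordinator.
        String followedByB = chooseCoordinator(b, Arrays.asList(new Response(a, false)));

        System.out.println("A follows " + followedByA + ", B follows " + followedByB);
        System.out.println("two partitions: " + !followedByA.equals(followedByB));
    }
}
```

      In this sketch each node ends up following itself, which corresponds to the 2-partition start described above. If the response from A had carried the coordinator flag, B would have followed A instead.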

      I fixed this in TEST_PING by assuming that, when there is a single discovery response, its sender is the coordinator. This worked because all but a few tests start their managers sequentially; however, it sometimes introduces this 3-second delay when nodes start in parallel.
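
      The effect of the single-response assumption can be sketched the same way. Again, this is an illustration under assumptions rather than the actual TEST_PING change: SingleResponseHeuristic, Response, and the node names are hypothetical, and the fallback to identifier ordering is a simplification.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class SingleResponseHeuristic {
    /** Hypothetical stand-in for a discovery response (not a JGroups type). */
    static final class Response {
        final String sender;
        final boolean coordinator;

        Response(String sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    static String chooseCoordinator(String self, List<Response> responses) {
        for (Response r : responses) {
            if (r.coordinator) {
                return r.sender;
            }
        }
        // The workaround: a lone responder is assumed to be the coordinator.
        if (responses.size() == 1) {
            return responses.get(0).sender;
        }
        // Simplified fallback: lowest identifier among self and responders.
        TreeSet<String> candidates = new TreeSet<>();
        candidates.add(self);
        for (Response r : responses) {
            candidates.add(r.sender);
        }
        return candidates.first();
    }

    public static void main(String[] args) {
        // Sequential start: A is alone, then B discovers only A.
        String seqA = chooseCoordinator("A", Collections.emptyList());
        String seqB = chooseCoordinator("B", Arrays.asList(new Response("A", false)));
        System.out.println("sequential: A follows " + seqA + ", B follows " + seqB);

        // Parallel start: A and B discover each other before either has a view.
        String parA = chooseCoordinator("A", Arrays.asList(new Response("B", false)));
        String parB = chooseCoordinator("B", Arrays.asList(new Response("A", false)));
        System.out.println("parallel: A follows " + parA + ", B follows " + parB);
    }
}
```

      With a sequential start both nodes agree on A. With a parallel start each node selects the other, which matches the mutual JOIN requests described in this issue.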

                People

                • Assignee: Dan Berindei (dan.berindei)
                • Reporter: Dan Berindei (dan.berindei)