  1. Infinispan
  2. ISPN-6239

InitialClusterSizeTest.testInitialClusterSizeFail random failures

Details

    Description

      The test starts 3 nodes concurrently but configures Infinispan to wait for a cluster of 4 nodes, and it expects the nodes to fail to start within initialClusterTimeout + 1 second.
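
      The expectation can be illustrated with a small stand-alone sketch. It does not use Infinispan; the class InitialClusterSizeSketch and the method awaitInitialCluster are hypothetical names, and a CountDownLatch stands in for the "wait until the expected number of members has joined" step.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the "wait for the initial cluster" step; not Infinispan code.
final class InitialClusterSizeSketch {

    /** Returns true only if expectedSize members joined before the timeout elapsed. */
    static boolean awaitInitialCluster(int expectedSize, int joiningNodes, long timeoutMillis) {
        CountDownLatch missingMembers = new CountDownLatch(expectedSize);
        for (int i = 0; i < joiningNodes; i++) {
            // Each simulated node joins on its own thread, like the concurrent startup in the test.
            new Thread(missingMembers::countDown).start();
        }
        try {
            return missingMembers.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean formed = awaitInitialCluster(4, 3, 200); // 3 nodes start, 4 are expected
        long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("cluster formed: " + formed + ", waited about " + waitedMillis + " ms");
    }
}
```

      With 3 joiners and an expected size of 4 the wait can only end by timing out, which is why any extra startup delay eats directly into the 1 second of slack the test allows.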

      However, because of a bug in TEST_PING, the first 2 nodes each see the other as coordinator and send a JOIN request to each other, and it takes 3 seconds for them to recover and start the cluster properly.

      The bug in TEST_PING is actually a hack introduced for ISPN-5106. The problem was that the first node to start (A) would install a view with itself as the only member, but the second node (B) would start immediately afterwards, and B's discovery request would reach A's TEST_PING before it had seen the view. B could then choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests remove MERGE3 from the protocol stack, the partitions would never merge and the test would fail with a timeout.
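
      A minimal sketch of the original ISPN-5106 race, under stated assumptions: java.util.UUID stands in for a node address, the lowest UUID wins when no response marks a coordinator, and the names CoordinatorElectionSketch, Response and chooseCoordinator are hypothetical. This is not the JGroups or TEST_PING implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.UUID;

// Hypothetical model of coordinator selection; java.util.UUID stands in for a node address.
final class CoordinatorElectionSketch {

    /** One discovery response: who answered, and whether it reported itself as coordinator. */
    static final class Response {
        final UUID sender;
        final boolean coordinator;

        Response(UUID sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /** Picks the node to join: an announced coordinator, else the lowest known UUID (assumed rule). */
    static UUID chooseCoordinator(UUID self, List<Response> responses) {
        TreeSet<UUID> candidates = new TreeSet<>();
        candidates.add(self);
        for (Response r : responses) {
            if (r.coordinator) {
                return r.sender; // an explicit coordinator always wins
            }
            candidates.add(r.sender);
        }
        return candidates.first(); // fall back to UUID ordering
    }

    public static void main(String[] args) {
        UUID a = new UUID(0L, 2L); // started first, already installed the view {A}
        UUID b = new UUID(0L, 1L); // started second, but sorts before A

        // The response was built before the view was seen, so A is not marked as coordinator.
        UUID stale = chooseCoordinator(b, Arrays.asList(new Response(a, false)));
        // The response was built after the view was seen.
        UUID fresh = chooseCoordinator(b, Arrays.asList(new Response(a, true)));

        System.out.println("stale response, B picks itself: " + stale.equals(b));
        System.out.println("fresh response, B picks A: " + fresh.equals(a));
    }
}
```

      In the stale case both A and B end up as coordinators of their own single-node views, and without MERGE3 nothing ever joins the two partitions.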

      I fixed this in TEST_PING by assuming that, when there is a single discovery response, its sender is the coordinator. This worked because all but a few tests start their managers sequentially; however, it sometimes introduces this 3-second delay when nodes start in parallel.
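
      The side effect of the single-response assumption can be sketched as follows. This is a simplified model, not the TEST_PING code: node names are plain strings, and the names SingleResponseHeuristicSketch, joinTarget and mutualJoin are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical model of the "single response means coordinator" workaround.
final class SingleResponseHeuristicSketch {

    /** responses maps each responder to whether it reported itself as coordinator. */
    static String joinTarget(String self, Map<String, Boolean> responses) {
        for (Map.Entry<String, Boolean> e : responses.entrySet()) {
            if (e.getValue()) {
                return e.getKey(); // an explicit coordinator always wins
            }
        }
        if (responses.size() == 1) {
            // The workaround: a lone responder is assumed to be the coordinator.
            return responses.keySet().iterator().next();
        }
        TreeSet<String> candidates = new TreeSet<>(responses.keySet());
        candidates.add(self);
        return candidates.first(); // otherwise fall back to ordering
    }

    /** True when each node would send its JOIN request to the other one. */
    static boolean mutualJoin(String a, String targetOfA, String b, String targetOfB) {
        return targetOfA.equals(b) && targetOfB.equals(a);
    }

    public static void main(String[] args) {
        // Sequential start: A finds nobody and picks itself, then B gets one response from A.
        String seqA = joinTarget("A", Collections.<String, Boolean>emptyMap());
        String seqB = joinTarget("B", Collections.singletonMap("A", false));
        System.out.println("sequential, mutual JOIN: " + mutualJoin("A", seqA, "B", seqB));

        // Parallel start: each node gets exactly one response, from the other node.
        String parA = joinTarget("A", Collections.singletonMap("B", false));
        String parB = joinTarget("B", Collections.singletonMap("A", false));
        System.out.println("parallel, mutual JOIN: " + mutualJoin("A", parA, "B", parB));
    }
}
```

      In the parallel case neither node is a coordinator yet, so neither JOIN request can be served until the nodes give up and retry, which matches the 3 seconds of recovery described above.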

            People

              dberinde@redhat.com Dan Berindei (Inactive)