Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-9588

JGroups fails to install new cluster view after coordinator leave

This issue belongs to an archived project. You can view it, but you can't modify it. Learn more

XMLWordPrintable

    • Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32, DataGrid Sprint #33, DataGrid Sprint #34, DataGrid Sprint #35, DataGrid Sprint #36, DataGrid Sprint #37, DataGrid Sprint #38, DataGrid Sprint #39

      The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.

      Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:

      10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
      10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
      10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
      10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
      10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
      10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
      10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
      10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
      org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
      

      Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:

      10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
      10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
      10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
      10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
      10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
      10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
      10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
      10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
      10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
      10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
      10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
      10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
      consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
      10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
      10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
      10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
      10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
      

              dberinde@redhat.com Dan Berindei (Inactive)
              dberinde@redhat.com Dan Berindei (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare

                Created:
                Updated:
                Resolved:
                Archived: