Infinispan / ISPN-6399

Timeout updating the JGroups view after killing one node



      GMS can sometimes delay the processing of a join/leave request because of JGRP-2028.

      Joiners retry automatically after GMS.join_timeout, so for joins the delay is bounded. Leavers, however, don't resend their leave requests, so for leaves the delay can be much longer.

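      Normally, the FD/FD_ALL/FD_SOCK protocols would wake up the ViewHandler thread. But we remove the FD* protocols from the stack in most of our tests, unless the test uses DISCARD. That means the leave request can be delayed until another node leaves. In the log below, NodeA receives NodeB's LEAVE request at 16:35:56 but only installs the view without NodeB at 16:37:07, right after NodeD's LEAVE request arrives, and the test has already timed out in between: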

      16:35:56,247 TRACE (testng-ClusterListenerDistAddListenerTest:) [GMS] NodeB-8309: sending LEAVE request to NodeA-45395
      16:35:56,268 TRACE (OOB-1,NodeA-45395:) [TCP_NIO2] NodeA-45395: received [dst: NodeA-45395, src: NodeB-8309 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=NodeB-8309, UNICAST3: DATA, seqno=22, TP: [cluster_name=ISPN]
      16:35:56,268 TRACE (OOB-1,NodeA-45395:) [UNICAST3] NodeA-45395: delivering NodeB-8309#22
      
      16:36:07,263 ERROR (testng-ClusterListenerDistAddListenerTest:) [UnitTestTestNGListener] Test testMemberJoinsAndRetrievesClusterListenersButMainListenerNodeDiesBeforeInstalled(org.infinispan.notifications.cachelistener.cluster.ClusterListenerDistAddListenerTest) failed.
      org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views.  Expected 3 members in each view.  Views are as follows: [[NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165], [NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165], [NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165]]
      
      16:37:07,341 TRACE (testng-ClusterListenerDistAddListenerTest:) [GMS] NodeD-55165: sending LEAVE request to NodeA-45395
      16:37:07,361 TRACE (OOB-4,NodeA-45395:) [TCP_NIO2] NodeA-45395: received [dst: NodeA-45395, src: NodeD-55165 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=NodeD-55165, UNICAST3: DATA, seqno=21, TP: [cluster_name=ISPN]
      16:37:07,361 TRACE (OOB-4,NodeA-45395:) [UNICAST3] NodeA-45395: delivering NodeD-55165#21
      16:37:07,361 TRACE (ViewHandler,NodeA-45395:) [GMS] NodeA-45395: joiners=[], suspected=[], leaving=[NodeB-8309], new view: [NodeA-45395|4] (3) [NodeA-45395, NodeC-53222, NodeD-55165]
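
      To put a number on the delay in the log above, here is a minimal sketch that subtracts the two timestamps: the delivery of NodeB's LEAVE_REQ on NodeA and the ViewHandler line that finally installs the new view. The class name LeaveDelay and its helper method are made up for illustration; only the timestamps come from the log.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Infinispan or JGroups.
class LeaveDelay {
    // The log prints times as HH:mm:ss,SSS (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Elapsed time between two log timestamps taken on the same day.
    static Duration between(String from, String to) {
        return Duration.between(LocalTime.parse(from, LOG_TIME),
                                LocalTime.parse(to, LOG_TIME));
    }

    public static void main(String[] args) {
        // NodeB's LEAVE_REQ delivered on NodeA vs. the view without NodeB.
        Duration delay = between("16:35:56,268", "16:37:07,361");
        System.out.println("LEAVE processed after " + delay.toMillis() + " ms");
    }
}
```

      The view change arrives a little over 71 seconds after the request, long after the test failed at 16:36:07.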
      

      FD_ALL is pretty cheap: it just sends a message every second, without opening any new sockets. So I think we should enable it by default, and only enable FD_SOCK with TransportFlags.withFD(true).
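
      A rough sketch of the proposed rule, modelling the stack as a plain list of protocol names. This is not the actual test configuration code: the class FailureDetectionStack and its filter method are hypothetical, and the boolean simply stands in for the value passed to TransportFlags.withFD(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the proposal, not the real Infinispan test code.
class FailureDetectionStack {
    // Keeps FD_ALL unconditionally and keeps FD_SOCK only when withFD is true.
    // All other protocols pass through unchanged, in their original order.
    static List<String> filter(List<String> protocols, boolean withFD) {
        List<String> kept = new ArrayList<>();
        for (String protocol : protocols) {
            if (protocol.equals("FD_SOCK") && !withFD) {
                continue; // socket-based failure detection is opt-in
            }
            kept.add(protocol);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative, abbreviated stack; a real stack has more protocols.
        List<String> stack =
                Arrays.asList("TCP_NIO2", "FD_SOCK", "FD_ALL", "UNICAST3", "GMS");
        System.out.println("default:      " + filter(stack, false));
        System.out.println("withFD(true): " + filter(stack, true));
    }
}
```

      With this rule every test stack still has FD_ALL, so under the reasoning above the ViewHandler thread would be woken up even when no other node leaves.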

              dberinde@redhat.com Dan Berindei (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
