Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-2435

ClientGmsImpl ignores newer view during join

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Major
    • 5.0.0.Alpha3, 4.2.1
    • 4.1.9
    • None

    Description

      We have random failures in a test that starts 4 nodes in parallel (org.infinispan.InitialClusterSizeTest.testInitialClusterSize). I get at least one failure if I add @Test(invocationCount=100), but I did not get any failure when I did the same with org.jgroups.tests.ConcurrentStartupTest.testConcurrentJoinWithLOCAL_PING(), maybe because the Infinispan test sends some additional messages and sometimes changes how messages are processed.

      The problem seems to be that after receiving a JOIN_RSP and installing the first view, the GMS implementation is still ClientGmsImpl when receiving the second view, and the view is ignored. Because this second view already has all 4 members, there is no other VIEW message and the test just times out.

      I added some logs in TP and GMS for debugging, and this is what I see:

      10:37:42,896 TRACE (ucast-receiver-2,Test-NodeC:[]) [UDP] Received oob=false internal=true message GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,898 TRACE (mcast-receiver-3,Test-NodeC:[]) [UDP] Received oob=false internal=false message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,906 TRACE (jgroups-4,Test-NodeC:[]) [UDP] Test-NodeC: received [Test-NodeB to Test-NodeC, 61 bytes, flags=INTERNAL], headers are GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,906 TRACE (jgroups-4,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[JOIN_RSP], UNICAST3: DATA, seqno=1, first, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,907 DEBUG (ForkThread-1,InitialClusterSizeTest:[]) [GMS] Test-NodeC: installing view [Test-NodeB|1] (2) [Test-NodeB, Test-NodeC]
      10:37:42,947 TRACE (mcast-receiver-3,Test-NodeC:[]) [UDP] Received oob=false internal=false message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: received message batch of from Test-NodeB: dest=null, sender=Test-NodeB
      1:
      #1: GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: received message batch of from Test-NodeB: dest=null, sender=Test-NodeB
      1:
      #1: GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] Handling message GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=2], TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      ### GmsImpl.handleViewChange() {}
      10:37:42,948 TRACE (jgroups-5,Test-NodeC:[]) [GMS] GmsImpl ignoring view update [Test-NodeB|2] (4) [Test-NodeB, Test-NodeC, Test-NodeA, Test-NodeD]
      10:37:42,950 TRACE (jgroups-5,Test-NodeC:[]) [UDP] Test-NodeC: sending msg to Test-NodeB, src=Test-NodeC, headers are GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=3, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      ### The ack for the JOIN_RSP is only sent here
      10:37:43,034 TRACE (ForkThread-1,InitialClusterSizeTest:[]) [UDP] Test-NodeC: sending msg to Test-NodeB, src=Test-NodeC, headers are GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=4, TP: [cluster=org.infinispan.remoting.transport.InitialClusterSizeTest]
      10:37:43,034 DEBUG (ForkThread-1,InitialClusterSizeTest:[]) [JGroupsTransport] Waiting for 4 nodes, current view has 2
      

      To help debugging, TP.passBatchUp should really log the headers of the messages in the batch, and GMS and the GmsImpl subclasses should log at least a DEBUG message every time they ignore a view (even when log_view_warnings==false). I also suggest removing the default implementations from GmsImpl and updating the ClientGmsImpl javadoc, because it talks about handling ViewChange instead of JoinResponse.

      Attachments

        Activity

          People

            rhn-engineering-bban Bela Ban
            dberinde@redhat.com Dan Berindei (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: