JGroups / JGRP-1303

Multiple Gossip Routers not working with TUNNEL


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 2.12.1
    • Affects Version/s: 2.10
    • Component/s: None

      We are using the TUNNEL protocol in JGroups 2.10 GA. I noticed that sometimes nodes don't rejoin after the GossipRouter (GR) has gone
      down and come back. Here is our scenario:

      1) Two nodes: A, B
      2) Two Gossip Routers: GR1 (port 4575), GR2 (4574) - only GR1 is up
      3) Initially A and B are talking
      4) Bring down GR1
      5) Both A and B become singleton nodes
      6) Bring GR1 back up
      7) Node A keeps getting GR2 in its stubs list:

      2011-03-07 16:30:00,263 WARN  [Timer-2,vivek,manager_172.16.4.29:3010]
      TUNNEL - failed sending a message to all members, GR used
      lt-vivek01.us.packetmotion.com/172.16.4.29:4574
      2011-03-07 16:30:01,103 WARN  [Timer-2,vivek,manager_172.16.4.29:3010]
      TUNNEL - failed sending a message to all members, GR used
      lt-vivek01.us.packetmotion.com/172.16.4.29:4574
      2011-03-07 16:30:01,103 ERROR [Timer-2,vivek,manager_172.16.4.29:3010]
      TUNNEL - failed sending message to null (99 bytes):
      java.lang.Exception: None of the available stubs
      [RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false],
      RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false]]
      accepted a multicast message
      

      Note that, in the output above, node A has GR2 twice in its stubs list (and GR1 not at all). This causes node A to keep trying to send
      messages through the GR that is down, so we never get a new view containing both nodes. I ran this test 5 times: 3 runs failed and 2
      passed (a new view formed once the GR was back up). In the failed runs, nodes A and B remained singletons even after GR1 was back up.
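
      For reference, a node in this kind of setup can be started along the lines of the sketch below (a sketch only - the hostname is a
      placeholder, the real stack is in the attached tunnel.xml, and I'm assuming the usual jgroups.tunnel.gossip_router_hosts system
      property with the host[port] list format):

       import org.jgroups.JChannel;

       public class TunnelNode {
           public static void main(String[] args) throws Exception {
               // Placeholder host; in our setup both GRs run on the same machine.
               // GR1 listens on 4575, GR2 on 4574.
               System.setProperty("jgroups.tunnel.gossip_router_hosts",
                                  "router-host[4575],router-host[4574]");

               JChannel ch = new JChannel("tunnel.xml");   // TUNNEL-based stack (see attachment)
               ch.connect("vivek");                        // cluster name from the log above
               // ... application runs here ...
               ch.close();
           }
       }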

      I'm not sure where this is happening; the only place in TUNNEL where we register a GR is in handleDownEvent(..):

       for (InetSocketAddress gr : gossip_router_hosts) {
           RouterStub stub = stubManager.createAndRegisterStub(gr.getHostName(), gr.getPort(), bind_addr);
           stub.setTcpNoDelay(tcp_nodelay);
       }
      

      I looked at the RouterStubManager code, but I don't see how we would end up with two GR stubs for the same address. It looks like the RouterStub object itself might be getting changed at run-time - I don't see where, but that seems to be the most obvious conclusion from this behavior.
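
      Purely to illustrate that hypothesis (this is a hypothetical sketch, not the actual RouterStub/RouterStubManager code): if stubs are
      de-duplicated via hashCode/equals over mutable connection fields, then mutating a stub at run-time can let a second stub for the same
      router slip past the duplicate check:

       import java.util.HashSet;
       import java.util.Objects;
       import java.util.Set;

       // Hypothetical sketch, not JGroups code: stubs whose equals/hashCode depend on
       // mutable fields, kept in a HashSet for de-duplication.
       public class DuplicateStubSketch {

           static final class Stub {
               String host;
               int port;
               Stub(String host, int port) { this.host = host; this.port = port; }
               @Override public boolean equals(Object o) {
                   return o instanceof Stub && ((Stub) o).port == port
                       && Objects.equals(((Stub) o).host, host);
               }
               @Override public int hashCode() { return Objects.hash(host, port); }
               @Override public String toString() { return host + ":" + port; }
           }

           public static void main(String[] args) {
               Set<Stub> stubs = new HashSet<>();

               Stub gr1 = new Stub("router-host", 4575);   // registered for GR1
               stubs.add(gr1);

               // If something mutates the stub at run-time so it now points at GR2 ...
               gr1.port = 4574;

               // ... a later registration of GR2 is not detected as a duplicate, because
               // the mutated entry still sits in the HashSet bucket of its old hash code.
               stubs.add(new Stub("router-host", 4574));

               System.out.println(stubs);   // prints two stubs, both router-host:4574
           }
       }

      If something along these lines happens in the real stub manager, it would explain both the doubled GR2 entry and the missing GR1 entry
      in the stub list above.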

      It looks like this will cause failover not to work in a clustered GR setup. I would expect that if one GR goes down, all communication
      starts flowing through the second one, but for some reason that is not happening in certain scenarios. This becomes even more critical
      with the TUNNEL protocol, since all traffic has to pass through the GRs.
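
      The behavior I would expect is roughly the following (a sketch of the expected failover logic, not the actual TUNNEL/RouterStub
      implementation; the Stub interface below is made up for illustration):

       import java.util.List;

       // Sketch of the expected failover: a multicast should succeed as long as at
       // least one of the registered gossip routers is reachable.
       final class FailoverSketch {

           interface Stub {                           // hypothetical stand-in for RouterStub
               boolean isConnected();
               void sendToAllMembers(byte[] data) throws Exception;
           }

           static void multicast(List<Stub> stubs, byte[] data) throws Exception {
               Exception last = null;
               for (Stub stub : stubs) {
                   if (!stub.isConnected())
                       continue;                      // skip routers that are currently down
                   try {
                       stub.sendToAllMembers(data);   // first reachable router wins
                       return;
                   }
                   catch (Exception e) {
                       last = e;                      // remember the failure, try the next stub
                   }
               }
               throw new Exception("none of the available stubs accepted a multicast message", last);
           }
       }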

      Attachments:
        1. tunnel.xml (2 kB) - Bela Ban

              Assignee: vblagoje Vladimir Blagojevic (Inactive)
              Reporter: vivash vivek v (Inactive)
              Votes: 0
              Watchers: 2
