- Bug
- Resolution: Done
- Major
- 2.10
- None
We are using the TUNNEL protocol in JGroups 2.10 GA. I noticed that sometimes nodes don't rejoin after the GossipRouter (GR) has gone down and come back up. Here is our scenario:
1) Two nodes: A, B
2) Two GossipRouters: GR1 (port 4575), GR2 (port 4574); only GR1 is up (a sketch of this TUNNEL setup follows after the log below)
3) Initially A and B are talking
4) Bring down GR1
5) Both A and B become singleton nodes
6) Bring back up GR1
7) Node A keeps getting GR2 in its stubs list:
2011-03-07 16:30:00,263 WARN [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending a message to all members, GR used lt-vivek01.us.packetmotion.com/172.16.4.29:4574
2011-03-07 16:30:01,103 WARN [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending a message to all members, GR used lt-vivek01.us.packetmotion.com/172.16.4.29:4574
2011-03-07 16:30:01,103 ERROR [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending message to null (99 bytes): java.lang.Exception: None of the available stubs [RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false], RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false]] accepted a multicast message
Note that in the log above node A has GR2 twice in its stubs list. This causes node A to keep trying to send messages through the down GR, so we never get a new view with both nodes in it. I ran this test 5 times: 3 runs failed and 2 passed (a new view arrived after the GR was back up). In the failed cases nodes A and B remained singletons even after GR1 was back up.
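For reference, both nodes run a TUNNEL stack that lists the two routers. The sketch below is only an illustration of that setup: the gossip_router_hosts value uses the hostname and ports from the log above, but the file name tunnel.xml, the receiver, the cluster name and everything else are assumptions, not our exact configuration.

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    /*
     * Sketch of a node from steps 1-3, assuming a stack file "tunnel.xml" whose
     * TUNNEL element lists both routers, e.g.:
     *
     *   <TUNNEL gossip_router_hosts="lt-vivek01.us.packetmotion.com[4575],lt-vivek01.us.packetmotion.com[4574]"/>
     */
    public class TwoRouterTunnelNode {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel("tunnel.xml");
            ch.setReceiver(new ReceiverAdapter() {
                public void viewAccepted(View new_view) {
                    // in the failing runs this never reports a 2-node view after GR1 comes back
                    System.out.println("new view: " + new_view);
                }
            });
            ch.connect("vivek"); // cluster name as it appears in the log thread names
            Thread.sleep(Long.MAX_VALUE);
        }
    }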
I'm not sure where this is happening; the only place in TUNNEL where we register a GR is in handleDownEvent(..):
for (InetSocketAddress gr : gossip_router_hosts) {
    RouterStub stub = stubManager.createAndRegisterStub(gr.getHostName(), gr.getPort(), bind_addr);
    stub.setTcpNoDelay(tcp_nodelay);
}
I looked at the RouterStubManager code, but I don't see how we would get two GR stubs for the same address. It looks like the RouterStub object itself might be getting changed at run time; I don't see where, but that seems to be the most obvious conclusion from this behavior.
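If createAndRegisterStub(..) does not check whether a stub for the same host:port is already registered, then running the registration loop above more than once (for example, once per connect attempt) would produce exactly the kind of duplicate entry the log shows. A minimal sketch of such a guard, purely illustrative and not the real RouterStubManager:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch, not the actual JGroups RouterStubManager: a registry
    // that refuses to add a second stub for the same GossipRouter host:port.
    public class DedupStubManager {

        // Minimal stand-in for RouterStub, reduced to the fields that define identity.
        public static class Stub {
            final String host;
            final int port;
            Stub(String host, int port) { this.host = host; this.port = port; }
            public String toString() { return host + ":" + port; }
        }

        private final List<Stub> stubs = new ArrayList<Stub>();

        // Returns the existing stub for host:port if there is one, otherwise registers a new one.
        public synchronized Stub createAndRegisterStub(String host, int port) {
            for (Stub s : stubs)
                if (s.host.equals(host) && s.port == port)
                    return s; // already registered: no duplicate entry
            Stub stub = new Stub(host, port);
            stubs.add(stub);
            return stub;
        }

        public synchronized List<Stub> getStubs() { return new ArrayList<Stub>(stubs); }

        public static void main(String[] args) {
            DedupStubManager mgr = new DedupStubManager();
            // Simulate the registration loop running twice with the two routers from the scenario.
            for (int run = 0; run < 2; run++) {
                mgr.createAndRegisterStub("lt-vivek01.us.packetmotion.com", 4575);
                mgr.createAndRegisterStub("lt-vivek01.us.packetmotion.com", 4574);
            }
            System.out.println(mgr.getStubs()); // two entries (4575 and 4574), not four
        }
    }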
It looks like this will cause failover not to work in a clustered GR setup. I would expect that if one GR goes down, all communication starts flowing through the second one, but for some reason this is not happening in certain scenarios. This becomes even more critical with the TUNNEL protocol, since all traffic needs to pass through the GRs.
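To make the expectation concrete: what I would expect from a clustered-GR setup is that a send walks the registered stubs and uses the first router that accepts the message. The sketch below is only an illustration of that behavior; the Stub interface and method names are hypothetical, not the actual TUNNEL/RouterStub API.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative failover sketch: try each registered GossipRouter stub in turn
    // and stop at the first one that accepts the message.
    public class FailoverSketch {

        interface Stub {
            boolean isConnected();
            void sendToAllMembers(byte[] data) throws Exception;
        }

        // Returns true if at least one router accepted the message.
        static boolean sendViaAnyRouter(List<Stub> stubs, byte[] data) {
            for (Stub stub : stubs) {
                if (!stub.isConnected())
                    continue; // skip routers that are currently down (e.g. GR2)
                try {
                    stub.sendToAllMembers(data);
                    return true; // this GR accepted the message: failover worked
                } catch (Exception e) {
                    // this router failed mid-send; fall through and try the next one
                }
            }
            return false; // mirrors "None of the available stubs ... accepted a multicast message"
        }

        public static void main(String[] args) {
            Stub downGR2 = new Stub() {  // GR2 on port 4574: never came up
                public boolean isConnected() { return false; }
                public void sendToAllMembers(byte[] data) { }
            };
            Stub upGR1 = new Stub() {    // GR1 on port 4575: back up again
                public boolean isConnected() { return true; }
                public void sendToAllMembers(byte[] data) { System.out.println("sent via GR1"); }
            };
            // With one stub per router, the message should go through GR1.
            System.out.println(sendViaAnyRouter(Arrays.asList(downGR2, upGR1), new byte[0]));
        }
    }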