- Bug
- Resolution: Done
- Major
- 2.10
- None
We are using the TUNNEL protocol in JGroups 2.10 GA. I noticed that sometimes nodes don't rejoin after the GossipRouter (GR) has gone down and come back up. Here is our scenario:
1) Two nodes: A, B
2) Two GossipRouters: GR1 (port 4575), GR2 (port 4574); only GR1 is up (a sketch of this TUNNEL setup follows after the log below)
3) Initially A and B are talking
4) Bring down GR1
5) Both A and B become singleton nodes
6) Bring back up GR1
7) Node A keeps getting GR2 in its stubs list:
2011-03-07 16:30:00,263 WARN [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending a message to all members, GR used lt-vivek01.us.packetmotion.com/172.16.4.29:4574
2011-03-07 16:30:01,103 WARN [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending a message to all members, GR used lt-vivek01.us.packetmotion.com/172.16.4.29:4574
2011-03-07 16:30:01,103 ERROR [Timer-2,vivek,manager_172.16.4.29:3010] TUNNEL - failed sending message to null (99 bytes): java.lang.Exception: None of the available stubs [RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false], RouterStub[localsocket=0.0.0.0/0.0.0.0:55732,router_host=lt-vivek01.us.packetmotion.com::4574,connected=false]] accepted a multicast message
Note that in the log above node A has GR2 twice in its stubs list. This causes node A to keep trying to send messages through the down GR, so we never get a new view with both nodes in it. I ran this test 5 times: 3 runs failed and 2 passed (a new view arrived after the GR was back up). In the failed cases nodes A and B remained singletons even after GR1 was back up.
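For reference, both nodes run a TUNNEL stack that lists the two routers. The sketch below is only an illustration of that setup: the gossip_router_hosts value uses the hostname and ports from the log above, but the file name tunnel.xml, the receiver, the cluster name and everything else are assumptions, not our exact configuration.

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    /*
     * Sketch of a node from steps 1-3, assuming a stack file "tunnel.xml" whose
     * TUNNEL element lists both routers, e.g.:
     *
     *   <TUNNEL gossip_router_hosts="lt-vivek01.us.packetmotion.com[4575],lt-vivek01.us.packetmotion.com[4574]"/>
     */
    public class TwoRouterTunnelNode {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel("tunnel.xml");
            ch.setReceiver(new ReceiverAdapter() {
                public void viewAccepted(View new_view) {
                    // in the failing runs this never reports a 2-node view after GR1 comes back
                    System.out.println("new view: " + new_view);
                }
            });
            ch.connect("vivek"); // cluster name as it appears in the log thread names
            Thread.sleep(Long.MAX_VALUE);
        }
    }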
I'm not sure where this is happening; the only place in TUNNEL where we register a GR is in handleDownEvent(..):
for (InetSocketAddress gr : gossip_router_hosts) {
    RouterStub stub = stubManager.createAndRegisterStub(gr.getHostName(), gr.getPort(), bind_addr);
    stub.setTcpNoDelay(tcp_nodelay);
}
I looked at the RouterStubManager code, but I don't see how we would get two GR stubs for the same address. It looks like the RouterStub object itself might be getting changed at run time; I don't see where, but that seems to be the most obvious conclusion from this behavior.
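If createAndRegisterStub(..) does not check whether a stub for the same host:port is already registered, then running the registration loop above more than once (for example, once per connect attempt) would produce exactly the kind of duplicate entry the log shows. A minimal sketch of such a guard, purely illustrative and not the real RouterStubManager:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch, not the actual JGroups RouterStubManager: a registry
    // that refuses to add a second stub for the same GossipRouter host:port.
    public class DedupStubManager {

        // Minimal stand-in for RouterStub, reduced to the fields that define identity.
        public static class Stub {
            final String host;
            final int port;
            Stub(String host, int port) { this.host = host; this.port = port; }
            public String toString() { return host + ":" + port; }
        }

        private final List<Stub> stubs = new ArrayList<Stub>();

        // Returns the existing stub for host:port if there is one, otherwise registers a new one.
        public synchronized Stub createAndRegisterStub(String host, int port) {
            for (Stub s : stubs)
                if (s.host.equals(host) && s.port == port)
                    return s; // already registered: no duplicate entry
            Stub stub = new Stub(host, port);
            stubs.add(stub);
            return stub;
        }

        public synchronized List<Stub> getStubs() { return new ArrayList<Stub>(stubs); }

        public static void main(String[] args) {
            DedupStubManager mgr = new DedupStubManager();
            // Simulate the registration loop running twice with the two routers from the scenario.
            for (int run = 0; run < 2; run++) {
                mgr.createAndRegisterStub("lt-vivek01.us.packetmotion.com", 4575);
                mgr.createAndRegisterStub("lt-vivek01.us.packetmotion.com", 4574);
            }
            System.out.println(mgr.getStubs()); // two entries (4575 and 4574), not four
        }
    }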
It looks like this will cause failover not to work in a clustered GR setup. I would expect that if one GR goes down, all communication starts flowing through the second one, but for some reason this is not happening in certain scenarios. This becomes even more critical with the TUNNEL protocol, since all traffic needs to pass through the GRs.
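To make the expectation concrete: what I would expect from a clustered-GR setup is that a send walks the registered stubs and uses the first router that accepts the message. The sketch below is only an illustration of that behavior; the Stub interface and method names are hypothetical, not the actual TUNNEL/RouterStub API.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative failover sketch: try each registered GossipRouter stub in turn
    // and stop at the first one that accepts the message.
    public class FailoverSketch {

        interface Stub {
            boolean isConnected();
            void sendToAllMembers(byte[] data) throws Exception;
        }

        // Returns true if at least one router accepted the message.
        static boolean sendViaAnyRouter(List<Stub> stubs, byte[] data) {
            for (Stub stub : stubs) {
                if (!stub.isConnected())
                    continue; // skip routers that are currently down (e.g. GR2)
                try {
                    stub.sendToAllMembers(data);
                    return true; // this GR accepted the message: failover worked
                } catch (Exception e) {
                    // this router failed mid-send; fall through and try the next one
                }
            }
            return false; // mirrors "None of the available stubs ... accepted a multicast message"
        }

        public static void main(String[] args) {
            Stub downGR2 = new Stub() {  // GR2 on port 4574: never came up
                public boolean isConnected() { return false; }
                public void sendToAllMembers(byte[] data) { }
            };
            Stub upGR1 = new Stub() {    // GR1 on port 4575: back up again
                public boolean isConnected() { return true; }
                public void sendToAllMembers(byte[] data) { System.out.println("sent via GR1"); }
            };
            // With one stub per router, the message should go through GR1.
            System.out.println(sendViaAnyRouter(Arrays.asList(downGR2, upGR1), new byte[0]));
        }
    }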