• Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • 2.12
    • 2.11

      I run the chat demo app that was shipped with an older version of JGroups. Using the TCP transport with TCPGOSSIP for discovery, I start up two instances of the chat application. I then restart the gossip server and also start another instance of the chat application. The third instance of the chat application receives a view update (MembershipListener.viewAccepted), but the logical name of one of the two previous instances of the chat client is incorrect. I have detailed the results in: http://old.nabble.com/TCPGossip-Discovery-Issue-td30227966.html

      I will attach the test client to this bug report.
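
      For reference, a minimal sketch of roughly what such a test client does (the attached JGroupsTest.zip is the real thing; the stack file name and cluster name below are placeholders). It connects over a TCP/TCPGOSSIP stack and prints every view it receives, so the members' logical names can be checked after the gossip server restart:

      import org.jgroups.JChannel;
      import org.jgroups.ReceiverAdapter;
      import org.jgroups.View;

      public class ChatViewTest {
          public static void main(String[] args) throws Exception {
              // placeholder stack file using TCP as transport and TCPGOSSIP for discovery
              JChannel ch=new JChannel("tcp-gossip.xml");
              ch.setReceiver(new ReceiverAdapter() {
                  public void viewAccepted(View view) {
                      // a view's toString() includes each member's logical name, or the raw
                      // address if the name is not yet known locally
                      System.out.println("** view: " + view);
                  }
              });
              ch.connect("ChatCluster");
              System.in.read(); // keep the client running until Enter is pressed
              ch.close();
          }
      }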

        1. JGroupsTest.zip
          5.00 MB
        2. JGroupsTest.zip
          4.65 MB
        3. JGroupsTest.zip
          2.40 MB
        4. JGroupsTest.zip
          2.39 MB
        5. JGRP-1252-output.txt
          9 kB
        6. JGRP-1252-output2.txt
          8 kB

            [JGRP-1252] TCP Gossip Discovery Issue

            Grahame Rogers (Inactive) added a comment -

            provided with several viable workarounds

            Grahame Rogers (Inactive) added a comment -

            OK, will look at this. Thanks for all the assistance; I now have many different solutions to this scenario, so I am closing this incident.

            Bela Ban added a comment -

            Yes, that makes sense, as chances of not receiving the logical name are diminished with return_entire_cache=true. Depending on your version, you might also have to set ergonomics=false and/or max_rank=0

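            For anyone landing here later, a hedged sketch of this workaround (the stack file name, initial_hosts value and cluster name are placeholders; return_entire_cache, timeout, ergonomics and max_rank are the settings named in this thread):

            // Hypothetical stack file "tcp-gossip.xml"; the relevant part of its TCPGOSSIP element:
            //   <TCPGOSSIP initial_hosts="localhost[12001]"
            //              return_entire_cache="true"
            //              timeout="10000"/>
            // Depending on the JGroups version, ergonomics="false" and/or max_rank="0" may
            // also be needed on the discovery protocol, as noted in the comment above.
            JChannel ch=new JChannel("tcp-gossip.xml");
            ch.connect("ChatCluster");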

            Grahame Rogers (Inactive) added a comment -

            Question: whilst trying out the suggested solutions, I accidentally came across a setting which appears to fix the problem. In TCPGOSSIP I set return_entire_cache=true and timeout=10000, and I was unable to reproduce the problem. Does this sound plausible? The only thing I occasionally notice is that a client starts up with a view containing only itself, then a merge quickly occurs to fix it. Usually this happens on the second client to connect, and any subsequent clients tend to be fine.

            Bela Ban added a comment -

            OK. Can you close this issue if the programmatic solution fixes your problem?

            Note that you could always use something other than TCPGOSSIP, e.g. a FILE_PING over an NFS-mounted (or otherwise shared) volume would not get you into this situation...

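            A hedged sketch of the FILE_PING alternative (the stack file name and the shared directory are placeholders; FILE_PING simply takes TCPGOSSIP's place in the discovery position of the stack):

            // Hypothetical stack file "tcp-file-ping.xml"; discovery data is written to a
            // shared directory instead of being held by a GossipRouter, e.g.
            //   <FILE_PING location="/mnt/shared/jgroups"/>
            // Since the files outlive any single process, there is no equivalent of the
            // empty post-restart GossipRouter cache described in this issue.
            JChannel ch=new JChannel("tcp-file-ping.xml");
            ch.connect("ChatCluster");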

            Grahame Rogers (Inactive) added a comment -

            I rarely get a merge in this scenario, just a list containing the right number of views, generally with one member with the duff name, for exactly the reason you describe. I'm not sure I can really control delaying client 4 logging in. To be honest, I think the scenario could occur with the existing nodes logging back in after a GR restart, as this is a timing issue - I may be wrong. I think I will try out the programmatic solution that you describe; this sounds like a pragmatic solution to what is just an edge case.

            Bela Ban added a comment -

            So, first of all, your view is correct (possibly after merging), even if it's duff, isn't it?

            No, your config is correct. It's just the GossipRouter which maintains a table of members and their associated information (e.g. IP address, logical name etc).

            When GossipRouter is restarted, its table is empty until the existing members re-register.

            If the existing members 1-3 don't re-register before the new client is started, then you'll have a duff view, plus the cluster won't form correctly, until a merge happens.

            If the existing members re-register before the new client is started, then everything will be fine.

            You could reduce TCPGOSSIP.reconnect_interval (default is 10 secs), but then there's a lot of activity going on when the GossipRouter is down.

            In any case, if you wait for more than 10 seconds after restarting GR, but before starting client 4, the existing members should have re-registered and GR should have a fully populated table, so client 4 gets a correct view.

            Note that you can trigger the fetching of this information by calling

            probe.sh op=TCPGOSSIP.findInitialMembersAsString

            Alternatively, everyone could do this programmatically:
            JChannel ch=...; // the application's already connected channel
            Discovery discovery_prot=(Discovery)ch.getProtocolStack().findProtocol(TCPGOSSIP.class);
            discovery_prot.findInitialMembersAsString(); // re-fetches addresses and logical names from the GossipRouter


            Grahame Rogers (Inactive) added a comment -

            Now this is getting very interesting; thanks for these last comments. I reverted back to my original manual test via a chat Swing GUI and was able to reproduce in 2.1.12. However, I added a line of code within the message-received method where I now call channel.getView() and print this out. You are quite correct: I am seeing the scenario where client 4 starts up and receives the duff view. I then send a message on any of the clients, and when this is received by client 4, channel.getView() now returns the correct view data.

            The reason I discovered this issue in the first place is that we are initially looking at adding JGroups in a way where a GUI shows which nodes are up/down and therefore does not send/receive any application messages.

            Is it possible that my config is just not correct, and that by configuring stability/FD this problem could be fixed by config? If not, perhaps I could introduce some pragmatic handling where, in the event of suspecting a duff view, I broadcast a message in order to fix it. What do you think?

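            A hedged sketch of that pragmatic handling, combined with the programmatic fetch shown above in Bela's comment (the stack file and cluster name are placeholders, and running the fetch on its own thread is an assumption, to keep blocking work out of the viewAccepted() callback):

            import org.jgroups.JChannel;
            import org.jgroups.ReceiverAdapter;
            import org.jgroups.View;
            import org.jgroups.protocols.Discovery;
            import org.jgroups.protocols.TCPGOSSIP;

            public class DuffViewWorkaround {
                public static void main(String[] args) throws Exception {
                    final JChannel ch=new JChannel("tcp-gossip.xml"); // placeholder stack file
                    ch.setReceiver(new ReceiverAdapter() {
                        public void viewAccepted(View view) {
                            // The membership itself is already correct here; what may be stale
                            // after a GossipRouter restart are some members' logical names.
                            // Re-run discovery off the callback thread to pull the repopulated cache.
                            new Thread(new Runnable() {
                                public void run() {
                                    Discovery disc=(Discovery)ch.getProtocolStack().findProtocol(TCPGOSSIP.class);
                                    disc.findInitialMembersAsString();
                                }
                            }).start();
                        }
                    });
                    ch.connect("ChatCluster");
                    System.in.read(); // keep the client running
                    ch.close();
                }
            }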

            Bela Ban added a comment -

            OK. Note that if you send a message (e.g. by drawing something in the canvas of Draw, or sending a chat message), and the physical address of the target(s) is not known, then JGroups will fetch that information from the GossipRouter. So if you see that the membership is correct, but some members don't have logical names, you can trigger this fetch...

            In a normal app, this would happen when a message is sent. JGroups does send messages regularly, e.g. stability or failure detection messages.

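            As a hedged illustration, a GUI-only node that never sends application traffic could broadcast a small throwaway message itself (the payload here is made up; receivers would simply ignore it). ch is the connected JChannel:

            // Broadcasting any message makes JGroups resolve members whose physical
            // address / logical name is still unknown, as described above.
            try {
                ch.send(new Message(null, null, "noop")); // org.jgroups.Message
            }
            catch(Exception e) {
                e.printStackTrace();
            }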

            Grahame Rogers (Inactive) added a comment -

            Hi, let me re-test again on 2.12. I increased my wait previously from 20 seconds to 60 seconds, but this was on a much earlier version.

              Assignee: Vladimir Blagojevic (Inactive)
              Reporter: Grahame Rogers (Inactive)
              Votes: 0
              Watchers: 3
