The good news is that my testing is currently avoiding JGRP-1451-type issues. (I'm running with the latest master, plus my pull request 54.)
The bad news is that this seems to have cleared the way for me to find the next problem...
I'm running the usual stress test where I kill and restart members, and verify that the group heals itself. I've managed to get into a situation where:
- A, B, and C all have no view at all (they're all repeatedly sending JOINs that time out)
- D has got stuck with a view in which every member except D is in fact a dead instance
So what's happening on each of A, B and C is:
- perform discovery
- decide based on information from D that the long-dead B is coordinator
- send a JOIN to that dead B
- this times out
Meanwhile D's FD is repeatedly broadcasting that A is suspect, but no-one pays any attention.
In an ideal world, I'd think that it ought to be up to D to spot that something has gone wrong. For example, after a long enough period of reporting that A is suspect without seeing any change of view, it could deduce that there's a problem and become a singleton, or something like that. Then a merge should sort everything out in due course.
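To illustrate the kind of check D could make, here is a rough sketch in plain Java. It is not JGroups code: the class `StuckViewDetector`, its methods and the `maxStuckMillis` threshold are all hypothetical names invented for this example, and it assumes something else feeds in the SUSPECT broadcasts and view changes.

```java
/** Hypothetical sketch only; none of these names exist in JGroups. */
final class StuckViewDetector {
    private final long maxStuckMillis;   // assumed tunable: how long to tolerate an unchanged view
    private long firstSuspectAt = -1;    // -1 means no suspicion is outstanding

    StuckViewDetector(long maxStuckMillis) {
        this.maxStuckMillis = maxStuckMillis;
    }

    /** Call whenever a new view is installed: the suspicion was acted on. */
    void onViewChange() {
        firstSuspectAt = -1;
    }

    /**
     * Call each time a SUSPECT is broadcast. Returns true once suspicions
     * have gone unanswered (no view change) for at least maxStuckMillis,
     * at which point the member could decide to become a singleton.
     */
    boolean onSuspect(long nowMillis) {
        if (firstSuspectAt < 0) {
            firstSuspectAt = nowMillis;
        }
        return nowMillis - firstSuspectAt >= maxStuckMillis;
    }
}
```

The timestamp is passed in rather than read from the clock so that the logic can be exercised without waiting.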
I'm actually experimenting with a workaround in which we only allow JOIN attempts to time out some maximum number of times; if they time out too often, the member becomes a singleton. I.e. I'm making a fix that allows A, B and C to proceed. I'd then again expect a merge to sort everything out. This looks a lot easier to code up, and seems a plausible thing to want to do anyway.
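The shape of that workaround, as a minimal sketch in plain Java, is below. This is not the actual patch and not the JGroups API: `BoundedJoin`, `attemptJoin` and `maxAttempts` are hypothetical names, and `attemptJoin` simply stands in for "run discovery, send a JOIN to whoever is reported as coordinator, and wait for the reply".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; none of these names exist in JGroups. */
final class BoundedJoin {
    enum Outcome { JOINED, SINGLETON }

    /**
     * Tries to join at most maxAttempts times.
     * attemptJoin returns true if the JOIN was answered, false if it timed out.
     */
    static Outcome join(BooleanSupplier attemptJoin, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptJoin.getAsBoolean()) {
                return Outcome.JOINED;
            }
        }
        // Every attempt timed out: stop retrying and become a singleton,
        // leaving it to a later merge to reunite the group.
        return Outcome.SINGLETON;
    }
}
```

With a cap like this, A, B and C would each stop sending JOINs to the dead B after `maxAttempts` timeouts instead of looping forever.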
I have the test running and will see how this goes overnight. If it looks like it works, I'll submit a pull request; otherwise I'll think again.