I reproduced this with 3 members, but it may also be possible with only 2. FD and VERIFY_SUSPECT must be present in the configuration. The steps to reproduce are the following:
- jgroups with 3 members online
- disconnect a member (not the coord)
- wait until the disconnected member suspects the other two (FD will generate a suspect event for both), then reconnect it before it changes its view (that is, before VERIFY_SUSPECT confirms the suspicion).
Once both suspicions have occurred, FD has stopped its monitor task (since it has no pingable members left). When the unsuspect event is generated, FD does not restart its monitor task. As a consequence, if the other members have removed this member from their view, this member will not be shunned (assuming shun=true in FD), since FD is not sending heartbeat requests. This member's FD will also be unable to detect any failure, since its monitor task is stopped (I think it will be restarted only if something triggers a VIEW_CHANGE).
I tried changing the unsuspect method in FD to update pingable_members and ping_dest and to restart the monitor task (similar to the handling of a VIEW_CHANGE event), and that seemed to correct the problem.
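To illustrate the idea, here is a minimal sketch of that change using a toy model, not the actual FD source. The class FdSketch, its fields and the refresh() helper are hypothetical names made up for this example; members are plain strings, and the "monitor task" is reduced to a boolean flag.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a failure detector; NOT the real JGroups FD class.
// All names here are hypothetical and chosen only for illustration.
class FdSketch {
    private final String localAddr;
    private final List<String> members;
    private final List<String> pingableMembers = new ArrayList<>();
    private final List<String> suspected = new ArrayList<>();
    private String pingDest;
    private boolean monitorRunning; // stands in for the monitor task

    FdSketch(String localAddr, List<String> members) {
        this.localAddr = localAddr;
        this.members = new ArrayList<>(members);
        refresh();
    }

    synchronized void suspect(String mbr) {
        if (!suspected.contains(mbr)) {
            suspected.add(mbr);
        }
        refresh(); // the monitor stops when nobody is left to ping
    }

    // The proposed fix: on unsuspect, recompute the pingable members and
    // ping_dest and restart the monitor, as is done for a view change.
    synchronized void unsuspect(String mbr) {
        suspected.remove(mbr);
        refresh();
    }

    // Recompute pingable members and the ping destination, then start or
    // stop the monitor depending on whether there is anyone to ping.
    private void refresh() {
        pingableMembers.clear();
        pingableMembers.addAll(members);
        pingableMembers.removeAll(suspected);
        pingDest = nextAfterLocal();
        monitorRunning = (pingDest != null);
    }

    // The ping destination is the next pingable member after the local one.
    private String nextAfterLocal() {
        int idx = pingableMembers.indexOf(localAddr);
        if (idx < 0 || pingableMembers.size() < 2) {
            return null;
        }
        return pingableMembers.get((idx + 1) % pingableMembers.size());
    }

    synchronized boolean isMonitorRunning() { return monitorRunning; }
    synchronized String getPingDest() { return pingDest; }
}
```

In this model, suspecting both other members stops the monitor, and a single unsuspect starts it again with a valid ping destination. Without the refresh() call in unsuspect, the monitor would stay stopped, which is the behaviour described above.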
I also noticed that ping_dest is not synchronized in the monitor task. Instead of using a synchronized block (which could become a bottleneck), I think it should be copied to a local variable so that it is thread-safe (this would prevent checking one member and suspecting another because ping_dest changed in between). I did not reproduce this; I just noticed it while looking at the source code.
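A rough sketch of the local-copy idea follows, again as a toy model and not the real monitor code. MonitorSketch, checkOnce and the Predicate standing in for the heartbeat round trip are all hypothetical; the volatile modifier on the field is an assumption made here so that the single read sees the latest value.

```java
import java.util.function.Predicate;

// Toy model of one monitor iteration; NOT the real JGroups monitor task.
// All names here are hypothetical and chosen only for illustration.
class MonitorSketch {
    // Shared field that another thread (e.g. view handling) may change.
    private volatile String pingDest;

    MonitorSketch(String pingDest) {
        this.pingDest = pingDest;
    }

    void setPingDest(String dest) {
        this.pingDest = dest;
    }

    // Runs one check and returns the member to suspect, or null if none.
    // respondsToPing stands in for sending a heartbeat and awaiting the ack.
    String checkOnce(Predicate<String> respondsToPing) {
        final String dest = pingDest; // read the shared field exactly once
        if (dest == null) {
            return null; // nobody to ping
        }
        if (respondsToPing.test(dest)) {
            return null; // member answered, nothing to suspect
        }
        // Suspect the member that was actually pinged, even if the shared
        // field was changed by another thread in the meantime.
        return dest;
    }
}
```

Because the check and the suspicion both use the same local value, a concurrent change of the shared field cannot cause one member to be pinged and a different one to be suspected, and no lock is held while waiting for the heartbeat.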