-
Bug
-
Resolution: Done
-
Major
When multiple members are unresponsive (e.g. hanging) and cannot send back ACKs, the coordinator can block while sending view changes. This is an edge case that almost never occurs, but it needs a fix anyway.
Consider view {A,B,C,D}, where A is the coordinator:
- Members C and D are unresponsive (e.g. out of memory, kernel panic, severed from power, etc.)
- A has a full send-window: it hasn't received ACKs from C and D for a while and has therefore been unable to purge the send-window (ReliableMulticast)
- A gets a SUSPECT(D) event
- A creates new view V1={A,B,C} and sends it
- A blocks in ReliableMulticast on sending V1: although member D was removed from the expected ACKs, C still doesn't send the ACKs that would purge, and thus unblock, A's send-window
- A gets a SUSPECT(C) event: this would create view V2; however, the sole processor thread in GMS is blocked on sending V1, so V2 will not be created until the processing of V1 has completed. That never happens, because V2, which would unblock V1 by removing C from A's send-window, is never sent
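The circular wait described in the steps above can be reproduced with a small stand-alone Java sketch. It does not use any JGroups classes: `SendWindow`, `runScenario` and the single-threaded executor standing in for the GMS processor thread are all hypothetical names made up for illustration, and the 500 ms timeout exists only so the demo terminates (in the scenario above the send would block indefinitely). The second variant, in which C is removed from the window on a separate thread, only shows that breaking the dependency unblocks V1; it is not necessarily how the issue was actually fixed.

```java
import java.util.*;
import java.util.concurrent.*;

public class SendWindowDeadlockSketch {

    /** Toy send-window (hypothetical): stays full until no member has ACKs outstanding. */
    static final class SendWindow {
        private final Set<String> pendingAcks = new HashSet<>();

        SendWindow(Collection<String> unresponsive) { pendingAcks.addAll(unresponsive); }

        /** Stop waiting for ACKs from a member that left the view. */
        synchronized void removeMember(String member) {
            pendingAcks.remove(member);
            notifyAll();
        }

        /** Blocks while the window is full; returns false if the demo timeout expires first. */
        synchronized boolean send(String msg, long timeoutMs) throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (!pendingAcks.isEmpty()) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) return false;
                wait(remainingMs);
            }
            return true;
        }
    }

    /**
     * Returns true if V1 could be sent. With suspectOnProcessorThread == true,
     * SUSPECT(C) is queued behind the blocked V1 send, as in the report;
     * otherwise C is removed from the window on a separate thread.
     */
    public static boolean runScenario(boolean suspectOnProcessorThread) throws Exception {
        SendWindow window = new SendWindow(Arrays.asList("C", "D")); // C and D never ACK
        ExecutorService gmsProcessor = Executors.newSingleThreadExecutor(); // sole processor thread
        try {
            // SUSPECT(D): drop D from the expected ACKs, then send V1={A,B,C}
            Future<Boolean> v1 = gmsProcessor.submit(() -> {
                window.removeMember("D");
                return window.send("V1={A,B,C}", 500);
            });
            // SUSPECT(C): would remove C, which is what V1 is waiting for
            Runnable suspectC = () -> window.removeMember("C");
            if (suspectOnProcessorThread) {
                gmsProcessor.submit(suspectC); // stuck in the queue behind V1
            } else {
                new Thread(suspectC).start();  // handled independently of V1
            }
            return v1.get();
        } finally {
            gmsProcessor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SUSPECT(C) queued behind V1, V1 sent: " + runScenario(true));
        System.out.println("SUSPECT(C) on its own thread, V1 sent: " + runScenario(false));
    }
}
```

In the first run V1 times out because the only task that could purge the window is queued behind it; in the second run V1 goes through as soon as C is dropped from the window.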