I discovered this while debugging a strange issue that happened earlier today in one of our applications. The problem concerns what GroupRequest (and UnicastRequest, which has similar logic) does when the view changes. Here is the relevant code from the viewChange() method (taken from the master branch, but the blame says it hasn't changed for 5+ years):
    if(!(mbr instanceof SiteAddress) && !view.containsMember(mbr)) {
        Rsp<T> rsp=entry.getValue();
        if(rsp.setSuspected()) {
            if(!(rsp.wasReceived() || rsp.wasUnreachable()))
                num_received++;
            changed=true;
        }
    }
This code is supposed to handle the case where a node left the cluster and therefore will not respond. But if I understand how merges work, this logic has a hole in it. The JGroups manual's section on MergeView describes exactly the case that will break.
Suppose you have 2 nodes, A and B. Both nodes currently have view A2={A,B} installed. Then the following sequence of events happens:
- A uses MessageDispatcher to send a request to B.
- B receives the request and starts processing it asynchronously.
- For some reason, B decides that A has died. It generates and installs view B3={B}.
- B finishes processing the request and tries to send the response. But because B believes A is dead, the response gets discarded.
- B notices that A is alive after all. A generates view A4={A,B} and both nodes install it.
At this point, the viewChange() method is called on A, but A went from A2 directly to A4, so from A's point of view B never left the cluster. As a result, rsp.setSuspected() is not called for B. B has already discarded the response and will not re-send it, so the request will never complete.
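To make the gap easier to see, here is a minimal, self-contained sketch of the scenario. It does not use JGroups at all: ViewChangeSketch, PendingRequest and suspectsB are hypothetical names I made up for illustration, members are plain strings, and the only thing modelled is the membership test from the snippet above (a target is marked suspected only when it is missing from the view being installed).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the check in viewChange(); NOT the real JGroups classes.
final class ViewChangeSketch {

    // Stand-in for a pending request that tracks which targets were suspected.
    static final class PendingRequest {
        private final Map<String, Boolean> suspected = new LinkedHashMap<>();

        PendingRequest(List<String> targets) {
            for (String target : targets)
                suspected.put(target, false);
        }

        // Same idea as the quoted code: only members absent from the
        // newly installed view are marked as suspected.
        boolean viewChange(Set<String> newView) {
            boolean changed = false;
            for (Map.Entry<String, Boolean> entry : suspected.entrySet()) {
                if (!newView.contains(entry.getKey()) && !entry.getValue()) {
                    entry.setValue(true);
                    changed = true;
                }
            }
            return changed;
        }

        boolean isSuspected(String member) {
            return suspected.get(member);
        }
    }

    // A has a request to B outstanding and then installs the given views in
    // order. Returns true if B ends up marked as suspected on A's side.
    static boolean suspectsB(List<Set<String>> viewsInstalledOnA) {
        PendingRequest request = new PendingRequest(List.of("B"));
        for (Set<String> view : viewsInstalledOnA)
            request.viewChange(view);
        return request.isSuspected("B");
    }

    public static void main(String[] args) {
        // Scenario from this report: A goes from A2={A,B} straight to A4={A,B}.
        System.out.println(suspectsB(List.of(Set.of("A", "B"))));              // false

        // Contrast: A also sees a view without B before B comes back.
        System.out.println(suspectsB(List.of(Set.of("A"), Set.of("A", "B")))); // true
    }
}
```

In the first case B is never marked suspected, which matches what I am seeing: nothing on A's side ever learns that B dropped the response. The second case is the one the existing check handles.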