Type: Bug
Resolution: Done
Priority: Major
Summary
In a cluster, NAKACK4 using FixedBuffer can end up with permanently full send and receive windows, which blocks the sender after a finite number of multicast messages. This happens when xmit_from_random_member=true, because that setting effectively disables discarding of delivered messages (discard_delivered_msgs=false): delivered messages are retained, the receive windows are never purged, and the sender cannot make forward progress.
Setup protocol stack
Protocol[] protStack = {
    new TCP()
        .setBindPort(bindPort)
        .setBindAddress(bindAddress)
        .setExternalAddr(externalAddress)
        .setValue("enable_suspect_events", true),
    new TCPPING()
        .setPortRange(0)
        .initialHosts(initialHosts),
    new MERGE3(),
    new FD_ALL3(),
    new VERIFY_SUSPECT2(),
    new NAKACK4()
        .capacity(100)
        .setXmitFromRandomMember(true),
    new GMS(),
    new FRAG3()
};
Steps to reproduce:
- Start a cluster with two members (A and B)
- A tries to multicast 400 messages
Observed behavior:
- B blocks after 100 messages
- A blocks indefinitely after 200 messages because it did not receive ACKs for the last 100 messages and its send window is full
- Receive windows (xmit_table entries) are not purged because delivered messages are retained when discard_delivered_msgs=false
- Adding STABLE to the stack does not resolve the issue (NAKACK4 does not rely on STABLE to purge state)
- The same happens in clusters with 3 members
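The 100/200 figures above follow directly from the window capacity of 100. The sketch below is a deliberately simplified toy model, not JGroups code: the class `FixedWindowModel`, its `simulate` method, and the `purgeDelivered` flag are hypothetical names used only for illustration, and retransmission, multiple receivers, and timing are all ignored. It only shows the arithmetic of a fixed-capacity send window on A and a fixed-capacity receive window on B.

```java
public class FixedWindowModel {

    /**
     * Toy model of one sender (A) and one receiver (B), each with a
     * fixed-capacity window. Returns {sentByA, deliveredByB} once neither
     * side can make further progress.
     */
    public static int[] simulate(int capacity, int total, boolean purgeDelivered) {
        int sendLow = 0;    // highest seqno ACKed by B (A's send window starts after it)
        int sent = 0;       // highest seqno A has sent
        int recvLow = 0;    // highest seqno purged from B's receive window
        int delivered = 0;  // highest seqno B has delivered

        boolean progress = true;
        while (progress) {
            progress = false;

            // A sends as long as its send window has room.
            while (sent < total && sent - sendLow < capacity) {
                sent++;
                progress = true;
            }

            // B accepts and delivers messages while its receive window has room.
            while (delivered < sent && delivered - recvLow < capacity) {
                delivered++;
                progress = true;
            }

            // B ACKs everything it delivered, which frees A's send window.
            if (sendLow < delivered) {
                sendLow = delivered;
                progress = true;
            }

            // Only when delivered messages are discarded does B reclaim space.
            if (purgeDelivered && recvLow < delivered) {
                recvLow = delivered;
                progress = true;
            }
        }
        return new int[] {sent, delivered};
    }

    public static void main(String[] args) {
        int[] stuck = simulate(100, 400, false);
        System.out.println("no purge: sent=" + stuck[0] + " delivered=" + stuck[1]);
        int[] ok = simulate(100, 400, true);
        System.out.println("purge:    sent=" + ok[0] + " delivered=" + ok[1]);
    }
}
```

With a capacity of 100 and no purging, the model stalls at 200 sent and 100 delivered, matching the observation; with purging enabled, all 400 messages go through.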
Expected behavior:
Even when xmit_from_random_member=true, NAKACK4 should eventually be able to reclaim buffer space once all members have delivered/ACKed the messages.
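One way to read "once all members have delivered/ACKed the messages" is as a minimum over the per-member ACKs: everything up to the lowest ACKed seqno is safe to reclaim. The sketch below illustrates only that computation. It is an assumption about how such a rule could look, not the NAKACK4 implementation; `PurgeSketch`, `safeToPurge`, and the member names are hypothetical.

```java
import java.util.Map;

public class PurgeSketch {

    /**
     * Returns the highest seqno that every known member has ACKed,
     * or 0 if no members are known. Messages up to and including this
     * seqno no longer need to be kept for retransmission in this model.
     */
    public static long safeToPurge(Map<String, Long> highestAcked) {
        return highestAcked.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
    }

    public static void main(String[] args) {
        // Hypothetical ACK state for a three-member cluster.
        Map<String, Long> acks = Map.of("A", 200L, "B", 100L, "C", 150L);
        System.out.println("safe to purge up to seqno " + safeToPurge(acks));
    }
}
```

In this example the slowest member (B) has ACKed up to 100, so only seqnos 1 to 100 could be reclaimed.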