JGroups / JGRP-2985

NAKACK4 can deadlock/block permanently with FixedBuffer when xmit_from_random_member=true


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 5.5.3

      Summary

      In a cluster, NAKACK4 using FixedBuffer can end up with a permanently full send/receive window and block the sender after a finite number of multicast messages. This happens when xmit_from_random_member=true, because it effectively disables discarding of delivered messages (discard_delivered_msgs=false), preventing the receive windows from being purged and the sender from making forward progress.

      Setup protocol stack

       

      Protocol[] protStack = {
          new TCP()
              .setBindPort(bindPort)
              .setBindAddress(bindAddress)
              .setExternalAddr(externalAddress)
              .setValue("enable_suspect_events", true),

          new TCPPING()
              .setPortRange(0)
              .initialHosts(initialHosts),

          new MERGE3(),

          new FD_ALL3(),

          new VERIFY_SUSPECT2(),

          new NAKACK4()
              .capacity(100)
              .setXmitFromRandomMember(true),

          new GMS(),

          new FRAG3()
      };

      Steps to reproduce:

      1. Start cluster with two members (A and B)
      2. A tries to multicast 400 messages

      Observed behavior:

      • B blocks after 100 messages
      • A blocks indefinitely after 200 messages: it never receives ACKs for the last 100 messages, so its send window stays full
      • Receive windows (xmit_table entries) are not purged because delivered messages are retained when discard_delivered_msgs=false
      • Adding STABLE to the stack does not resolve the issue (NAKACK4 does not rely on STABLE to purge state)
      • The same happens in clusters with 3 members
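
      The 100/200 numbers above follow from the window arithmetic, which the toy model below tries to illustrate. It is a sketch under stated assumptions, not JGroups code: WindowModel and simulate are hypothetical names, there is one sender (A) and one receiver (B), both windows have the same capacity, B ACKs each delivered message immediately, and a message arriving at a full receive buffer is simply not accepted.

```java
// Toy model of the stall; hypothetical names, not part of JGroups.
final class WindowModel {

    /**
     * Simulates A multicasting toSend messages to B.
     * Returns {messages sent by A, messages delivered at B}.
     * If the first value is smaller than toSend, A's send window filled up and A would block.
     */
    static int[] simulate(int capacity, int toSend, boolean purgeDelivered) {
        int sent = 0;       // highest seqno sent by A
        int acked = 0;      // highest seqno ACKed by B, as seen by A
        int low = 0;        // B's buffer can hold seqnos in (low, low + capacity]
        int delivered = 0;  // highest seqno delivered at B

        while (sent < toSend) {
            if (sent - acked >= capacity)
                return new int[] {sent, delivered}; // send window full: A would block here
            sent++;
            if (sent - low > capacity)
                continue; // B's buffer is full: message not accepted, so no ACK is sent
            delivered = sent;
            if (purgeDelivered)
                low = delivered; // delivered messages are discarded, freeing buffer space
            acked = delivered;   // assumption: B ACKs immediately after delivery
        }
        return new int[] {sent, delivered};
    }
}
```

      In this model, simulate(100, 400, false) returns {200, 100}: B stops accepting after 100 messages and A stalls after 200, matching the observed behavior. With purgeDelivered=true, all 400 messages are sent and delivered.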

      Expected behavior:

      Even when xmit_from_random_member=true, NAKACK4 should eventually be able to reclaim buffer space once all members have delivered/ACKed the messages.
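
      One way to read "once all members have delivered/ACKed the messages" is that a message becomes reclaimable when its seqno is at or below the lowest highest-delivered seqno reported by any member. The sketch below only illustrates that rule. PurgePoint and safePurgeSeqno are hypothetical names, the member keys are plain strings for illustration, and this is not a description of how the issue was actually fixed in JGroups.

```java
import java.util.Map;

// Hypothetical helper, not part of JGroups: computes up to which seqno
// messages could be reclaimed once every member has delivered them.
final class PurgePoint {

    /**
     * @param highestDeliveredByMember highest delivered seqno reported by each member
     * @return the highest seqno delivered by all members, or 0 if no member has reported yet
     */
    static long safePurgeSeqno(Map<String, Long> highestDeliveredByMember) {
        if (highestDeliveredByMember.isEmpty())
            return 0; // nothing known yet, so nothing can be reclaimed
        long min = Long.MAX_VALUE;
        for (long seqno : highestDeliveredByMember.values())
            min = Math.min(min, seqno); // the slowest member bounds what is safe to drop
        return min;
    }
}
```

      For example, with reports A=200, B=150 and C=180, only seqnos up to 150 would be safe to reclaim, because C or A might still be asked to retransmit anything B has not yet delivered.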

       

              rhn-engineering-bban Bela Ban
              sipris Denis Priss