JGroups / JGRP-2904

NAKACK4 blocked sender (race condition?)

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 5.5.0, 5.4.9

      I have a service that sends only multicast OOB messages, all from a single thread. Messages are sent without the DONT_LOOPBACK flag.
      Latest JGroups 5.4.8, a more or less standard protocol stack with NAKACK4 and TCP_NIO2.
      NAKACK4 capacity is set to 11k.

      3 instances A, B, C of the service were running for several weeks without any issues.
      A rolling restart was done and appeared to finish without issue; all members updated their views correctly.
      After some hours, the first member, A, stopped sending messages; its send thread was blocked in JChannel.send().

      Looking at the nakack4_num_unacked_messages metric, I can see that it started increasing almost immediately after the restart and reached 11001 before the sender was blocked.
      The nakack4_current_seqno metric stops at 11087, and the nakack4_num_messages_sent metric stops at 11086.
      NAKACK4 log_not_found_msgs was left at its default of true, but there were never any logs about missing messages (or really any logs from JGroups at all).

      When I killed one of the other instances (C), the previously blocked A logged "NAKACK4 - A: removed C from xmit_table (not member anymore)" and was then unblocked.
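
      The observed behaviour (sender blocks once the unacked count reaches capacity, and is released when C leaves the view) is consistent with a fixed-capacity send window that only advances once every current member has acked. The following is a toy model of that idea, not the NAKACK4 implementation: the class and method names are hypothetical, and the rule that the window is bounded by the slowest member's ack is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a fixed-capacity send window. Hypothetical; NOT taken from JGroups. */
class SendWindowModel {
    private final int capacity;
    private long highestSent = 0;                              // last seqno handed out
    private final Map<String, Long> acked = new HashMap<>();   // member -> highest acked seqno

    SendWindowModel(int capacity, List<String> members) {
        this.capacity = capacity;
        for (String m : members) acked.put(m, 0L);
    }

    /** Assumption: messages can only be purged up to the lowest ack of all current members. */
    long lowestAcked() {
        return acked.values().stream().mapToLong(Long::longValue).min().orElse(highestSent);
    }

    long unacked() {
        return highestSent - lowestAcked();
    }

    /** Returns false where a real sender would block: the window is full. */
    boolean trySend() {
        if (unacked() >= capacity) return false;
        highestSent++;
        return true;
    }

    /** Records an ack from a current member; acks from unknown members are ignored. */
    void ack(String member, long seqno) {
        acked.computeIfPresent(member, (k, v) -> Math.max(v, seqno));
    }

    /** View change: a departed member no longer holds the window back. */
    void removeMember(String member) {
        acked.remove(member);
    }

    public static void main(String[] args) {
        SendWindowModel a = new SendWindowModel(3, List.of("A", "B", "C"));
        for (int i = 0; i < 3; i++) a.trySend();   // fill the window
        a.ack("A", 3);                             // loopback ack from the sender itself
        a.ack("B", 3);                             // C never acks
        System.out.println("can send while C is silent: " + a.trySend());
        a.removeMember("C");                       // C is killed and leaves the view
        System.out.println("can send after C left: " + a.trySend());
    }
}
```

      In this model a single member whose acks never arrive (or are never processed) is enough to fill the window, which would match the steadily growing nakack4_num_unacked_messages value. It says nothing about why the acks would be missing in the first place.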

      I suspect that there could be a race condition in NAKACK4 which led to this, and that it may be related to this service sending only OOB messages. It has now happened twice: the first time some weeks ago, and once more just now.

      I have dozens of different services across hundreds of instances using the same JGroups setup and have never seen this, but those other services do not send OOB messages.

      How can I investigate this?
      I have all JGroups metrics available from when this happened.

              Assignee: Bela Ban (rhn-engineering-bban)
              Reporter: Christian Fredriksson (cfredri4)