JGroups / JGRP-2486

FD Monitor gets stuck on TransferQueueBundler


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 4.2.5, 4.0.24
    • Affects Version/s: 4.0.22
    • None
    • Lukas's fix applied to the 4.x branch
    • Package the attached Main.java with JGroups 4.0.23 (I couldn't select this version in Jira) and run the jar on two separate machines. The application takes three program arguments:

      1. IP of this machine
      2. IP of node1
      3. IP of node2

      Stop the network interface on one of the machines (ifconfig eth0 down).
      The remaining node stops sending heartbeats once the bundler queue reaches its limit, and the other node is never removed from the view.
      The reproducer is not 100% reliable but works most of the time.
      I've reproduced the issue on AWS instances running Linux.

      Apparently there is an issue in the FD protocol. When a cluster node is disconnected and the disconnect isn't handled by FD_SOCK, FD stops sending heartbeats after a while. This only happens when the queue of the TransferQueueBundler fills up before the node is suspected.
      The stack trace shows that the FD$Monitor is blocked by the bundler. This is probably the reason why the heartbeat timeouts are not noticed.
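
      The blocking behaviour described above can be illustrated without JGroups. The following is a minimal sketch, not the actual FD or TransferQueueBundler code: it assumes the bundler behaves like a bounded queue whose consumer has stalled (the dead connection), and that the monitor enqueues heartbeats with a blocking call. The class BlockedMonitorDemo, the method runMonitor, and the capacity of 3 are hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockedMonitorDemo {

    /**
     * Runs a heartbeat loop against a bounded queue that nobody drains
     * (a stand-in for a bundler whose sender is stuck on a dead connection).
     * Returns how many loop iterations completed within runMillis.
     */
    public static int runMonitor(boolean blockingEnqueue, long runMillis)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(3); // hypothetical capacity
        AtomicInteger iterations = new AtomicInteger();

        Thread monitor = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    if (blockingEnqueue) {
                        queue.put("heartbeat");   // blocks once the queue is full
                    } else {
                        queue.offer("heartbeat"); // returns false when full, never blocks
                    }
                    // Stands in for the timeout check that follows each heartbeat.
                    iterations.incrementAndGet();
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop the loop on interrupt
            }
        });
        monitor.setDaemon(true);
        monitor.start();

        Thread.sleep(runMillis);
        monitor.interrupt();
        monitor.join(1000);
        return iterations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocking put:       " + runMonitor(true, 300) + " iterations");
        System.out.println("non-blocking offer: " + runMonitor(false, 300) + " iterations");
    }
}
```

      In the blocking variant the loop stops after the queue has been filled, so the step that stands in for the timeout check is never reached again. The non-blocking variant keeps iterating at the cost of dropping heartbeats. Whether the actual fix takes this approach is not stated in this issue.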

        1. Main.java
          7 kB
        2. stack-trace.txt
          1 kB

              Assignee: rhn-engineering-bban Bela Ban
              Reporter: lbrandl lukas brandl (Inactive)
              Votes: 0
              Watchers: 2