JGroups / JGRP-1157

TCP: JGroups threads get stuck and stop communicating

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 2.10
    • 2.8, 2.9
    • None
    • Linux

      We are having a problem where a node gets isolated after an intermittent network outage and is never able to rejoin the group. Bela suspected an issue with RouterStub and fixed a bug (JGRP-1151), but we were able to reproduce the problem even with JGroups 2.9 GA. The problem appears to be that the isolated node becomes unresponsive because all of its JGroups threads hang. Here is how we reproduced the error with 3 nodes (Node A, the coordinator, which also runs the Gossip Router; Node B; Node C):

      1) We added WANem between A and B, so there are random disconnects, high packet loss, and 200 ms of delay.

      2) Due to our WANem settings, B's connectivity with the Gossip Router (GR) drops in and out.

      3) We restart A and it becomes isolated. A becomes a singleton and never rejoins the group. We see NAKACK messages on node C, because A is still able to reach C but not B. C keeps dropping messages from A because A is not in its transmission table.

      4) We turned on tracing on A, but after a while (a couple of hours) we saw no more JGroups trace output on A, so we suspected that some of the JGroups threads had got stuck. We took a thread dump of the Java process on A (attached). As you can see, quite a few JGroups threads are in the waiting state, and all of them are in TCP.send.
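
      For anyone who wants to check for the same symptom without reading a full thread dump by hand, here is a minimal sketch using the standard JDK ThreadMXBean. It lists threads that are not runnable and have a given frame on their stack. The class name StuckSenderFinder and the ".TCP" / "send" filter are assumptions for illustration (taken from the "TCP.send" frames mentioned above), not part of JGroups, and the code has to run inside the affected JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class StuckSenderFinder {

    /**
     * Returns "name [STATE]" for every thread of this JVM that is BLOCKED,
     * WAITING or TIMED_WAITING and has a frame whose class name ends with
     * classSuffix and whose method name equals methodName.
     */
    public static List<String> find(String classSuffix, String methodName) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> stuck = new ArrayList<>();
        // false, false: we only need stack traces, not lock ownership details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // thread died while the dump was being taken
            }
            Thread.State state = info.getThreadState();
            boolean parked = state == Thread.State.BLOCKED
                    || state == Thread.State.WAITING
                    || state == Thread.State.TIMED_WAITING;
            if (!parked) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().endsWith(classSuffix)
                        && frame.getMethodName().equals(methodName)) {
                    stuck.add(info.getThreadName() + " [" + state + "]");
                    break;
                }
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        // Assumed filter: frames like "...TCP.send" as seen in the attached dump
        for (String line : find(".TCP", "send")) {
            System.out.println(line);
        }
    }
}
```

      Running this periodically and logging a non-empty result would show when the threads start to pile up, which a single thread dump taken hours later cannot tell us.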

      We do not understand how or why the JGroups threads would hang. Could outgoing messages be queued up and not moving for some reason?
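
      To make the question concrete, here is a small sketch of the kind of behaviour we have in mind. It is only an illustration of the hypothesis, not JGroups code: BoundedSendQueueDemo, its capacity, and the timeout are all made up. If a bounded send queue is never drained, an unbounded put() would park every sender forever, whereas a timed offer() lets the sender notice the stall and give up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedSendQueueDemo {

    // Hypothetical per-connection send queue; nothing drains it in this demo,
    // which models a peer connection that has stalled.
    private final BlockingQueue<byte[]> queue;

    public BoundedSendQueueDemo(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /**
     * Tries to enqueue a message, waiting at most timeoutMs for space.
     * Returns false if the queue stayed full (or the thread was interrupted)
     * instead of blocking indefinitely the way queue.put(msg) would.
     */
    public boolean send(byte[] msg, long timeoutMs) {
        try {
            return queue.offer(msg, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Number of messages waiting to be written out. */
    public int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedSendQueueDemo conn = new BoundedSendQueueDemo(2);
        System.out.println(conn.send(new byte[1], 50)); // queue has room
        System.out.println(conn.send(new byte[1], 50)); // queue has room
        System.out.println(conn.send(new byte[1], 50)); // full, nobody drains
        System.out.println("pending=" + conn.pending());
    }
}
```

      If something like the blocking variant is happening in the transport, it would match what we see in the dump: many threads waiting in the send path while nothing leaves the node.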

      The only way to recover was to restart all the nodes, which is not desirable.

      Attached are the stack trace (thread-dump) and our protocol stack.

      This issue originated from a discussion at http://sourceforge.net/mailarchive/forum.php?thread_name=4B7BC107.9060304@yahoo.com&forum_name=javagroups-users

        1. threaddump-jg.log
          76 kB
        2. TCPConnection-2.png
          229 kB
        3. TCPConnection-1.png
          236 kB
        4. jgroups_stack.txt
          3 kB

              vblagoje Vladimir Blagojevic (Inactive)
              vivash vivek v (Inactive)