JGroups / JGRP-967

Deadlock in FD_SOCK

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Fix Version/s: 2.4.7
    • Affects Version/s: 2.4.5
    • None

      Due to a problem with IPv6 addresses and ServerSocket connections hanging, a deadlock was revealed in FD_SOCK.

      The deadlock shows up, for example, at the end of the test RpcDispatcherAnycastTest, when the channels are being torn down. What follows are my original emails to Bela:

      Richard:
      I've tracked down the test case failure of RpcDispatcherAnycastTest under IPv6 to a problem with shutting down the protocol stack. The test executes fine, but in the teardown phase, when the test tries to close the three JChannels which have been set up, the first channel closes correctly, but the second channel hangs.

      JChannel tries to disconnect from the group before shutting down by:
      (i) sending a DISCONNECT event and waiting for a DISCONNECT_OK event via a promise
      (ii) sending a STOP_QUEUEING event and waiting for the call to return (i.e. the event has reached the bottom of the stack)

      It then calls ProtocolStack.stopStack() which sends a STOP event down the stack and waits for a STOP_OK event via a promise.
      The STOP event is not making its way correctly down the stack.
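The "send an event down, then wait for the matching _OK via a promise" handshake described above can be modelled roughly as follows. This is only a sketch and not the JGroups implementation: `StopHandshakeSketch`, `Stack` and `sendAndAwait` are hypothetical names, and `CompletableFuture` stands in for the promise.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of the event/ack handshake; none of these types are JGroups classes. */
final class StopHandshakeSketch {

    /** Stand-in for a protocol stack that completes the ack once the event reaches the bottom. */
    interface Stack {
        void down(String event, CompletableFuture<String> ack);
    }

    /**
     * Sends the event down and waits up to timeoutMs for "<event>_OK".
     * Returns true if the ack arrived in time, false otherwise.
     */
    static boolean sendAndAwait(Stack stack, String event, long timeoutMs) {
        CompletableFuture<String> ack = new CompletableFuture<>();
        stack.down(event, ack); // runs on the caller's thread in this model
        try {
            return (event + "_OK").equals(ack.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            return false; // the ack never came back
        } catch (ExecutionException e) {
            return false; // the stack reported a failure instead of an ack
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Note that the timeout only bounds the wait on the promise. If the `down` call itself blocks on the caller's thread, `sendAndAwait` never reaches the wait at all.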

      Here is a trace with IPv4 (I've added some tracing of the STOP event myself):
      [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
      [junit] JChannel.disconnect(): got DISCONNECT_OK
      [junit] JChannel.disconnect(): stopping queue
      [junit] FD_SOCK.down called, event = STOP_QUEUEING
      [junit] JChannel.disconnect(): stopped queue
      [junit] JChannel.disconnect(): stopping stack
      [junit] ProtocolStack: Sending STOP event
      [junit] STATE_TRANSFER: STOP event received
      [junit] FRAG2: STOP event received
      [junit] FC: STOP event received
      [junit] GMS: STOP event received
      [junit] VIEW_SYNC: STOP event received
      [junit] pbcast.STABLE: STOP event received
      [junit] UNICAST: STOP event received
      [junit] VERIFY_SUSPECT: STOP event received
      [junit] FD: STOP event received
      [junit] FD_SOCK.down called, event = STOP
      [junit] FD_SOCK: STOP event received
      [junit] MERGE2: STOP event received
      [junit] PING: STOP event received
      [junit] ProtocolStack: Received STOP event
      [junit] JChannel.disconnect(): stopped stack

      Here is a bad trace with IPv6:
      [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
      [junit] JChannel.disconnect(): got DISCONNECT_OK
      [junit] JChannel.disconnect(): stopping queue
      [junit] FD_SOCK.down called, event = STOP_QUEUEING
      [junit] JChannel.disconnect(): stopped queue
      [junit] JChannel.disconnect(): stopping stack
      [junit] ProtocolStack: Sending STOP event
      [junit] STATE_TRANSFER: STOP event received
      [junit] FRAG2: STOP event received
      [junit] FC: STOP event received
      [junit] GMS: STOP event received
      [junit] VIEW_SYNC: STOP event received
      [junit] pbcast.STABLE: STOP event received
      [junit] UNICAST: STOP event received
      [junit] VERIFY_SUSPECT: STOP event received
      [junit] FD: STOP event received
      [junit] FD_SOCK.down called, event = MSG
      [junit] FD_SOCK.down called, event = MSG
      [junit] FD_SOCK.down called, event = MSG
      [junit] FD_SOCK.down called, event = MSG

      If I remove FD_SOCK from the stack, the tests pass. If I include it, the STOP event never reaches FD_SOCK and the channel hangs as shown above.

      I also found that if I turn on the uphandler and downhandler threads in FD_SOCK, the problem disappears:
      ...
      <MERGE2 max_interval="30000" down_thread="false" up_thread="false" min_interval="10000"/>
      <FD_SOCK down_thread="true" up_thread="true"/>
      <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
      ...
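One possible reason the handler threads change the behaviour: with a handler thread, the caller only enqueues the event and returns, so it cannot be blocked by whatever the layer does while handling it. The sketch below is a simplified model under that assumption, not JGroups code. `DownThreadSketch`, `Layer` and `QueuedLayer` are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical model of direct vs. queued event delivery to one protocol layer. */
final class DownThreadSketch {

    /** Stand-in for a layer's down-event handler; not a JGroups type. */
    interface Layer {
        void down(String event);
    }

    /** Models down_thread="false": the caller's own thread runs the handler. */
    static void passDownDirect(Layer layer, String event) {
        layer.down(event); // if this blocks, the caller blocks with it
    }

    /** Models down_thread="true": events are queued and handled by a dedicated thread. */
    static final class QueuedLayer {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Thread handler;

        QueuedLayer(Layer layer) {
            handler = new Thread(() -> {
                try {
                    while (true) {
                        layer.down(queue.take()); // blocks until an event is available
                    }
                } catch (InterruptedException e) {
                    // interrupted by stop(): leave the loop and let the thread end
                }
            }, "down-handler");
            handler.setDaemon(true);
            handler.start();
        }

        /** Returns as soon as the event is enqueued, regardless of how long handling takes. */
        void passDown(String event) {
            queue.add(event);
        }

        void stop() {
            handler.interrupt();
        }
    }
}
```

In this model the queued variant decouples the caller from the handler, which would hide a blocking or lock-ordering problem rather than fix it.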

      Bela:
      Then it must be a locking issue, I'll take a look tomorrow. Or if you find the solution sooner, all the better!
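To check the locking theory from inside the test JVM, one option is to ask the JVM for lock cycles via the standard `java.lang.management` API (Java 6 or later). This is a diagnostic sketch only. `DeadlockProbe` is a hypothetical helper and is not part of JGroups or of the attached files.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reports threads the JVM considers deadlocked. */
final class DeadlockProbe {

    /** Returns one line per deadlocked thread, or an empty list if no lock cycle is found. */
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no cycle is detected
        List<String> out = new ArrayList<>();
        if (ids == null) {
            return out;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread terminated between the two calls
            }
            out.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return out;
    }
}
```

A thread that holds a lock while stuck in socket I/O does not form a lock cycle, so an empty result here would not rule out that kind of hang. A full thread dump is still needed in that case.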

      Attachments:
        1. example0.png (34 kB)
        2. example1.png (29 kB)
        3. threadDump.FD_SOCK.txt (15 kB)
        4. FD_SOCK.java (48 kB)

      Assignee: Bela Ban (rhn-engineering-bban)
      Reporter: Richard Achmatowicz (rachmato@redhat.com)